Mitch
09/03/2025, 2:24 PM
Nate
09/03/2025, 2:29 PM
Mitch
09/03/2025, 2:52 PM
Mitch
09/03/2025, 2:52 PM
Mitch
09/03/2025, 3:08 PM
Sep 03 14:05:41 sradock03 prefect[2672606]: <frozen runpy>:128: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
Nate
09/03/2025, 3:09 PM
Mitch
09/03/2025, 3:10 PM
Nate
09/03/2025, 3:12 PM
We are seeing increased latency when starting multiple Prefect jobs when submitting at the sub-minute level (or running every minute). In Prefect 1 we never experienced this issue, but we are now seeing more instances of conflicting runs starting anywhere from 1 to 3 minutes late. Has anyone had success in improving this? We have unlimited concurrency on our work pools.
That worker is running an old version 2.10.20
We are using prefect cloud
Marvin
09/03/2025, 3:12 PM
Marvin
09/03/2025, 3:22 PM
export PREFECT_WORKER_QUERY_SECONDS=5
export PREFECT_WORKER_PREFETCH_SECONDS=60
# optional
export PREFECT_WORKER_HEARTBEAT_SECONDS=10
- Then restart your worker:
prefect worker start -p <your-work-pool-name> -n <your-worker-name>
- Tradeoff: Lower poll intervals increase API chatter; keep it to a handful of workers.
- Run 2–3 workers against the same pool
- This parallelizes submissions when many runs trigger at the same time.
- Keep the same settings above on each worker.
- Verify there’s no other concurrency bottleneck
- In Cloud UI: check the deployment(s) for per-deployment concurrency limits and tag-based concurrency rules.
- Work pool shows “unlimited”, but tag/deployment constraints can still defer runs.
- Confirm current settings on the worker
- On the worker machine/container:
prefect config view --show-defaults
- Look for the “worker” section (query_seconds, prefetch_seconds, heartbeat_seconds). Share that snippet if you can.
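If you prefer to persist these settings in the Prefect profile the worker runs under, rather than exporting env vars on every host, a minimal sketch with the same values as above:
prefect config set PREFECT_WORKER_QUERY_SECONDS=5 PREFECT_WORKER_PREFETCH_SECONDS=60 PREFECT_WORKER_HEARTBEAT_SECONDS=10
prefect config view --show-sources   # confirm the values and whether they come from the profile, env vars, or defaults
Environment variables still take precedence over profile values, so remove any conflicting exports.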
ECS-on-EC2 specific improvements
- Keep capacity warm
- Ensure your EC2 Auto Scaling Group has headroom or a warm pool so new tasks don’t wait for instance scale-up.
- Use capacity providers with target capacity > 0 to keep a baseline of ready capacity.
- Reduce or skip image pulls
- Use the ECS agent setting ECS_IMAGE_PULL_BEHAVIOR=prefer-cached on your EC2 hosts so tasks reuse cached images (a command sketch follows this section).
- Keep images slim and immutable tags stable to avoid re-pulls; consider a small “image warmer” workload to keep your image cached on hosts.
- Reduce container boot time
- Minimize image size, entrypoint work, dependency install on start, and any external calls (e.g., boto credential fetch patterns) during cold start.
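For the ECS_IMAGE_PULL_BEHAVIOR change mentioned above, a minimal sketch, assuming ECS-optimized Amazon Linux 2 container instances (run on each EC2 host):
echo "ECS_IMAGE_PULL_BEHAVIOR=prefer-cached" | sudo tee -a /etc/ecs/ecs.config   # ECS agent config file
sudo systemctl restart ecs   # restart the ECS agent so the new pull behavior takes effect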
Upgrade recommendation
- Your worker is on 2.10.20, which is quite old. There have been numerous improvements to worker polling and config since then. I recommend:
- Short term: Upgrade to the latest 2.x worker to keep your existing deployments stable.
- Better: Move to Prefect 3.x workers when feasible; 3.x exposes clear worker settings like query_seconds and prefetch_seconds and includes reliability/performance improvements.
Marvin
09/03/2025, 3:22 PM
If you can share the output of prefect config view --show-defaults for the worker “worker.*” section, I can pinpoint whether the slippage is primarily worker polling, submission throughput, or ECS startup, and give you an exact set of changes.
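If it helps, one way to filter down to just the worker-related values (assuming a Unix shell):
prefect config view --show-defaults | grep -i worker   # prints the PREFECT_WORKER_* settings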
Mitch
09/03/2025, 3:37 PM
We use prefect worker start -t docker -p ny4-worker-a --install-policy if-not-present to start the worker. We are going to upgrade the Prefect version and add another worker (we have two as of now) to help reduce this. Here are the worker env variables:
PREFECT_WORKER_HEARTBEAT_SECONDS='30.0' (from defaults)
PREFECT_WORKER_PREFETCH_SECONDS='10.0' (from defaults)
PREFECT_WORKER_QUERY_SECONDS='10.0' (from defaults)
Marvin
09/03/2025, 3:37 PM
Marvin
09/03/2025, 3:39 PM
You're starting the worker with:
prefect worker start -t docker -p ny4-worker-a --install-policy if-not-present
And your current worker settings are defaults:
- PREFECT_WORKER_QUERY_SECONDS=10
- PREFECT_WORKER_PREFETCH_SECONDS=10
- PREFECT_WORKER_HEARTBEAT_SECONDS=30
Given your every-minute/sub-minute cadence and bursts at the same second, the biggest wins for reducing 1–3 minute slippage are:
1) Tighten polling and increase prefetch
- This reduces pickup jitter and gives the worker time to get containers started before the scheduled minute.
- Try:
export PREFECT_WORKER_QUERY_SECONDS=5
export PREFECT_WORKER_PREFETCH_SECONDS=60
export PREFECT_WORKER_HEARTBEAT_SECONDS=10
Then restart each worker:
PREFECT_WORKER_QUERY_SECONDS=5 PREFECT_WORKER_PREFETCH_SECONDS=60 PREFECT_WORKER_HEARTBEAT_SECONDS=10 \
prefect worker start -t docker -p ny4-worker-a -n ny4-worker-a-1
Notes:
- A larger prefetch window can start runs slightly early (by up to the prefetch value). If that’s not acceptable, use 20–30s instead of 60s.
- Lower polling means more frequent API calls; keep it reasonable (3–5s is a good target).
2) Add more workers for submission parallelism
- You’re adding another worker — good. I’d run at least 3–4 workers for a pool that fans out several runs at the same second. This spreads submissions and container launches.
- Keep the same PREFECT_WORKER_* settings on each worker.
3) Make Docker startup as fast as possible
- Ensure your deployment’s Docker infrastructure uses no/limited pulls at runtime:
- Set image_pull_policy to IF_NOT_PRESENT (or NEVER if you manage cache out-of-band) so each run doesn’t re-pull.
- Pre-pull the image on the worker host(s) and keep it warm:
docker pull your-registry/your-image:tag
- Keep tags immutable so the Docker cache is actually reusable.
- Keep images slim and reduce boot work in entrypoints to cut cold-start time.
4) Stagger schedules to avoid a single-second “thundering herd”
- If multiple deployments fire at :00, add 1–10s offset or jitter to spread load across the minute.
5) Double-check there aren’t hidden concurrency limits
- The pool shows “unlimited”, but per-deployment or tag concurrency limits can still delay runs. Give those a quick check in the UI; a few CLI checks follow below.
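For point 5, a few CLI equivalents of those UI checks (a sketch; flag names follow recent 2.x/3.x CLIs, and the deployment name is a placeholder):
prefect concurrency-limit ls                                  # tag-based concurrency limits
prefect work-pool inspect "ny4-worker-a"                      # confirm the pool itself really is unlimited
prefect work-queue ls --pool "ny4-worker-a"                   # per-queue limits can also hold runs back
prefect deployment inspect "<flow-name>/<deployment-name>"    # check deployment-level settings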
If you’re upgrading Prefect:
- Short term: upgrading your 2.10.20 Docker workers to a recent 2.x will help with worker reliability and settings behavior (a quick command sketch follows the doc links below).
- Medium term: 3.x has ongoing improvements and clearer worker settings. If/when you move to 3.x, remember:
- Use prefect deploy (not prefect deployment build) and flow.from_source(...).deploy(...) in code.
- Settings reference: 3.x Settings; worker healthchecks: 3.x Worker healthchecks
- 2.x work pools/workers docs (for context): 2.x Work pools and workers
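For the short-term 2.x upgrade above, a minimal sketch for the worker hosts, assuming a pip-managed install (pin below 3 until you plan the 3.x move):
pip install --upgrade "prefect>=2.20,<3"   # latest 2.x without accidentally jumping to 3.x
prefect version                            # confirm what the worker host is running now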
Would you be okay with runs starting slightly early (e.g., up to 30–60s) if it eliminates the late starts? If yes, we’ll keep PREFETCH at 60; if not, we’ll tune it to ~20–30s and lean more on worker count and image caching.
If you can share:
- The deployment’s Docker image and pull policy settings
- Whether multiple deployments hit the same second
- A representative run where “Scheduled → Submitted/Running” shows the delay
…I can fine-tune the exact values (prefetch/polling and worker count) for your load profile.
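For the representative run, something like the following (a sketch; <flow-run-id> is a placeholder) shows the state timestamps and logs needed to separate submission lag from container start lag:
prefect flow-run inspect <flow-run-id>   # state name/type and timestamps for the run
prefect flow-run logs <flow-run-id>      # engine/worker log lines around startup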
Nate
09/03/2025, 3:40 PM
PREFECT_WORKER_QUERY_SECONDS is the right move here
Nate
09/03/2025, 3:41 PM
Nate
09/03/2025, 3:42 PM
Mitch
09/03/2025, 3:44 PM
Nate
09/03/2025, 3:46 PM
> I think upgrading version then adding one more worker we can see if that helps?
I'm not sure that adding a worker would help. Is the worker having trouble submitting containers in time, or are the containers slow to begin their flow run processes? If the latter, adding a worker doesn't seem like it'd help.
> Could be some latency with the git clone as well through VPN and such which contribute to initialization?
Yeah, could be. Hmm.
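One rough way to quantify the clone cost from a worker host over the same VPN path (a sketch; the repo URL is a placeholder):
time git clone --depth 1 <repo-url> /tmp/clone-timing-test   # compare this against the observed start delay
rm -rf /tmp/clone-timing-test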
Mitch
09/03/2025, 4:10 PM
{
"id": "068b84aa-c31a-7d34-8000-58bae7094837",
"account": redacted,
"event": "prefect.flow-run.Late",
"occurred": "2025-09-03T14:03:24.194Z",
"payload": {
"intended": {
"from": "SCHEDULED",
"to": "SCHEDULED"
},
"initial_state": {
"type": "SCHEDULED",
"name": "Scheduled",
"message": "Flow run scheduled",
"timestamp": "2025-09-03T14:00:13.348186+00:00",
"id": "068b849e-d591-7f36-8000-99fd248e2de8"
},
"validated_state": {
"type": "SCHEDULED",
"name": "Late",
"timestamp": "2025-09-03T14:03:24.194071+00:00",
"id": "068b84aa-c31a-7d34-8000-58bae7094837"
}
},
"received": "2025-09-03T14:03:24.437Z",
"related": [
{
"prefect.resource.id": "prefect.flow.7097c204-5aa4-467c-ab0b-00b2bde718c0",
"prefect.resource.role": "flow",
"prefect.resource.name": "executions-strategy-intraday"
},
{
"prefect.resource.id": "prefect.deployment.965785f9-bcb5-436f-bca5-591aecb78f01",
"prefect.resource.role": "deployment",
"prefect.resource.name": "executions-strategy-intraday",
"prefect.deployment.version-id": "068b7425-f2b4-7bf5-8000-2cbc10c14534",
"prefect.deployment.version-type": "prefect:simple",
"prefect.deployment.version": ""
},
{
"prefect.resource.id": "prefect.work-queue.8868d7c0-be7c-432e-bbfe-6c84ab546d5c",
"prefect.resource.role": "work-queue",
"prefect.resource.name": "default"
},
{
"prefect.resource.id": "prefect.work-pool.a39b04c4-0c5f-46c6-aaf6-2d1b8029206f",
"prefect.resource.role": "work-pool",
"prefect.resource.name": "ny4-worker-a",
"prefect.work-pool.type": "docker"
},
{
"prefect.resource.id": "prefect.tag.auto-scheduled",
"prefect.resource.role": "tag"
},
{
"prefect.resource.id": "prefect.schedule.a016c5aa-7b0c-4b50-8505-4a794b3df5c6",
"prefect.resource.role": "creator",
"prefect.resource.name": "CronSchedule"
}
],
"resource": {
"prefect.resource.id": "prefect.flow-run.068b849e-d591-7e93-8000-c30ec2d987e1",
"prefect.resource.name": "rainbow-beetle",
"prefect.run-count": "0",
"prefect.state-message": "",
"prefect.state-name": "Late",
"prefect.state-timestamp": "2025-09-03T14:03:24.194071+00:00",
"prefect.state-type": "SCHEDULED"
},
"workspace": redacted
}
Mitch
09/03/2025, 4:11 PM
Mitch
09/09/2025, 1:54 PM
Nate
09/09/2025, 1:59 PM
Nate
09/09/2025, 2:00 PM
Mitch
09/09/2025, 2:05 PM