# ask-marvin
r
@Marvin i have so many runs in Late state, and I have 8 Prefect workers of type process with a worker concurrency limit of 10 defined. Even then, most of the time I see just about 1 or 2 flow runs get picked up in a worker.
m
thought for 308.6 seconds
Thanks for the details — assuming you're on Prefect 3.x. Seeing lots of flow runs in Late while workers only pick up 1–2 at a time usually points to a concurrency gate upstream of the worker or to polling/capacity settings. Here's a quick checklist to find the bottleneck.

**Top things to check**
- Work pool concurrency limit
  - If the pool has a low concurrency limit, all workers in that pool will collectively be capped at that number.
  - Inspect it:
    ```
    prefect work-pool inspect "<your-pool>"
    ```
  - If needed, raise it:
    ```
    prefect work-pool set-concurrency-limit "<your-pool>" 80
    ```
  - Docs: Work pools
- Deployment-level `concurrency_limit`
  - A deployment can cap its own concurrent runs; workers won't exceed this even if they have capacity.
  - Inspect a deployment and look for `concurrency_limit` in the output:
    ```
    prefect deployment inspect "flow-name/deployment-name"
    ```
- Global/tag-based concurrency limits
  - Tag-based/global limits can throttle both flow and task runs if tags on your runs are bound to a limit.
  - List limits:
    ```
    prefect concurrency-limit ls
    ```
  - Inspect a specific tag:
    ```
    prefect concurrency-limit inspect "<tag>"
    ```
  - Docs: Global concurrency limits, Tag-based concurrency limits
- Worker polling interval and heartbeat
  - If workers poll infrequently, pickup will be slow and Late runs can accumulate. Check these env vars on your worker processes:
    - `PREFECT_WORKER_QUERY_SECONDS` (how often workers poll; default ~15s)
    - `PREFECT_WORKER_HEARTBEAT_SECONDS` (default ~30s)
  - Docs: Worker health checks
- Work queues and pausing
  - Ensure workers are pointing at the right pool/queues and none are paused.
  - If you're using `--work-queue`, confirm those queues actually have the Late runs.
- Worker capacity vs machine capacity
  - Process workers execute flows locally; even if concurrency is 10, the host may be CPU/RAM constrained. Check worker logs for failures to start or repeated retries; also monitor system resource usage.

**What I'd look at next (quick data to share)**
1) Output of:
   ```
   prefect work-pool inspect "<your-pool>"
   prefect concurrency-limit ls
   ```
2) One example deployment's details:
   ```
   prefect deployment inspect "flow-name/deployment-name"
   ```
3) How you're starting workers (the exact command and any env vars like `PREFECT_WORKER_QUERY_SECONDS`).
4) A few lines of worker logs around polling/claiming. If possible, increase verbosity:
   ```
   PREFECT_LOGGING_LEVEL=DEBUG prefect worker start -p "<your-pool>"
   ```

**Common resolutions**
- Pool limit was low — raise it to match desired aggregate capacity.
- Deployment `concurrency_limit` was set (e.g., 2) — increase or remove it.
- A tag (e.g., "prod") had a global concurrency limit — increase or reset it:
  ```
  prefect concurrency-limit inspect "prod"
  prefect concurrency-limit reset "prod"
  ```
- Workers were polling too slowly — reduce `PREFECT_WORKER_QUERY_SECONDS` (e.g., to 5–10s) to speed pickup.
- Workers pointed at the wrong queue(s), or queues were paused — fix targeting or resume.

If you can share the inspect outputs and your worker start command, I can pinpoint the constraint quickly.
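If it helps, the same checks can be scripted. Here's a minimal sketch using the Prefect Python client, assuming Prefect 3.x and the standard client schemas; `"<your-pool>"` is a placeholder for your real work pool name:
```python
# Sketch of the checks above, assuming Prefect 3.x client APIs.
# "<your-pool>" is a placeholder; swap in your real work pool name.
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)


async def diagnose(pool_name: str) -> None:
    async with get_client() as client:
        # 1) How big is the Late backlog right now?
        late_runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(name=FlowRunFilterStateName(any_=["Late"]))
            ),
            limit=200,
        )
        print(f"Late flow runs (first 200): {len(late_runs)}")

        # 2) Work pool concurrency limit (None means unlimited).
        pool = await client.read_work_pool(work_pool_name=pool_name)
        print(f"Work pool '{pool.name}' concurrency limit: {pool.concurrency_limit}")

        # 3) Tag-based concurrency limits that could be throttling runs.
        for cl in await client.read_concurrency_limits(limit=50, offset=0):
            print(f"Tag limit: {cl.tag} -> {cl.concurrency_limit}")


asyncio.run(diagnose("<your-pool>"))
```
It prints the current Late backlog, the pool's concurrency limit, and any tag-based limits in one pass.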
r
@Marvin Deployment concurrency isn't set, no tag concurrency is set, work pool concurrency is 80, polling frequency is default
m
thought for 203.1 seconds
Thanks — with deployment/tag limits off, pool concurrency at 80, and default polling, the next most common culprits are:

1) Per-worker concurrency not actually set on the worker
   - Process workers have their own max concurrent flow runs. Make sure each worker is started with a limit, e.g.:
     ```
     prefect worker start -p "<your-pool>" --limit 10
     ```
   - If `--limit` isn't being passed explicitly, the worker may not be applying the concurrency you expect. Can you share the exact command you use to start the workers?

2) Work-queue concurrency or queue targeting
   - Even if the pool is 80, a specific work queue can have its own concurrency limit, or you might be pointing workers at queues with very few runs.
   - Check queues and their limits:
     ```
     prefect work-queue ls
     prefect work-queue inspect <queue-id>
     ```
   - If you're using `--work-queue` on workers, confirm these are the queues where your Late runs actually are. Otherwise omit `--work-queue` so workers poll all queues in the pool.

3) Worker polling/health vs claim rate
   - Default polling is fine for most cases, but when you're trying to drain a large Late backlog, reducing the interval can help with pickup speed:
     ```
     PREFECT_WORKER_QUERY_SECONDS=5 prefect worker start -p "<your-pool>" --limit 10
     ```
   - Docs: Worker health checks

4) Host resource limits
   - Process workers spawn one OS process per flow run. If the host is CPU/RAM constrained or has a low `ulimit -n`, they may only practically run 1–2 concurrently. Check system metrics and worker logs for failures to start/claim more runs.

**Quick diagnostics to run**
- Verify per-worker limit and queue targeting in the worker start commands you use.
- Inspect queues for unexpected concurrency limits or pausing:
  ```
  prefect work-queue ls
  prefect work-queue inspect <queue-id>
  ```
- Turn on debug logs for one worker to see claim decisions:
  ```
  PREFECT_LOGGING_LEVEL=DEBUG prefect worker start -p "<your-pool>" --limit 10
  ```

If you share:
- The exact worker start command you're using
- Output of `prefect work-queue ls` and an `inspect` for the queue(s)
- A short DEBUG log snippet around polling/claiming
…I can pinpoint why only 1–2 runs are being picked up and suggest the precise fix.

Links for reference:
- Work pools: Work pools
- Global/tag concurrency: Global concurrency limits and Tag-based concurrency limits
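For the work-queue checks in point 2, here's a minimal sketch with the Python client, assuming Prefect 3.x; the pool name is a placeholder. It lists each queue's paused flag and concurrency limit:
```python
# Sketch of the queue checks, assuming Prefect 3.x.
# "<your-pool>" is a placeholder; swap in your real work pool name.
import asyncio

from prefect.client.orchestration import get_client


async def show_queues(pool_name: str) -> None:
    async with get_client() as client:
        # Each queue can be paused or carry its own concurrency limit,
        # independent of the pool-level limit of 80.
        for queue in await client.read_work_queues(work_pool_name=pool_name):
            print(
                f"{queue.name}: paused={queue.is_paused}, "
                f"concurrency_limit={queue.concurrency_limit}"
            )


asyncio.run(show_queues("<your-pool>"))
```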
r
@Marvin Weirdly, on rare occasions I do see more than 7-8 running on the same worker, but most of the time it is less.
m
thought for 66.6 seconds
Got it — that pattern (sometimes 7–8 on one worker, most of the time fewer) usually means the worker isn't consistently seeing enough eligible runs in the queues it polls at the moment it polls. A few things to verify that often explain the variability:

**What to check next**
- Are the Late runs actually in the queues your workers are polling?
  - If you start workers with `--work-queue`, they'll only see runs in those queues. If your backlog is spread across multiple queues, most polls won't find many eligible runs.
  - Quick checks:
    ```
    prefect work-queue ls
    prefect work-queue preview -p "<your-pool>"
    prefect work-queue inspect <queue-id>
    prefect work-queue read-runs <queue-id>
    ```
  - If unsure, try starting a worker without `--work-queue` so it polls all queues in the pool:
    ```
    prefect worker start -p "<your-pool>" --limit 10
    ```
- Per-worker limit and actual worker start command
  - Confirm each worker is actually started with `--limit 10` (or a pool variable that sets the same). Without it, concurrency may be much lower or inconsistent.
  - Please share the exact worker start command you use.
- Polling cadence vs. other workers claiming first
  - With default polling, a worker may miss windows where many runs become eligible if other workers poll first. You can reduce the poll interval temporarily to see if pickup improves:
    ```
    PREFECT_WORKER_QUERY_SECONDS=5 prefect worker start -p "<your-pool>" --limit 10
    ```
  - Docs: Worker health checks
- Queue-level concurrency or pausing
  - Pools can be wide open (80), but individual queues can still be limited or paused. Check:
    ```
    prefect work-queue ls
    prefect work-queue inspect <queue-id>
    ```
  - If limits exist on specific queues, raise or clear them.
- Host resource pressure
  - Process workers spawn one process per flow run. If the host is CPU/RAM constrained, you'll see effective concurrency drop below your `--limit`. Monitor CPU/memory and worker logs while the worker is "stuck" at 1–2.

**Highly informative debug logs**
Run one worker at DEBUG to see why it isn't claiming more (e.g., "no eligible runs", "max concurrency reached", etc.):
```
PREFECT_LOGGING_LEVEL=DEBUG prefect worker start -p "<your-pool>" --limit 10
```

If you can share:
- Your worker start command(s)
- Output of `prefect work-queue ls` and `prefect work-queue preview -p "<your-pool>"`
- A short DEBUG log snippet around polls and claims
…I can pinpoint whether this is queue targeting, poll timing, resource limits, or something else.

Links for reference:
- Work pools and queues: Work pools
- Concurrency limits: Global concurrency limits and Tag-based concurrency limits
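To check the queue-targeting theory directly, here's a minimal sketch, assuming Prefect 3.x, that groups the Late backlog by work queue:
```python
# Sketch that groups Late runs by work queue, assuming Prefect 3.x.
import asyncio
from collections import Counter

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)


async def late_runs_by_queue() -> None:
    async with get_client() as client:
        late_runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(name=FlowRunFilterStateName(any_=["Late"]))
            ),
            limit=500,
        )
        # If the backlog sits in queues the workers don't poll, most polls
        # will find few eligible runs, which matches the variability you see.
        counts = Counter(run.work_queue_name or "<no queue>" for run in late_runs)
        for queue, count in counts.most_common():
            print(f"{queue}: {count} late run(s)")


asyncio.run(late_runs_by_queue())
```
If most Late runs sit in queues your workers aren't polling, that would explain why most polls find only 1–2 eligible runs.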
r
@Marvin sometimes I see flows not getting picked up for 2 minutes
m
thought for 55.4 seconds
A 2-minute gap typically points to the worker's polling/backoff behavior rather than a hard concurrency gate.

**What's likely happening**
When a worker polls and finds no (or few) eligible runs in the queues it's watching, it backs off (exponential with jitter). After a few "empty" polls the sleep can reach ~1–2 minutes before the next claim attempt. This creates the "sometimes I get 7–8, but often fewer and occasionally nothing for ~2 minutes" pattern—especially if your Late backlog is spread across multiple queues or becomes eligible in bursts.

**How to confirm**
- Run one worker with DEBUG logging and watch for messages about no eligible runs and sleep/backoff:
  ```
  PREFECT_LOGGING_LEVEL=DEBUG prefect worker start -p "<your-pool>" --limit 10
  ```
  You'll see logs like "No eligible flow runs found... sleeping for X seconds" if backoff is the cause.
- Preview and read runs in the queues to verify eligibility at the time of polling:
  ```
  prefect work-queue preview -p "<your-pool>"
  prefect work-queue ls
  prefect work-queue inspect <queue-id>
  prefect work-queue read-runs <queue-id>
  ```

**Practical mitigations**
- Reduce the polling interval to cut the worst-case quiet period:
  ```
  PREFECT_WORKER_QUERY_SECONDS=5 prefect worker start -p "<your-pool>" --limit 10
  ```
  Docs: Worker health checks
- Ensure each worker has a per-worker limit set (you mentioned 10; just double-check the actual start command):
  ```
  prefect worker start -p "<your-pool>" --limit 10
  ```
- Broaden queue targeting if you're using `--work-queue` and your backlog is split across queues. Consider removing `--work-queue` so workers pull from all queues in the pool, or consolidate runs into fewer queues.
- Optional: increase prefetch to catch runs slightly before their scheduled time (helps with bursty schedules):
  ```
  prefect worker start -p "<your-pool>" --limit 10 --prefetch-seconds 60
  ```
- Check host resources. If the machine is CPU/RAM constrained, a worker may struggle to actually start more processes even if it claims them.

If you can share:
- The exact worker start command
- Output of `prefect work-queue ls` and `prefect work-queue preview -p "<your-pool>"`
- A short DEBUG log snippet showing a 2-minute gap
…I can confirm whether backoff is the root cause or if something else (queue limits/targeting or resource pressure) is in play.
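To quantify those gaps, here's a minimal sketch, assuming Prefect 3.x and that flow runs expose `expected_start_time` and `start_time`. It reports how long recent runs waited past their expected start:
```python
# Sketch that measures pickup delay for recent runs, assuming Prefect 3.x,
# where flow runs expose expected_start_time and start_time.
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)
from prefect.client.schemas.sorting import FlowRunSort


async def pickup_delays() -> None:
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    name=FlowRunFilterStateName(any_=["Running", "Completed"])
                )
            ),
            sort=FlowRunSort.START_TIME_DESC,
            limit=50,
        )
        for run in runs:
            if run.start_time and run.expected_start_time:
                delay = (run.start_time - run.expected_start_time).total_seconds()
                print(f"{run.name}: picked up {delay:.0f}s after expected start")


asyncio.run(pickup_delays())
```
Delays consistently near your polling interval suggest poll cadence; much larger, irregular delays suggest queue targeting or backoff.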