# marvin-ai
@Marvin I seem to be having a problem where jobs get stuck in `Pending` and `Running`. These stuck runs count against the concurrency limit, which is set to 30, so with 5 stuck in Pending and 5 stuck in Running we can only have 20 actually running flows. That alone would be fine because I've set up automations to move these to Cancelled (which we can rerun later); however, it seems to be a recurring problem. Immediately after cleaning up the pending jobs, the next 5 or so jobs get stuck in Pending as well. Additionally, the concurrency limit seems to be blocked by something else I can't account for. For example, I'll filter for Running, Pending, and Cancelling and see 24 flows. What happened to the other 6 slots available under the concurrency limit? I've been increasing the concurrency limit, which kicks the problem down the road, but the problem is gradually growing. Restarting workers doesn't seem to have an impact. Any direction would be much appreciated.
Marvin
thought for 489.8 seconds
Thanks for the detailed description. A few things can cause this pattern in Prefect 3.x, so let's narrow it down with some concrete checks.

First, a quick model of what can block flow execution:
- Work pool concurrency limit: caps total concurrent flow runs in that pool.
- Worker capacity: each worker has its own concurrency; the effective capacity is min(pool limit, sum of worker limits).
- Global (tag-based) concurrency limits: can hold runs/tasks in an AwaitingConcurrencySlot state.
- Stale runs: runs stuck in Pending or Running can "hold" resources until cleared.

Given your symptoms (some runs stuck in Pending/Running, totals not reaching the pool limit, and the issue recurring), here's a prioritized checklist.

**1) Confirm worker capacity vs. pool limit**
- If your work pool limit is 30 but the sum of your active workers' concurrency is 24, you'll only ever see 24 running/pending at once, regardless of the pool limit.
- Check your work pool details (a Python sketch follows the CLI snippets below):
```
prefect work-pool inspect "<your-pool>"
```
- Check each worker's configured concurrency (the process/k8s/docker workers have a concurrency/limit setting when you start them). If you're starting workers manually, double-check the options you pass to `prefect worker start` and your base job template. For reference:

```
prefect worker start --help
```
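If you prefer to check the pool configuration programmatically, here is a minimal sketch using the Prefect 3.x Python client. The pool name `"my-pool"` is a placeholder, and the fields printed (`type`, `concurrency_limit`, `is_paused`) are my assumption about what is most useful to look at, so adapt as needed.

```python
# Minimal sketch (assumes Prefect 3.x): read a work pool and print its
# configured concurrency limit. "my-pool" is a placeholder name.
import asyncio

from prefect import get_client


async def show_pool_limit(pool_name: str = "my-pool") -> None:
    async with get_client() as client:
        pool = await client.read_work_pool(work_pool_name=pool_name)
        # concurrency_limit is None when the pool is unlimited.
        print(f"pool={pool.name} type={pool.type}")
        print(f"concurrency_limit={pool.concurrency_limit} paused={pool.is_paused}")
        # Per-worker limits are set when you run `prefect worker start`,
        # not on the pool object, so tally those separately.


if __name__ == "__main__":
    asyncio.run(show_pool_limit())
```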
If you can share how many workers you have, their pool, type, and any concurrency/limit you pass when starting them, I can help compute your effective capacity.

**2) Inspect the "Pending" flow runs to learn why they're pending**

Pick one stuck flow run and inspect it:
```
prefect flow-run inspect <FLOW_RUN_ID>
```

Look closely at:
- the state message (it will often say "awaiting infrastructure", "awaiting concurrency slot", "no available workers", etc.)
- the work pool / work queue
- the deployment and infrastructure block metadata (e.g., pod/container provisioning failures)
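If there are many stuck runs, it can be faster to pull their state messages in one pass. A rough sketch, assuming the filter classes in `prefect.client.schemas.filters`; double-check the field names against your installed Prefect version.

```python
# Sketch (assumes Prefect 3.x): list runs currently in the "Pending" state
# and print their state messages to see why they are stuck.
import asyncio

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)


async def show_pending_reasons() -> None:
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(name=FlowRunFilterStateName(any_=["Pending"]))
            ),
            limit=100,
        )
        for run in runs:
            message = run.state.message if run.state else None
            print(f"{run.id} {run.name}: {message!r}")


if __name__ == "__main__":
    asyncio.run(show_pending_reasons())
```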
**3) Check for global (tag-based) concurrency limits holding slots**

A common hidden blocker is a tag-based concurrency limit that's exhausted or has leaked slots:
```
prefect concurrency-limit ls
prefect concurrency-limit inspect <TAG>
```
Note: `inspect` shows `active_slots` with the TaskRun IDs currently holding a slot. If you see slots held by very old or cancelled runs, you can free them:
```
prefect concurrency-limit reset <TAG>
```

Docs: Global concurrency limits (3.x)
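For context on where these limits bite: tag-based concurrency limits apply to task runs that carry the matching tag, and each such task run holds a slot while it executes. A minimal sketch, assuming a limit already exists for a hypothetical `database` tag (e.g., created with `prefect concurrency-limit create database 10`); the task and flow names are placeholders.

```python
# Sketch: tasks tagged "database" compete for slots in the tag-based
# concurrency limit of the same name. If the limit is exhausted (or slots
# have leaked), new task runs wait in AwaitingConcurrencySlot.
from prefect import flow, task


@task(tags=["database"])
def write_row(i: int) -> int:
    # Placeholder for real work against the shared resource.
    return i


@flow
def ingest() -> None:
    for i in range(100):
        write_row.submit(i)


if __name__ == "__main__":
    ingest()
```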
**4) Look for runs in "Awaiting Concurrency Slot"**

Some runs may not show up in your filter when you only include Running/Pending/Cancelling. In the UI, try including all states, or specifically look for "Awaiting Concurrency Slot". If you see many runs there, they are waiting on a tag-based concurrency limit rather than the work pool limit.

There is a known issue on Prefect 3.1.4 where runs in AwaitingConcurrencySlot may not be picked up when a slot frees up:
- Runs in AwaitingConcurrencySlot not picked up (3.1.4)

What version are you on? If it's 3.1.4, upgrading to the latest 3.x is recommended.

**5) "Running" but stuck: ensure failed infra/run detection is working**
- If a container/pod crashes or a process is killed, runs can linger in Running because the system never observed a clean exit. This can happen if infra fails and the engine can't report back.
- Consider adding timeouts on your flows/deployments to fail fast and release slots (see the sketch below).
- Ensure your workers are healthy and can start new infrastructure reliably. Worker healthcheck docs (for monitoring workers): Worker healthchecks
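On the timeout point, a minimal sketch: `timeout_seconds` is a real parameter of the `flow` decorator, while the 2-hour value and the flow name are placeholders to adapt to your workloads.

```python
# Sketch: a flow-level timeout moves a run that exceeds the limit into a
# terminal TimedOut state, so it stops counting against the work pool limit
# instead of sitting in Running indefinitely.
from prefect import flow


@flow(timeout_seconds=2 * 60 * 60)  # 2 hours; tune to your longest expected run
def nightly_sync() -> None:
    ...  # your existing flow body
```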
**6) Reconcile the "missing 6"**

When you see 24 total across Running/Pending/Cancelling but expect 30:
- Most likely: total worker capacity is ~24 even though the pool limit is 30.
- Also possible: some runs are in other states (e.g., AwaitingConcurrencySlot, AwaitingRetry, Late) or tied up by tag-based limits.
- Use the pool inspect and concurrency-limit inspect commands above, and broaden your UI filter to include the Awaiting states. If helpful, run:
```
prefect flow-run ls --limit 200
```

then filter in the UI to see exact counts by state and pool. A client-side tally by state name is also sketched below.
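If you want that tally without clicking through the UI, here is a rough sketch using the client filters. The state types listed and the 200-run limit mirror the CLI call above; the filter class and field names are assumptions to verify against your Prefect version.

```python
# Sketch (assumes Prefect 3.x): count non-terminal flow runs by state name.
# Late, AwaitingRetry, and AwaitingConcurrencySlot are SCHEDULED-type states,
# so they are easy to miss when filtering only on Pending/Running.
import asyncio
from collections import Counter

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType


async def tally_active_runs() -> None:
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(
                        any_=[
                            StateType.SCHEDULED,
                            StateType.PENDING,
                            StateType.RUNNING,
                            StateType.CANCELLING,
                        ]
                    )
                )
            ),
            limit=200,
        )
        counts = Counter(run.state_name for run in runs)
        for state_name, count in counts.most_common():
            print(f"{state_name}: {count}")


if __name__ == "__main__":
    asyncio.run(tally_active_runs())
```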
**7) Automations**

Your "cancel stuck" automations are smart. You can also:
- Trigger when a run has been Running for > N minutes to cancel it or mark it crashed.
- Trigger when a flow is Pending for > N minutes to notify or cancel.

If you can share:
- Prefect version (and Cloud vs. OSS)
- Work pool type and name
- Pool concurrency limit
- Number of workers and the concurrency/limit for each
- Whether you're using any tag-based concurrency limits (and which tags)
- A sample `prefect flow-run inspect <id>` from a stuck Pending run

…I can point to the specific root cause and the exact fix.