# ask-marvin
@Marvin I'm finding I have a lot (400+) of jobs that are stuck in PENDING from various dates over the last 3 weeks. How would those impact the performance of the system? I currently see that they eat up concurrency slots, as no jobs will start even with our 300 job limit.
thought for 830.6 seconds
Short answer: In Prefect 3.x, PENDING flow runs do not consume work pool or global/task concurrency slots. Only actively executing runs (e.g., RUNNING/SUBMITTED) should count. A large backlog of PENDING runs mainly creates UI noise and a bit of scheduler overhead, but it should not block new runs by itself. If nothing is starting despite a 300 limit, something else is likely blocking (e.g., workers offline, a work pool/deployment/global/tag concurrency limit at capacity, or stuck leases).

Quick checks to diagnose:
- Workers online and polling:
  - In the UI, open your Work Pool > Workers and check "Last heartbeat".
  - CLI:
```bash
prefect work-pool inspect "<your-pool>"
```
- Work pool concurrency: confirm the pool has the expected limit and open slots:
```bash
prefect work-pool inspect "<your-pool>"
# Adjust if needed
prefect work-pool set-concurrency-limit "<your-pool>" 300
```
- Docs: Work pools
- Global concurrency limits (can block all work across pools):
```bash
prefect global-concurrency-limit ls
prefect global-concurrency-limit inspect "<name>"
```
- Docs: Global concurrency limits
- Tag-based task/flow concurrency limits (stuck slots):
```bash
prefect concurrency-limit ls
prefect concurrency-limit inspect "<tag>"
# if you find orphaned slots from crashed runs
prefect concurrency-limit reset "<tag>"
```
- Are runs actually PENDING, or are there many RUNNING/SUBMITTED elsewhere?
```bash
prefect flow-run ls --state RUNNING --limit 1000
prefect flow-run ls --state SUBMITTED --limit 1000
```
If 300 runs are RUNNING/SUBMITTED, your limit is being used by those.

Clearing a backlog of stale PENDING runs:
- Cancel them (a safe and reversible state change):
  - UI: filter by state = PENDING and cancel in bulk (if applicable in your UI).
  - CLI (one-off IDs):
```bash
prefect flow-run cancel <flow-run-id>
```
- Docs: Cancel workflows
- If you need to clean up in bulk from the CLI, you can list and then cancel by ID. Since the CLI does not (yet) provide a native bulk cancel, most teams either use the UI or a short script with the API client; a sketch of such a script for cancelling PENDING runs older than X days is included at the end of this message.

Why you may see "no jobs starting":
- Workers are offline: the scheduler will queue runs as PENDING, but nothing can start until a worker polls.
- Concurrency limits are fully consumed: the work pool limit, a deployment-level limit (if set), a global concurrency limit, or a tag-based limit could be at capacity.
- Stuck leases: orphaned slots from previously crashed runs can keep concurrency "full" until reset.
- Work pool paused: verify the pool is not paused.

Next steps / info I need to help pinpoint this:
- Which concurrency "300 job limit" are you referring to: work pool concurrency, global concurrency, or a tag/deployment limit?
- What worker type(s) are you running (Kubernetes, Docker, Process), and are they showing as online with recent heartbeats?
- Do you have any global or tag-based concurrency limits configured?
- Are the 400+ runs truly PENDING, or are there many RUNNING/SUBMITTED runs as well?

If you share the work pool name(s) and which limit you're looking at, I can give you exact commands to inspect and, if needed, a safe script to cancel just the stale PENDING runs.
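For reference, here is a minimal sketch of that kind of cleanup script, assuming Prefect 3.x with your API connection already configured; the 7-day cutoff, the 500-run page size, and the choice to cancel (rather than delete) runs are assumptions to adjust for your setup:

```python
import asyncio
from datetime import datetime, timedelta, timezone

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled

# Assumption: anything PENDING for more than 7 days is considered stale.
CUTOFF = datetime.now(timezone.utc) - timedelta(days=7)


async def cancel_stale_pending_runs() -> None:
    async with get_client() as client:
        # Read PENDING flow runs (one page; re-run or paginate for very large backlogs).
        pending = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.PENDING])
                )
            ),
            limit=500,
        )
        for run in pending:
            # Only cancel runs created before the cutoff.
            if run.created and run.created < CUTOFF:
                await client.set_flow_run_state(
                    flow_run_id=run.id, state=Cancelled(), force=True
                )
                print(f"Cancelled {run.id} (created {run.created})")


if __name__ == "__main__":
    asyncio.run(cancel_stale_pending_runs())
```

Run it once to clear the backlog; if you have more stale PENDING runs than the page size, run it again or raise the limit.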
@Marvin I don't think you are correct about the pending jobs: after I deleted them, all the flows started running again. What could explain this?
thought for 132.9 seconds
Thanks for the direct feedback, and you're right that there are cases where a backlog of PENDING runs can effectively block new work. A few common explanations for what you observed:

1) Head-of-line blocking in the work queue
- Workers fetch runs in order (oldest first). If the oldest PENDING runs can't advance (e.g., blocked by a tag/global concurrency limit, bad infra config, or a pool with no matching workers), the worker keeps bumping into those first and never gets to newer runs that could start.
- Deleting those stale PENDING runs exposes runnable work further down the queue, so things "unclog" and start immediately.

2) Claimed-but-stuck runs consuming pool capacity
- When a worker claims runs from a work pool, the pool will not hand out more work for those slots until those runs progress or the lease expires. If something fails before SUBMITTED/RUNNING (infra error, worker crash), those runs can stay PENDING yet still tie up the pool's available capacity for a while.
- Deleting those runs clears the blockage instantly and frees capacity.

3) Concurrency limits at capacity + old runs keep being retried
- Global/tag-based concurrency can be "full" due to stuck leases from crashed runs. If many old PENDING runs share that tag, they'll keep getting picked first and rejected by the server, starving other work.
- Resetting the tag/global limit or removing the stale PENDING runs allows new runs to proceed.

How to verify what's happening in your environment:
- Inspect the pool and workers (capacity, heartbeats, queue preview):
```bash
prefect work-pool inspect "<pool>"
prefect work-pool preview "<pool>" --hours 24
```
Docs: Work pools
- See what's actually running vs. pending (a Python variant of this check is sketched at the end of this message):
```bash
prefect flow-run ls --state RUNNING --state SUBMITTED --limit 1000
prefect flow-run ls --state PENDING --limit 1000
```
- Check for global or tag-based concurrency limits that might be full or stuck:
```bash
prefect global-concurrency-limit ls
prefect concurrency-limit ls
# Inspect a tag and reset it if you find orphaned slots
prefect concurrency-limit inspect "<tag>"
prefect concurrency-limit reset "<tag>"
```
Docs: Global concurrency limits
- Worker health:
  - In the UI: Work Pool > Workers; verify recent heartbeats.
  - Worker logs: look for repeated attempts to pick the same runs, or infra errors that prevent runs from reaching SUBMITTED/RUNNING.

Preventing a recurrence:
- Keep queues/pools segmented so one problematic deployment/tag doesn't block everything else.
- Add an automation or periodic script to cancel PENDING runs older than N hours if they're known to be stale.
- Docs: Cancel workflows
- If you're using tag/global limits, monitor them and occasionally reset stuck leases if you see crashes.

If you can share:
- which "300 job limit" this is (work pool limit, global concurrency limit, or something else),
- your worker type(s) and whether they showed recent heartbeats, and
- whether you use tag/global concurrency limits,

I can help pinpoint which of the above caused the blockage and provide a small script to safely cancel stale PENDING runs older than X days.
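To make the "running vs. pending" check above easier to repeat, here is a minimal sketch of the same query via the Python client, assuming Prefect 3.x; the 1000-run limit mirrors the CLI examples above and is an assumption (counts are capped at that limit):

```python
import asyncio

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType


async def count_runs_by_state() -> None:
    async with get_client() as client:
        for state_type in (StateType.PENDING, StateType.SUBMITTED, StateType.RUNNING):
            # Fetch up to 1000 runs in this state; counts are capped at the limit.
            runs = await client.read_flow_runs(
                flow_run_filter=FlowRunFilter(
                    state=FlowRunFilterState(
                        type=FlowRunFilterStateType(any_=[state_type])
                    )
                ),
                limit=1000,
            )
            print(f"{state_type.value}: {len(runs)} run(s)")


if __name__ == "__main__":
    asyncio.run(count_runs_by_state())
```

The same read-and-cancel pattern from the earlier sketch can be wrapped in a flow and run on a schedule (or triggered by an Automation) to implement the periodic cleanup suggested above.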