Zach Chin
08/28/2025, 7:19 PM
I have flow runs getting stuck in Pending and Running. This clogs up the concurrency limit, which is set to 30, so with 5 stuck in Pending and 5 stuck in Running we can only have 20 running flows. That alone would be fine because I've set up automations to move these to Cancelled (which we can rerun later); however, it seems to be a recurring problem. Immediately after cleaning up the pending jobs, the next 5 or so jobs get stuck in Pending as well.
Additionally, the concurrency limit seems to be blocked up by something else I can't account for. For example, I'll filter for Running, Pending, and Cancelling and see 24 flows. What happened to the other 6 available under the concurrency limit? I've been increasing the concurrency limit, which kicks the problem down the road, but the problem is gradually getting worse. Restarting workers doesn't seem to have an impact. Any direction would be much appreciated.
Marvin
08/28/2025, 7:19 PM
Marvin
08/28/2025, 7:28 PM
1) Verify your actual capacity (pool limit and worker limits)
- Inspect the pool's configured concurrency limit:
prefect work-pool inspect "<your-pool>"
- Check each worker’s configured concurrency (the process/k8s/docker workers have a concurrency/limit setting when you start them). If you’re starting workers manually, double‑check the options you pass to prefect worker start
and your base job template. For reference:
prefect worker start --help
If you can share how many workers you have, their pool, type, and any concurrency/limit you pass when starting them, I can help compute your effective capacity.
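If it's easier to script, here's a minimal sketch (assuming Prefect 3.x; the pool name is a placeholder) that reads the pool's configured limit with the Python client:
import asyncio
from prefect.client.orchestration import get_client

POOL_NAME = "<your-pool>"  # placeholder; use your actual work pool name

async def main():
    async with get_client() as client:
        # Read the pool to confirm its type, status, and configured concurrency limit
        pool = await client.read_work_pool(work_pool_name=POOL_NAME)
        print(f"pool={pool.name} type={pool.type} status={pool.status}")
        print(f"pool concurrency_limit={pool.concurrency_limit}")
        # Effective capacity is the lesser of this limit and the sum of each
        # worker's --limit; an "unlimited" worker does not raise the pool cap.

asyncio.run(main())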
2) Inspect the “Pending” flow runs to learn why they’re pending
Pick one stuck flow run and inspect:
prefect flow-run inspect <FLOW_RUN_ID>
Look closely at:
- state message (will often say “awaiting infrastructure”, “awaiting concurrency slot”, “no available workers”, etc.)
- work pool / work queue
- deployment and infrastructure block metadata (e.g., pod/container provisioning failures)
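If you want to pull just those fields programmatically, a small sketch (Prefect 3.x; paste a real run ID in place of the placeholder):
import asyncio
from prefect.client.orchestration import get_client

FLOW_RUN_ID = "<FLOW_RUN_ID>"  # paste the ID of a stuck Pending run

async def main():
    async with get_client() as client:
        run = await client.read_flow_run(FLOW_RUN_ID)
        # The state message is usually where "awaiting infrastructure" /
        # "awaiting concurrency slot" style explanations show up
        print("state:", run.state.type, run.state.name)
        print("message:", run.state.message)
        print("pool/queue:", run.work_pool_name, run.work_queue_name)
        print("deployment:", run.deployment_id)

asyncio.run(main())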
3) Check for global (tag-based) concurrency limits holding slots
A common hidden blocker is a tag-based concurrency limit that’s exhausted or leaked:
prefect concurrency-limit ls
prefect concurrency-limit inspect <TAG>
Note: inspect shows active_slots with TaskRun IDs currently holding a slot. If you see slots held by very old or cancelled runs, you can free them:
prefect concurrency-limit reset <TAG>
Docs: Global concurrency limits (3.x)
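If you'd rather check for leaked slots in code than via the CLI, a sketch along these lines should work (Prefect 3.x; the tag "db" is just an example):
import asyncio
from prefect.client.orchestration import get_client

TAG = "db"  # example tag; pick one from `prefect concurrency-limit ls`

async def main():
    async with get_client() as client:
        limit = await client.read_concurrency_limit_by_tag(TAG)
        print(f"tag={limit.tag} limit={limit.concurrency_limit}")
        # Task run IDs currently holding a slot; very old or cancelled IDs here
        # indicate leaked slots
        print("active_slots:", limit.active_slots)
        # Uncomment to free leaked slots (same effect as the CLI reset):
        # await client.reset_concurrency_limit_by_tag(TAG)

asyncio.run(main())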
4) Look for runs in “Awaiting Concurrency Slot”
Some runs may not show up in your filter when you only include Running/Pending/Cancelling. In the UI, try including all states or specifically look for “Awaiting Concurrency Slot.” If you see many there, they are waiting on a tag-based concurrency limit rather than the work pool limit.
There is a known issue on Prefect 3.1.4 where runs in AwaitingConcurrencySlot may not be picked up when a slot frees up:
- Runs in AwaitingConcurrencySlot not picked up (3.1.4)
What version are you on? If it’s 3.1.4, upgrading to the latest 3.x is recommended.
5) “Running” but stuck — ensure failed infra/run detection is working
- If a container/pod crashes or a process is killed, runs can linger in Running if the system didn’t observe a clean exit. This can happen if infra fails and the engine can’t report back.
- Consider adding timeouts on your flows/deployments to fail fast and release slots (see the sketch after this list).
- Ensure your workers are healthy and can start new infrastructure reliably. Worker healthcheck docs (for monitoring workers): Worker healthchecks
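On the timeout bullet above, a minimal sketch (Prefect 3.x; the flow name and limit are placeholders) of setting timeout_seconds so a hung run fails and releases its slot instead of sitting in Running:
from prefect import flow

# If the run exceeds timeout_seconds, Prefect marks it Failed, which frees
# its slot under the work pool concurrency limit.
@flow(timeout_seconds=1800)
def eligibility_check():
    ...  # your flow logic here

if __name__ == "__main__":
    eligibility_check()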
6) Reconcile the “missing 6”
When you see 24 total across Running/Pending/Cancelling but expect 30:
- Most likely: total worker capacity is ~24 even though pool limit is 30.
- Also possible: some runs are in other states (e.g., AwaitingConcurrencySlot, AwaitingRetry, Late) or tied up by tag-based limits.
Marvin
08/28/2025, 7:28 PM
prefect flow-run ls --limit 200
then filter in the UI to see exact counts by state and pool.
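A rough sketch of tallying runs by state for one pool with the Python client (Prefect 3.x; pool name is a placeholder), which makes the "missing" states easy to spot:
import asyncio
from collections import Counter
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import WorkPoolFilter, WorkPoolFilterName

POOL_NAME = "<your-pool>"  # placeholder; use your actual work pool name

async def main():
    async with get_client() as client:
        runs = await client.read_flow_runs(
            work_pool_filter=WorkPoolFilter(name=WorkPoolFilterName(any_=[POOL_NAME])),
            limit=200,
        )
        # Count by state name so AwaitingConcurrencySlot, Late, AwaitingRetry, etc.
        # show up alongside Running/Pending
        counts = Counter(run.state_name or "Unknown" for run in runs)
        for state, count in counts.most_common():
            print(f"{state}: {count}")

asyncio.run(main())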
7) Automations
Your “cancel stuck” automations are smart. You can also:
- Trigger when a run has been Running for > N minutes to cancel or mark crashed.
- Trigger when a flow is Pending for > N minutes to notify or cancel.
If you can share:
- Prefect version (and Cloud vs. OSS)
- Work pool type and name
- Pool concurrency limit
- Number of workers and the concurrency/limit for each
- Whether you’re using any tag-based concurrency limits (and tags)
- A sample prefect flow-run inspect <id> from a stuck Pending run
…I can point to the specific root cause and the exact fix.
Zach Chin
09/16/2025, 2:59 AM
• Work pool: eligibility-work-pool
• Concurrency limit: 50
• Number of workers: 1
◦ with concurrency limit: Unlimited
• Tag based concurrency: None
Here's a sample flow run of one of the stuck pending runs:
FlowRun(
id='34967369-1a60-4052-bc6f-94617af08665',
created=DateTime(2025, 9, 16, 2, 37, 39, 794713, tzinfo=Timezone('UTC')),
updated=DateTime(2025, 9, 16, 2, 43, 50, 271699, tzinfo=Timezone('UTC')),
name='authentic-llama',
flow_id='4df4ec49-2e48-4f6a-bde2-326d5abd0ace',
state_id=UUID('01995068-0f38-7673-b0f0-70fe9340edce'),
deployment_id=UUID('0d924e4f-8884-45fd-988d-cbdadc424594'),
deployment_version='89b61d8545f14617b2f626cbcf32b6b6',
work_queue_name='default',
parameters={
REDACTED
},
context={},
empirical_policy=FlowRunPolicy(pause_keys=set()),
tags=[],
labels={
'prefect.flow.id': '4df4ec49-2e48-4f6a-bde2-326d5abd0ace',
'__OTEL_TRACEPARENT': '00-f72708c0fe099f176419fab679b1e05b-8c0ec36eef267816-01',
'prefect.worker.name': 'ProcessWorker f9705262-42a9-4d8d-b376-ff1b730ca9f9',
'prefect.worker.type': 'process',
'prefect.work-pool.id': '66527145-d1dd-44bf-a489-0d2028547357',
'prefect.deployment.id': '0d924e4f-8884-45fd-988d-cbdadc424594',
'prefect.work-pool.name': 'eligibility-work-pool'
},
parent_task_run_id=UUID('d32d5127-8634-4b8d-8d59-44a74df5a905'),
expected_start_time=DateTime(2025, 9, 16, 2, 37, 39, 6893, tzinfo=Timezone('UTC')),
estimated_start_time_delta=datetime.timedelta(seconds=568, microseconds=723630),
infrastructure_pid='948',
work_queue_id=UUID('b1f00dcf-d205-4fbd-8cc3-ccf421d7258c'),
work_pool_id=UUID('66527145-d1dd-44bf-a489-0d2028547357'),
work_pool_name='eligibility-work-pool',
state=State(
id='01995068-0f38-7673-b0f0-70fe9340edce',
type=StateType.PENDING,
name='Pending',
timestamp=datetime.datetime(2025, 9, 16, 2, 43, 40, 984635, tzinfo=TzInfo(UTC)),
state_details=StateDetails(
flow_run_id=UUID('34967369-1a60-4052-bc6f-94617af08665'),
task_run_id=UUID('d32d5127-8634-4b8d-8d59-44a74df5a905'),
scheduled_time=DateTime(2025, 9, 16, 2, 37, 39, 6893, tzinfo=Timezone('UTC')),
transition_id=UUID('c404f442-ce6d-4c5b-987c-3a3a93f43775')
)
),
job_variables={},
state_type=StateType.PENDING,
state_name='Pending'
)
Here is my work pool inspect output for good measure:
WorkPool(
id='66527145-d1dd-44bf-a489-0d2028547357',
created=DateTime(2025, 7, 25, 21, 14, 40, 639243, tzinfo=Timezone('UTC')),
updated=DateTime(2025, 9, 16, 2, 54, 7, 226812, tzinfo=Timezone('UTC')),
name='eligibility-work-pool',
type='process',
base_job_template={
'variables': {
'type': 'object',
'properties': {
'env': {'type': 'object', 'title': 'Environment Variables', 'description': 'Environment variables to set when starting a flow run.', 'additionalProperties': {'anyOf': [{'type': 'string'}, {'type': 'null'}]}},
'name': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Name', 'default': None, 'description': 'Name given to infrastructure created by a worker.'},
'labels': {'type': 'object', 'title': 'Labels', 'description': 'Labels applied to infrastructure created by a worker.', 'additionalProperties': {'type': 'string'}},
'command': {
'anyOf': [{'type': 'string'}, {'type': 'null'}],
'title': 'Command',
'default': None,
'description': 'The command to use when starting a flow run. In most cases, this should be left blank and the command will be automatically generated by the worker.'
},
'working_dir': {
'anyOf': [{'type': 'string', 'format': 'path'}, {'type': 'null'}],
'title': 'Working Directory',
'default': None,
'description': 'If provided, workers will open flow run processes within the specified path as the working directory. Otherwise, a temporary directory will be created.'
},
'stream_output': {'type': 'boolean', 'title': 'Stream Output', 'default': True, 'description': 'If enabled, workers will stream output from flow run processes to local standard output.'}
}
},
'job_configuration': {'env': '{{ env }}', 'name': '{{ name }}', 'labels': '{{ labels }}', 'command': '{{ command }}', 'working_dir': '{{ working_dir }}', 'stream_output': '{{ stream_output }}'}
},
concurrency_limit=50,
status=WorkPoolStatus.READY,
storage_configuration=WorkPoolStorageConfiguration(),
default_queue_id='b1f00dcf-d205-4fbd-8cc3-ccf421d7258c'
)
Marvin
09/16/2025, 2:59 AM
Marvin
09/16/2025, 3:19 AM
Here's what that inspect output shows:
- A worker claimed the run (see the labels prefect.worker.name and prefect.worker.type) and the worker spawned a process (infrastructure_pid='948').
- However, it never transitioned to RUNNING; it’s stuck in PENDING.
In practice, this pattern usually means the worker successfully launched the child process, but the process never initialized the engine (e.g., hung or crashed very early before the engine registered “Running”). Typical causes:
- Import-time hangs or crashes (e.g., heavy module imports, environment initialization, deadlocks)
- Early failure in runner startup (before the engine could set state), leaving the run in PENDING
- Resource contention on the worker host when starting many processes at once
There was a recent bug fix that may be relevant: “Fix FileExistsError in runner temp directory creation” in 3.4.18. If your child process hit this edge case, it could crash before the engine started and leave the run PENDING. I’d recommend upgrading to confirm whether this resolves the issue going forward.
- Release notes: 3.4.18 release notes
Quick triage checks
1) Worker logs at the time of the run
- Please share the worker logs (around 0237–0244 UTC). Look for:
- “Submitted flow run … to infrastructure” and process start messages
- Any stack traces from the child process
- Whether the worker later reports cleanup/retries
2) Is the child process still alive?
- On the worker host, check the PID you see on the run (948 in your example):
- ps -p 948 -o pid,cmd=
- If it’s not running, the run is orphaned in PENDING (confirming early crash/hang).
3) Do all stuck runs have a parent_task_run_id?
- If it’s only child runs getting stuck (vs top-level runs), that narrows the issue to how subflows/deployments are launched or initialized.
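A quick way to check that third point in code, if that's easier than eyeballing the UI (Prefect 3.x; this just inspects parent_task_run_id on currently Pending runs):
import asyncio
from prefect.client.orchestration import get_client
from prefect.client.schemas import filters
from prefect.client.schemas.objects import StateType

async def main():
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=filters.FlowRunFilter(
                state=filters.FlowRunFilterState(
                    type=filters.FlowRunFilterStateType(any_=[StateType.PENDING])
                )
            ),
            limit=200,
        )
        # Child runs launched from a parent task/subflow carry a parent_task_run_id
        child = [r for r in runs if r.parent_task_run_id is not None]
        print(f"{len(child)} of {len(runs)} PENDING runs are child runs")

asyncio.run(main())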
Helpful CLI commands (verified)
- Inspect a flow run:
prefect flow-run inspect 34967369-1a60-4052-bc6f-94617af08665
- View logs for a run (often empty in this scenario, which supports “engine never started”):
prefect flow-run logs 34967369-1a60-4052-bc6f-94617af08665
- Inspect the parent task run for clues:
prefect task-run inspect d32d5127-8634-4b8d-8d59-44a74df5a905
- Cancel a stuck run so it doesn’t hold capacity:
prefect flow-run cancel 34967369-1a60-4052-bc6f-94617af08665
Recommended next steps
- Upgrade to 3.4.18 (link above) — it includes a runner startup fix that could prevent orphaned PENDING runs in exactly this phase of startup.
- Restart your worker process (to clear any odd state).
- For now, cancel any stuck PENDING runs and re-trigger them.
- If you’re launching many child runs at once, consider temporarily reducing the work pool concurrency (e.g., set it to 10–20) to rule out resource contention on a single worker host:
prefect work-pool set-concurrency-limit "eligibility-work-pool" 20
(You can restore it after testing.)
- If these child runs are created via run_deployment or subflows, could you share a small snippet showing how they're launched? Patterns like heavy module-level work or environment setup in the child's code can cause early hangs (see the sketch below).
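To illustrate the module-level-work point, here's a hedged sketch of the pattern that tends to cause early hangs and one way to restructure it; the names and the simulated setup are hypothetical, not your code:
import time
from prefect import flow

def connect_to_warehouse():
    """Stand-in for expensive setup (driver imports, network handshakes, etc.)."""
    time.sleep(1)  # simulate slow initialization
    return object()

# Anti-pattern (shown as a comment): doing this at module import time means a hang
# happens before the engine can report "Running", leaving the run parked in Pending:
#   WAREHOUSE = connect_to_warehouse()

@flow
def child_flow(record_id: str) -> str:
    # Safer: defer expensive setup until the flow body is executing, so a hang or
    # crash surfaces as a flow failure instead of an orphaned Pending run.
    warehouse = connect_to_warehouse()
    return f"processed {record_id} with {warehouse!r}"

if __name__ == "__main__":
    child_flow("example-123")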
Open questions for you
- Are top-level runs fine, and only child runs are affected?
- Do you see any errors in the worker logs around the timestamps for these PENDING runs?
- On the worker host, was PID 948 (from your example) actually alive? If not, it strengthens the “early crash before engine start” diagnosis.
- Roughly how many child runs are being launched concurrently at peak?
If you can share those details (especially logs), I can help pinpoint whether this is an environment init issue, a load/concurrency issue, or something we need to dive deeper into.
Zach Chin
09/16/2025, 3:40 AM
Marvin
09/16/2025, 3:40 AM
Marvin
09/16/2025, 3:42 AM
First, confirm what the staging server actually sees:
- Inspect the pool and queue configuration:
prefect work-pool inspect "eligibility-work-pool"
prefect work-queue ls -p "eligibility-work-pool"
prefect work-queue inspect default -p "eligibility-work-pool"
- How many PENDING/RUNNING in staging right now? If this number >= your limit, the worker won’t pick new runs.
prefect flow-run ls --state PENDING --limit 200
prefect flow-run ls --state RUNNING --limit 200
(If this returns a ton of runs, especially very old ones, it explains the “any limit blocks” behavior.)
- Preview queue eligibility (very helpful to see what the server will hand out)
prefect work-queue preview -p "eligibility-work-pool" default
prefect work-queue read-runs -p "eligibility-work-pool" default
If the diagnosis holds, clean up the stale runs and then re-apply a reasonable limit
- Cancel old PENDING runs (e.g., older than N minutes/hours) so they stop counting toward pool concurrency:
# ad hoc, one-off:
prefect flow-run cancel <run-id>
- To bulk-cancel PENDING runs on the default queue (staging), you can use this script:
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas import filters
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled

# adjust as needed
QUEUE = "default"
OLDER_THAN = timedelta(hours=6)

async def main():
    cutoff = datetime.now(timezone.utc) - OLDER_THAN
    async with get_client() as client:
        fr_filter = filters.FlowRunFilter(
            state=filters.FlowRunFilterState(
                type=filters.FlowRunFilterStateType(any_=[StateType.PENDING])
            ),
            work_queue_name=filters.FlowRunFilterWorkQueueName(any_=[QUEUE]),
            # Pending runs have no start_time yet, so filter on when they were
            # expected to start instead
            expected_start_time=filters.FlowRunFilterExpectedStartTime(before_=cutoff),
        )
        runs = await client.read_flow_runs(flow_run_filter=fr_filter, limit=1000)
        print(f"Found {len(runs)} old PENDING runs on queue={QUEUE}; cancelling...")
        for r in runs:
            try:
                # force=True bypasses orchestration rules so stuck Pending runs
                # go straight to Cancelled
                await client.set_flow_run_state(
                    r.id, state=Cancelled(name="Cancelled (cleanup)"), force=True
                )
            except Exception as e:
                print(f"Failed to cancel {r.id}: {e}")

asyncio.run(main())
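A couple of notes if you use it: run it with PREFECT_API_URL pointed at the staging API, adjust QUEUE and OLDER_THAN first, and do a dry run (print the matched run IDs and skip the cancel call) before letting it cancel anything.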
- Then set a sane limit (start low, verify stability, then increase):
prefect work-pool set-concurrency-limit "eligibility-work-pool" 10
# or per-queue if you are using queue-level limits
prefect work-queue set-concurrency-limit default -p "eligibility-work-pool" 10
Important clarifications to help me pinpoint the gating layer
- Which limiter are you toggling when the worker stops picking: work pool concurrency limit or work queue concurrency limit?
- Roughly how many PENDING flow runs are present in staging before you set the limit?
- Are both staging and prod on Prefect 3.4.17 for the server and workers?
- When you enable the limit and the worker picks nothing, what do you see from:
- prefect work-queue preview -p eligibility-work-pool default
- prefect work-queue read-runs -p eligibility-work-pool default
On the production PENDING child runs
Marvin
09/16/2025, 3:42 AM
- As noted above, the worker did spawn a process (infrastructure_pid=948), but the engine never marked the run RUNNING. That's typically an early-startup crash/hang. There is a recent fix that could be related: "Fix FileExistsError in runner temp directory creation" in 3.4.18 — early runner failures can leave runs in PENDING.
- I recommend upgrading server/workers to 3.4.18 and retesting:
- Release notes: 3.4.18 release notes
- Meanwhile, confirm if those PIDs are actually alive on the worker host. If not, cancel those orphaned PENDING runs (they will also count against any pool limit).
Short-term stabilization plan
- In staging:
- Identify and cancel stale PENDING runs to free up slots.
- Re-apply a modest concurrency (e.g., 10–20) and confirm the worker begins picking runs again.
- In prod:
- Upgrade to 3.4.18; restart the worker.
- If you still see PENDING child runs without RUNNING state, share the worker logs for those timestamps and whether the child PID is alive — I can help trace the exact failure point.
If you can share:
- The exact limiter you’re toggling (pool vs queue)
- Counts of PENDING/RUNNING in staging when the limit is applied
- Results of work-queue preview / read-runs when picking stalls
…I’ll zero in on the root cause and propose a precise fix.