# marvin-ai
z
@Marvin I seem to be having a problem where jobs get stuck in `Pending` and `Running`. This eats into the concurrency limit (set to 30), so with 5 stuck in Pending and 5 stuck in Running we could only have 20 running flows. That alone would be fine because I've set up automations to move these to Cancelled (we can rerun them later); however, it seems to be a recurring problem. Immediately after cleaning up the pending jobs, the next 5 or so jobs get stuck in Pending as well. Additionally, the concurrency limit seems to be blocked by something else I can't account for. For example, I'll filter for Running, Pending, and Cancelling and see 24 flows. What happened to the other 6 available under the concurrency limit? I've been increasing the concurrency limit, which kicks the problem down the road, but the problem is gradually getting worse. Restarting workers doesn't seem to have an impact. Any direction would be much appreciated.
m
thought for 489.8 seconds
Thanks for the detailed description; a few things can cause this pattern in Prefect 3.x. Let's narrow it down with some concrete checks.

First, a quick model of what can block flow execution:
- Work pool concurrency limit: caps total concurrent flow runs in that pool.
- Worker capacity: each worker has its own concurrency; the effective capacity is min(pool limit, sum of worker limits).
- Global (tag-based) concurrency limits: can hold runs/tasks in an `AwaitingConcurrencySlot` state.
- Stale runs: runs stuck in `Pending` or `Running` can "hold" resources until cleared.

Given your symptoms (some runs stuck in Pending/Running, never reaching the pool limit, and the issue recurring), here's a prioritized checklist.

1) Confirm worker capacity vs. pool limit
- If your work pool limit is 30 but the sum of your active workers' concurrency is 24, you'll only ever see 24 running/pending at once, regardless of the pool limit.
- Check your work pool details:
```
prefect work-pool inspect "<your-pool>"
```
- Check each worker's configured concurrency (the process/Kubernetes/Docker workers have a concurrency/limit setting when you start them). If you're starting workers manually, double-check the options you pass to `prefect worker start` and your base job template. For reference:
```
prefect worker start --help
```
If you can share how many workers you have, their pool, type, and any concurrency/limit you pass when starting them, I can help compute your effective capacity.

2) Inspect the "Pending" flow runs to learn why they're pending
Pick one stuck flow run and inspect it:
```
prefect flow-run inspect <FLOW_RUN_ID>
```
Look closely at:
- the state message (it will often say "awaiting infrastructure", "awaiting concurrency slot", "no available workers", etc.)
- the work pool / work queue
- deployment and infrastructure block metadata (e.g., pod/container provisioning failures)

3) Check for global (tag-based) concurrency limits holding slots
A common hidden blocker is a tag-based concurrency limit that's exhausted or leaked:
```
prefect concurrency-limit ls
prefect concurrency-limit inspect <TAG>
```
Note: `inspect` shows `active_slots` with the TaskRun IDs currently holding a slot. If you see slots held by very old or cancelled runs, you can free them:
```
prefect concurrency-limit reset <TAG>
```
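For context on how these tag-based limits get consumed: a task run only counts against a limit if it carries the matching tag. A minimal hypothetical sketch (the tag and names below are made up, not from your setup):
```python
from prefect import flow, task

# Hypothetical tag: task runs tagged "db" consume slots from a tag-based limit
# created with `prefect concurrency-limit create db 10`.
@task(tags=["db"])
def query_warehouse():
    ...

@flow
def nightly_sync():
    query_warehouse()
```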
Docs: Global concurrency limits (3.x)

4) Look for runs in "Awaiting Concurrency Slot"
Some runs may not show up in your filter when you only include Running/Pending/Cancelling. In the UI, try including all states, or specifically look for "Awaiting Concurrency Slot." If you see many there, they are waiting on a tag-based concurrency limit rather than the work pool limit. There is a known issue on Prefect 3.1.4 where runs in AwaitingConcurrencySlot may not be picked up when a slot frees up:
- Runs in AwaitingConcurrencySlot not picked up (3.1.4)
What version are you on? If it's 3.1.4, upgrading to the latest 3.x is recommended.

5) "Running" but stuck: ensure failed infra/run detection is working
- If a container/pod crashes or a process is killed, runs can linger in Running because the system never observed a clean exit. This can happen if infrastructure fails and the engine can't report back.
- Consider adding timeouts on your flows/deployments to fail fast and release slots (see the sketch at the end of this message).
- Ensure your workers are healthy and can start new infrastructure reliably. Worker healthcheck docs (for monitoring workers): Worker healthchecks

6) Reconcile the "missing 6"
When you see 24 total across Running/Pending/Cancelling but expect 30:
- Most likely: total worker capacity is ~24 even though the pool limit is 30.
- Also possible: some runs are in other states (e.g., AwaitingConcurrencySlot, AwaitingRetry, Late) or tied up by tag-based limits.
- Use the pool inspect and concurrency-limit inspect commands above, and broaden your UI filter to include Awaiting states. If helpful, run:
```
prefect flow-run ls --limit 200
```
then filter in the UI to see exact counts by state and pool.

7) Automations
Your "cancel stuck" automations are smart. You can also:
- Trigger when a run has been Running for > N minutes, to cancel it or mark it crashed.
- Trigger when a flow run has been Pending for > N minutes, to notify or cancel.

If you can share:
- Prefect version (and Cloud vs. OSS)
- Work pool type and name
- Pool concurrency limit
- Number of workers and the concurrency/limit for each
- Whether you're using any tag-based concurrency limits (and which tags)
- A sample `prefect flow-run inspect <id>` from a stuck Pending run
…I can point to the specific root cause and the exact fix.
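To make item 5 concrete, here's a minimal sketch of adding timeouts so a hung run fails and releases its slot instead of sitting in Running. The flow and task names are hypothetical; the `timeout_seconds` keyword is the only point being illustrated.
```python
from prefect import flow, task

# Hypothetical task: timeout_seconds fails a hung step instead of letting it run forever.
@task(timeout_seconds=10 * 60)
def pull_records():
    ...

# Hypothetical flow: if the run exceeds the timeout it is failed (TimedOut),
# which frees the slot it was holding under the work pool concurrency limit.
@flow(timeout_seconds=30 * 60)
def eligibility_refresh():
    pull_records()
```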
z
@Marvin
- Prefect version: 3.4.17 (OSS)
- Work pool type: Process
- Work pool name: `eligibility-work-pool`
- Concurrency limit: 50
- Number of workers: 1 (worker concurrency limit: unlimited)
- Tag-based concurrency: none

Here's a sample flow run of one of the stuck pending runs:
```
FlowRun(
    id='34967369-1a60-4052-bc6f-94617af08665',
    created=DateTime(2025, 9, 16, 2, 37, 39, 794713, tzinfo=Timezone('UTC')),
    updated=DateTime(2025, 9, 16, 2, 43, 50, 271699, tzinfo=Timezone('UTC')),
    name='authentic-llama',
    flow_id='4df4ec49-2e48-4f6a-bde2-326d5abd0ace',
    state_id=UUID('01995068-0f38-7673-b0f0-70fe9340edce'),
    deployment_id=UUID('0d924e4f-8884-45fd-988d-cbdadc424594'),
    deployment_version='89b61d8545f14617b2f626cbcf32b6b6',
    work_queue_name='default',
    parameters={
       REDACTED
    },
    context={},
    empirical_policy=FlowRunPolicy(pause_keys=set()),
    tags=[],
    labels={
        'prefect.flow.id': '4df4ec49-2e48-4f6a-bde2-326d5abd0ace',
        '__OTEL_TRACEPARENT': '00-f72708c0fe099f176419fab679b1e05b-8c0ec36eef267816-01',
        'prefect.worker.name': 'ProcessWorker f9705262-42a9-4d8d-b376-ff1b730ca9f9',
        'prefect.worker.type': 'process',
        'prefect.work-pool.id': '66527145-d1dd-44bf-a489-0d2028547357',
        'prefect.deployment.id': '0d924e4f-8884-45fd-988d-cbdadc424594',
        'prefect.work-pool.name': 'eligibility-work-pool'
    },
    parent_task_run_id=UUID('d32d5127-8634-4b8d-8d59-44a74df5a905'),
    expected_start_time=DateTime(2025, 9, 16, 2, 37, 39, 6893, tzinfo=Timezone('UTC')),
    estimated_start_time_delta=datetime.timedelta(seconds=568, microseconds=723630),
    infrastructure_pid='948',
    work_queue_id=UUID('b1f00dcf-d205-4fbd-8cc3-ccf421d7258c'),
    work_pool_id=UUID('66527145-d1dd-44bf-a489-0d2028547357'),
    work_pool_name='eligibility-work-pool',
    state=State(
        id='01995068-0f38-7673-b0f0-70fe9340edce',
        type=StateType.PENDING,
        name='Pending',
        timestamp=datetime.datetime(2025, 9, 16, 2, 43, 40, 984635, tzinfo=TzInfo(UTC)),
        state_details=StateDetails(
            flow_run_id=UUID('34967369-1a60-4052-bc6f-94617af08665'),
            task_run_id=UUID('d32d5127-8634-4b8d-8d59-44a74df5a905'),
            scheduled_time=DateTime(2025, 9, 16, 2, 37, 39, 6893, tzinfo=Timezone('UTC')),
            transition_id=UUID('c404f442-ce6d-4c5b-987c-3a3a93f43775')
        )
    ),
    job_variables={},
    state_type=StateType.PENDING,
    state_name='Pending'
)
```
Here is my work pool inspect output, for good measure:
```
WorkPool(
    id='66527145-d1dd-44bf-a489-0d2028547357',
    created=DateTime(2025, 7, 25, 21, 14, 40, 639243, tzinfo=Timezone('UTC')),
    updated=DateTime(2025, 9, 16, 2, 54, 7, 226812, tzinfo=Timezone('UTC')),
    name='eligibility-work-pool',
    type='process',
    base_job_template={
        'variables': {
            'type': 'object',
            'properties': {
                'env': {'type': 'object', 'title': 'Environment Variables', 'description': 'Environment variables to set when starting a flow run.', 'additionalProperties': {'anyOf': [{'type': 'string'}, {'type': 'null'}]}},
                'name': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Name', 'default': None, 'description': 'Name given to infrastructure created by a worker.'},
                'labels': {'type': 'object', 'title': 'Labels', 'description': 'Labels applied to infrastructure created by a worker.', 'additionalProperties': {'type': 'string'}},
                'command': {
                    'anyOf': [{'type': 'string'}, {'type': 'null'}],
                    'title': 'Command',
                    'default': None,
                    'description': 'The command to use when starting a flow run. In most cases, this should be left blank and the command will be automatically generated by the worker.'
                },
                'working_dir': {
                    'anyOf': [{'type': 'string', 'format': 'path'}, {'type': 'null'}],
                    'title': 'Working Directory',
                    'default': None,
                    'description': 'If provided, workers will open flow run processes within the specified path as the working directory. Otherwise, a temporary directory will be created.'
                },
                'stream_output': {'type': 'boolean', 'title': 'Stream Output', 'default': True, 'description': 'If enabled, workers will stream output from flow run processes to local standard output.'}
            }
        },
        'job_configuration': {'env': '{{ env }}', 'name': '{{ name }}', 'labels': '{{ labels }}', 'command': '{{ command }}', 'working_dir': '{{ working_dir }}', 'stream_output': '{{ stream_output }}'}
    },
    concurrency_limit=50,
    status=WorkPoolStatus.READY,
    storage_configuration=WorkPoolStorageConfiguration(),
    default_queue_id='b1f00dcf-d205-4fbd-8cc3-ccf421d7258c'
)
```
m
thought for 1220.6 seconds
Thanks for the detailed artifacts, they're super helpful. What I'm seeing:
- The run has a `parent_task_run_id`, so it's a child run (likely from a subflow or `run_deployment`).
- It has been claimed by a Process worker (the labels include `prefect.worker.name` and `prefect.worker.type`) and the worker spawned a process (`infrastructure_pid='948'`).
- However, it never transitioned to RUNNING; it's stuck in PENDING.

In practice, this pattern usually means the worker successfully launched the child process, but the process never initialized the engine (e.g., it hung or crashed very early, before the engine registered "Running"). Typical causes:
- Import-time hangs or crashes (e.g., heavy module imports, environment initialization, deadlocks)
- Early failure in runner startup (before the engine could set state), leaving the run in PENDING
- Resource contention on the worker host when many processes start at once

There was a recent bug fix that may be relevant: "Fix FileExistsError in runner temp directory creation" in 3.4.18. If your child process hit this edge case, it could crash before the engine started and leave the run PENDING. I'd recommend upgrading to confirm whether this resolves the issue going forward.
- Release notes: 3.4.18 release notes

Quick triage checks
1) Worker logs at the time of the run
- Please share the worker logs from between 02:37 and 02:44 UTC. Look for:
  - "Submitted flow run … to infrastructure" and process start messages
  - any stack traces from the child process
  - whether the worker later reports cleanup/retries
2) Is the child process still alive?
- On the worker host, check the PID recorded on the run (`948` in your example):
```
ps -p 948 -o pid,cmd=
```
- If it's not running, the run is orphaned in PENDING (confirming an early crash/hang).
3) Do all stuck runs have a parent_task_run_id?
- If only child runs get stuck (vs. top-level runs), that narrows the issue to how subflows/deployments are launched or initialized.

Helpful CLI commands (verified)
- Inspect a flow run:
```
prefect flow-run inspect 34967369-1a60-4052-bc6f-94617af08665
```
- View the logs for the run (often empty in this scenario, which supports "the engine never started"):
```
prefect flow-run logs 34967369-1a60-4052-bc6f-94617af08665
```
- Inspect the parent task run for clues:
```
prefect task-run inspect d32d5127-8634-4b8d-8d59-44a74df5a905
```
- Cancel a stuck run so it doesn't hold capacity:
```
prefect flow-run cancel 34967369-1a60-4052-bc6f-94617af08665
```
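If checking PIDs one at a time gets tedious, here's a minimal sketch (not something from the thread above) that lists PENDING runs and checks whether each recorded `infrastructure_pid` is still alive. It assumes you run it on the worker host itself and that `infrastructure_pid` is a bare PID, as in your sample; adjust the parsing if yours includes a hostname prefix.
```python
import asyncio
import os

from prefect.client.orchestration import get_client
from prefect.client.schemas import filters
from prefect.client.schemas.objects import StateType


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists on this (POSIX) host."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user


async def main() -> None:
    async with get_client() as client:
        pending = await client.read_flow_runs(
            flow_run_filter=filters.FlowRunFilter(
                state=filters.FlowRunFilterState(
                    type=filters.FlowRunFilterStateType(any_=[StateType.PENDING])
                )
            ),
            limit=200,
        )
        for run in pending:
            if run.infrastructure_pid:
                alive = pid_alive(int(run.infrastructure_pid))
                print(f"{run.id} {run.name}: pid={run.infrastructure_pid} alive={alive}")
            else:
                print(f"{run.id} {run.name}: no infrastructure_pid recorded")


asyncio.run(main())
```
Anything reported as not alive is the "claimed but orphaned" case described above and is a candidate for cancellation.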
Recommended next steps
- Upgrade to 3.4.18 (link above); it includes a runner startup fix that could prevent orphaned PENDING runs in exactly this phase of startup.
- Restart your worker process (to clear any odd state).
- For now, cancel any stuck PENDING runs and re-trigger them.
- If you're launching many child runs at once, consider temporarily reducing the work pool concurrency (e.g., to 10-20) to rule out resource contention on a single worker host:
```
prefect work-pool set-concurrency-limit "eligibility-work-pool" 20
```
  (You can restore it after testing.)
- If these child runs are created via `run_deployment` or subflows, could you share a small snippet showing how they're launched? Patterns like heavy module-level work or environment setup in the child's code can cause early hangs. (A sketch of the kind of snippet I mean follows below.)

Open questions for you
- Are top-level runs fine, with only child runs affected?
- Do you see any errors in the worker logs around the timestamps of these PENDING runs?
- On the worker host, was PID 948 (from your example) actually alive? If not, that strengthens the "early crash before engine start" diagnosis.
- Roughly how many child runs are launched concurrently at peak?

If you can share those details (especially the logs), I can help pinpoint whether this is an environment-initialization issue, a load/concurrency issue, or something we need to dig into more deeply.
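For reference, here's a hypothetical fan-out parent flow illustrating the pattern in question; the flow, deployment, and parameter names are made up, but each `run_deployment` call produces a child run with a `parent_task_run_id`, like the stuck run inspected above.
```python
from prefect import flow
from prefect.deployments import run_deployment


@flow
def fan_out(member_ids: list[str]):
    """Hypothetical parent flow that schedules one child run per member."""
    for member_id in member_ids:
        run_deployment(
            name="eligibility-check/eligibility-deployment",  # hypothetical "flow/deployment" name
            parameters={"member_id": member_id},
            timeout=0,  # return as soon as the child run is scheduled instead of waiting for it
        )
```
If your real launcher looks like this, the number of children scheduled per loop is a useful figure for the "how many at peak" question.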
z
@Marvin I'm sorry, I have to be skeptical about these suggestions. I'm having different behavior in different environments. In production I'm having the issue as we've been discussing. In staging, I'm having the issue where ANY concurrency limit seems to stop workers from picking up any flows. I've set it as high as 2000 and it didn't make a difference. It doesn't matter if it's a top-level flow or a subflow. There are no errors in the logs. If I turn off concurrency limits, it picks up flows; however, I can't leave it off because it will continue to pick up flows until the machine is starved of memory and jobs crash. I've tried turning off concurrency limits, letting it pick up a handful of jobs, then re-enabling it, and those are able to complete fine. Staging and production have the same setup, and they are running on different isolated machines with different isolated databases.
m
thought for 155.6 seconds
Totally fair to be skeptical, and thanks for the extra context. Let's zoom in on the two symptoms and get you back to a stable setup.

What your staging behavior strongly suggests
- When ANY work pool/queue concurrency limit causes zero pickups, the most common cause is that the server thinks the pool/queue is already at (or above) its limit because many existing PENDING/RUNNING flow runs are counted against it.
- If staging has accumulated a large number of old PENDING flow runs (especially from prior tests or failed startups), then setting any finite limit can immediately block new work; turning the limit off lets the worker pick up new runs again, which fits what you're seeing.

Quick checks to confirm
- Work pool and queue state:
```
prefect work-pool inspect "eligibility-work-pool"
prefect work-queue ls -p "eligibility-work-pool"
prefect work-queue inspect default -p "eligibility-work-pool"
```
- How many PENDING/RUNNING runs does staging have right now? If that number is at or above your limit, the worker won't pick up new runs (a count-by-state sketch follows this list):
```
prefect flow-run ls --state PENDING --limit 200
prefect flow-run ls --state RUNNING --limit 200
```
(If this returns a ton of runs, especially very old ones, that explains the "any limit blocks" behavior.)
- Preview queue eligibility (very helpful for seeing what the server will hand out):
```
prefect work-queue preview -p "eligibility-work-pool" default
prefect work-queue read-runs -p "eligibility-work-pool" default
```
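To put numbers on the bullets above, here's a minimal sketch that tallies recent flow runs by state using the same client API as the cleanup script below. The pool name is taken from this thread; point the script at whichever environment (staging or prod) you're checking.
```python
import asyncio
from collections import Counter

from prefect.client.orchestration import get_client
from prefect.client.schemas import filters


async def main() -> None:
    async with get_client() as client:
        # pull a generous window of recent runs for the pool and count them by state name
        runs = await client.read_flow_runs(
            work_pool_filter=filters.WorkPoolFilter(
                name=filters.WorkPoolFilterName(any_=["eligibility-work-pool"])
            ),
            limit=500,  # raise this if your backlog is larger
        )
        counts = Counter(run.state_name for run in runs)
        for state, count in counts.most_common():
            print(f"{state}: {count}")


asyncio.run(main())
```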
If the diagnosis holds, clean up the stale runs and then re-apply a reasonable limit
- Cancel old PENDING runs (e.g., older than N minutes/hours) so they stop counting toward pool concurrency:
```
# ad hoc, one-off:
prefect flow-run cancel <run-id>
```
- To bulk-cancel old PENDING runs on the default queue (staging), you can use this script:
```python
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas import filters
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled

# adjust as needed
QUEUE = "default"
OLDER_THAN = timedelta(hours=6)

async def main():
    cutoff = datetime.now(timezone.utc) - OLDER_THAN
    async with get_client() as client:
        # find runs currently in PENDING on the target queue
        fr_filter = filters.FlowRunFilter(
            state=filters.FlowRunFilterState(
                type=filters.FlowRunFilterStateType(any_=[StateType.PENDING])
            ),
            work_queue_name=filters.FlowRunFilterWorkQueueName(any_=[QUEUE]),
        )
        runs = await client.read_flow_runs(flow_run_filter=fr_filter, limit=1000)
        # PENDING runs never started, so filter on creation time client-side
        stale = [r for r in runs if r.created < cutoff]
        print(f"Found {len(stale)} old PENDING runs on queue={QUEUE}; cancelling...")
        for r in stale:
            try:
                await client.set_flow_run_state(
                    r.id, state=Cancelled(name="Cancelled (cleanup)"), force=True
                )
            except Exception as e:
                print(f"Failed to cancel {r.id}: {e}")

asyncio.run(main())
```
- Then set a sane limit (start low, verify stability, then increase):
```
prefect work-pool set-concurrency-limit "eligibility-work-pool" 10
# or per-queue if you are using queue-level limits
prefect work-queue set-concurrency-limit default -p "eligibility-work-pool" 10
```

Important clarifications to help me pinpoint the gating layer
- Which limiter are you toggling when the worker stops picking up work: the work pool concurrency limit or the work queue concurrency limit?
- Roughly how many PENDING flow runs are present in staging before you set the limit?
- Are both staging and prod on Prefect 3.4.17 for the server and workers?
- When you enable the limit and the worker picks nothing up, what do you see from:
  - `prefect work-queue preview -p eligibility-work-pool default`
  - `prefect work-queue read-runs -p eligibility-work-pool default`

On the production PENDING child runs
- Your sample shows the worker launched a process (`infrastructure_pid=948`) but the engine never marked the run RUNNING. That's typically an early-startup crash/hang. There is a recent fix that could be related: "Fix FileExistsError in runner temp directory creation" in 3.4.18; early runner failures can leave runs in PENDING.
- I recommend upgrading the server and workers to 3.4.18 and retesting:
  - Release notes: 3.4.18 release notes
- Meanwhile, confirm whether those PIDs are actually alive on the worker host. If not, cancel those orphaned PENDING runs (they also count against any pool limit).

Short-term stabilization plan
- In staging:
  - Identify and cancel stale PENDING runs to free up slots.
  - Re-apply a modest concurrency limit (e.g., 10-20) and confirm the worker begins picking up runs again.
- In prod:
  - Upgrade to 3.4.18; restart the worker.
  - If you still see PENDING child runs that never reach RUNNING, share the worker logs for those timestamps and whether the child PID is alive, and I can help trace the exact failure point.

If you can share:
- The exact limiter you're toggling (pool vs. queue)
- Counts of PENDING/RUNNING in staging when the limit is applied
- Results of `work-queue preview` / `read-runs` when pickup stalls
…I'll zero in on the root cause and propose a precise fix.