# marvin-ai
@Marvin I seem to have a problem where multiple workers (all assigned to the same work pool) attempt to pick up a scheduled flow run which then causes the flow run to get stuck in PENDING. The logs look like this:
```
Worker 'ProcessWorker 06e966df-0a35-4265-a440-8ad861f6caab' submitting flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'
07:05:05 PM | prefect.flow_runs.worker

Worker 'ProcessWorker 56a971a0-d860-4469-b5a5-eb94813beec8' submitting flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'
07:05:06 PM | prefect.flow_runs.worker

Worker 'ProcessWorker 4a37f28d-eaa8-41e4-8d0d-f015a66fe98b' submitting flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'
07:05:08 PM | prefect.flow_runs.worker

Aborted submission of flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'. Server sent an abort signal: This run is in a PENDING state and cannot transition to a PENDING state.
07:05:16 PM | prefect.flow_runs.worker

Aborted submission of flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'. Server sent an abort signal: This run is in a PENDING state and cannot transition to a PENDING state.
07:05:19 PM | prefect.flow_runs.worker

Opening process...
07:05:30 PM | prefect.flow_runs.runner

Completed submission of flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'
07:05:39 PM | prefect.flow_runs.worker

 > Running set_working_directory step...
```
I noticed this much more when I tried scaling workers up to handle a large backlog, but after scaling down I notice it still happens, just less frequently.
thought for 499.8 seconds
Thanks for the detailed logs; that helps. What you're seeing has two parts:

1) Aborted submission messages
- With multiple workers polling the same work pool, it's normal for more than one worker to race for the same eligible flow run.
- The server uses optimistic orchestration: one worker "wins" and moves the run to PENDING, and the others get `Aborted submission ... This run is in a PENDING state and cannot transition to a PENDING state.`
- Those aborted messages are expected under load and are harmless by themselves.

2) Runs stuck in PENDING (not expected)
If a run remains in PENDING and doesn't advance to RUNNING, that usually points to one of these:
- Client/API timeout during state proposal: the worker set the run to PENDING on the server but timed out locally and retried; subsequent attempts are rejected while the run sits in PENDING. This is a known class of behavior in Prefect 2.x and can appear in 3.x under network strain. See issue #10263.
- Worker interruption around retries or handoff: there are edge cases when a worker is replaced while a flow has in-process retries; runs can get stuck in loops or blocked transitions. See issue #16126.
- Local concurrency limit accounting bug after an aborted submission: the worker thinks it's "using" a slot even though the run was never claimed, which makes the worker appear idle or stuck. See issue #15952. Workaround: restart the affected worker.

(If you need to unstick runs that are already sitting in PENDING, see the recovery sketch at the end of this message.)

What you can do to mitigate:
- Upgrade to the latest Prefect 3.x. There have been ongoing fixes in worker orchestration and retries. If you share your exact version, I can confirm whether any relevant patches landed after it.
- Increase the API request timeout for workers to reduce "server succeeded but client retried" cases:
```
PREFECT_API_REQUEST_TIMEOUT=60 prefect worker start -p your-pool
```
  If you see httpx timeout messages in worker logs, this typically helps.
- Partition the workload to reduce contention:
  - Use separate work queues and point different workers at different queues (a deployment sketch for routing runs to those queues follows this list):
```
prefect worker start -p your-pool -q queue-a -n worker-a
prefect worker start -p your-pool -q queue-b -n worker-b
```
  - If you're pushing through a large backlog, this spreads runs out and reduces simultaneous claims on the exact same run.
- Throttle total concurrency at the pool to smooth bursts:
```
prefect work-pool set-concurrency-limit "your-pool" 10
```
  This won't eliminate races entirely, but it can reduce the blast radius during spikes.
- If you use per-worker limits (e.g., `--limit 1`) and you notice "Flow run limit reached" after an aborted submission, that's the bug in #15952. Workaround: restart the worker process to clear the stuck local counter.
- Collect more signal when it happens:
  - Run workers with debug logs to catch timeouts/retries:
```
PREFECT_LOGGING_LEVEL=DEBUG prefect worker start -p your-pool
```
  - Note any `httpx.ReadTimeout`, `SSLWantReadError`, or retry logs around the moment runs get stuck in PENDING.
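For the queue-partitioning idea above, here is a minimal sketch of the deployment side, i.e. how runs end up on specific queues that the worker commands then divide up. It assumes Prefect 3.x with source-based deployments; the repo URL, entrypoint, pool, and queue names are placeholders, not values from your setup.

```python
# Sketch: route two deployments of the same flow to different work queues in
# the same pool. All URLs and names here are placeholders (assumptions).
from prefect import flow

if __name__ == "__main__":
    remote_flow = flow.from_source(
        source="https://github.com/your-org/your-repo",  # placeholder repo
        entrypoint="flows/my_flow.py:my_flow",           # placeholder entrypoint
    )
    # A worker started with `-q queue-a` only claims runs from queue-a, so
    # splitting deployments across queues reduces how many workers can race
    # for any single run.
    remote_flow.deploy(
        name="my-flow-queue-a",
        work_pool_name="your-pool",
        work_queue_name="queue-a",
    )
    remote_flow.deploy(
        name="my-flow-queue-b",
        work_pool_name="your-pool",
        work_queue_name="queue-b",
    )
```

The key piece is `work_queue_name`: it pins a deployment's runs to one queue, and the `prefect worker start ... -q ...` commands above then split those queues between workers.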
A few clarifying questions so I can be more specific:
- Which Prefect version are you on?
- Cloud or OSS Server?
- What worker type are you using (Process, Docker, Kubernetes)?
- Are you setting `--limit` on workers, or using pool/queue-level concurrency limits?
- When you say "stuck in PENDING", for how long do they stay there, and do you see any network timeout messages in worker logs at that time?
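To help answer the version and timeout questions, a quick check you can run in the same environment as the workers might look like this (a small sketch, assuming a recent Prefect 2.x/3.x install where these settings objects are importable):

```python
# Print the version and the client settings a worker in this environment
# would resolve at startup. Assumes a recent Prefect 2.x/3.x install.
import prefect
from prefect.settings import PREFECT_API_REQUEST_TIMEOUT, PREFECT_API_URL

print("Prefect version:    ", prefect.__version__)
print("API URL:            ", PREFECT_API_URL.value())
print("API request timeout:", PREFECT_API_REQUEST_TIMEOUT.value())
```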
Related references:
- Duplicate PENDING after client timeout: #10263
- Retries/worker replacement loop: #16126
- Local concurrency stuck after abort (Prefect 3.x): #15952
- Settings overview (for `PREFECT_API_REQUEST_TIMEOUT`): Prefect settings
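And if you need to clear runs that are already stuck right now, here is the recovery sketch mentioned above. It is a minimal example using the Prefect client API, assuming a recent 2.x/3.x install; the 15-minute age threshold is an arbitrary placeholder.

```python
# Sketch: find flow runs that have sat in PENDING for a while and force them
# back to SCHEDULED so a worker can claim them again.
import asyncio
from datetime import datetime, timedelta, timezone

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Scheduled

STUCK_AFTER = timedelta(minutes=15)  # placeholder; tune to your environment


async def reschedule_stuck_pending_runs() -> None:
    async with get_client() as client:
        # All flow runs currently in a PENDING state
        pending_runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.PENDING])
                )
            )
        )
        cutoff = datetime.now(timezone.utc) - STUCK_AFTER
        for run in pending_runs:
            # Only touch runs whose PENDING state is older than the cutoff
            if run.state and run.state.timestamp < cutoff:
                print(f"Rescheduling {run.name} ({run.id})")
                await client.set_flow_run_state(
                    flow_run_id=run.id, state=Scheduled(), force=True
                )


if __name__ == "__main__":
    asyncio.run(reschedule_stuck_pending_runs())
```

Note that `force=True` bypasses orchestration rules, so only point something like this at runs you're confident are orphaned, not runs a worker is actively submitting.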