# marvin-ai
@Marvin I seem to have a problem where multiple workers (all assigned to the same work pool) attempt to pick up a scheduled flow run which then causes the flow run to get stuck in PENDING. The logs look like this:
```
Worker 'ProcessWorker 06e966df-0a35-4265-a440-8ad861f6caab' submitting flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'
07:05:05 PM | prefect.flow_runs.worker

Worker 'ProcessWorker 56a971a0-d860-4469-b5a5-eb94813beec8' submitting flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'
07:05:06 PM | prefect.flow_runs.worker

Worker 'ProcessWorker 4a37f28d-eaa8-41e4-8d0d-f015a66fe98b' submitting flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'
07:05:08 PM | prefect.flow_runs.worker

Aborted submission of flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'. Server sent an abort signal: This run is in a PENDING state and cannot transition to a PENDING state.
07:05:16 PM | prefect.flow_runs.worker

Aborted submission of flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'. Server sent an abort signal: This run is in a PENDING state and cannot transition to a PENDING state.
07:05:19 PM | prefect.flow_runs.worker

Opening process...
07:05:30 PM | prefect.flow_runs.runner

Completed submission of flow run 'e33f8d75-d3f5-41ba-8dce-3bab1d0de71f'
07:05:39 PM | prefect.flow_runs.worker

 > Running set_working_directory step...
```
I noticed this much more when I tried scaling workers up to handle a large backlog, but after scaling down I notice it still happens, just less frequently.
thought for 499.8 seconds
Thanks for the detailed logs; that helps. What you're seeing has two parts:

1) Aborted submission messages
- With multiple workers polling the same work pool, it's normal for more than one worker to race for the same eligible flow run.
- The server uses optimistic orchestration: one worker "wins" and moves the run to PENDING, and the others get `Aborted submission ... This run is in a PENDING state and cannot transition to a PENDING state.`
- Those aborted messages are expected under load and are harmless by themselves.

2) Runs stuck in PENDING (not expected)
If a run remains in PENDING and doesn't advance to RUNNING, that usually points to one of these:
- Client/API timeout during state proposal: the worker set the run to PENDING on the server but timed out locally and retried; subsequent attempts are rejected while the run sits in PENDING. This is a known class of behavior in Prefect 2.x and can appear in 3.x under network strain. See issue #10263.
- Worker interruption around retries or handoff: there are edge cases when a worker is replaced while a flow has in-process retries; runs can get stuck in loops or blocked transitions. See issue #16126.
- Local concurrency limit accounting bug after an aborted submission: the worker thinks it's "using" a slot even though the run was never claimed, which makes the worker appear idle or stuck. See issue #15952. Workaround: restart the affected worker.

(If you need to unstick runs that are already sitting in PENDING, see the recovery sketch at the end of this message.)

What you can do to mitigate:
- Upgrade to the latest Prefect 3.x. There have been ongoing fixes in worker orchestration and retries. If you share your exact version, I can confirm whether any relevant patches landed after it.
- Increase the API request timeout for workers to reduce "server succeeded but client retried" cases:
```
PREFECT_API_REQUEST_TIMEOUT=60 prefect worker start -p your-pool
```
  If you see httpx timeout messages in worker logs, this typically helps.
- Partition the workload to reduce contention:
  - Use separate work queues and point different workers at different queues (a deployment sketch for routing runs to those queues follows this list):
```
prefect worker start -p your-pool -q queue-a -n worker-a
prefect worker start -p your-pool -q queue-b -n worker-b
```
  - If you're pushing through a large backlog, this spreads runs out and reduces simultaneous claims on the exact same run.
- Throttle total concurrency at the pool to smooth bursts:
```
prefect work-pool set-concurrency-limit "your-pool" 10
```
  This won't eliminate races entirely, but it can reduce the blast radius during spikes.
- If you use per-worker limits (e.g., `--limit 1`) and you notice "Flow run limit reached" after an aborted submission, that's the bug in #15952. Workaround: restart the worker process to clear the stuck local counter.
- Collect more signal when it happens:
  - Run workers with debug logs to catch timeouts/retries:
```
PREFECT_LOGGING_LEVEL=DEBUG prefect worker start -p your-pool
```
  - Note any `httpx.ReadTimeout`, `SSLWantReadError`, or retry logs around the moment runs get stuck in PENDING.
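For the queue-partitioning idea above, here is a minimal sketch of the deployment side, i.e. how runs end up on specific queues that the worker commands then divide up. It assumes Prefect 3.x with source-based deployments; the repo URL, entrypoint, pool, and queue names are placeholders, not values from your setup.

```python
# Sketch: route two deployments of the same flow to different work queues in
# the same pool. All URLs and names here are placeholders (assumptions).
from prefect import flow

if __name__ == "__main__":
    remote_flow = flow.from_source(
        source="https://github.com/your-org/your-repo",  # placeholder repo
        entrypoint="flows/my_flow.py:my_flow",           # placeholder entrypoint
    )
    # A worker started with `-q queue-a` only claims runs from queue-a, so
    # splitting deployments across queues reduces how many workers can race
    # for any single run.
    remote_flow.deploy(
        name="my-flow-queue-a",
        work_pool_name="your-pool",
        work_queue_name="queue-a",
    )
    remote_flow.deploy(
        name="my-flow-queue-b",
        work_pool_name="your-pool",
        work_queue_name="queue-b",
    )
```

The key piece is `work_queue_name`: it pins a deployment's runs to one queue, and the `prefect worker start ... -q ...` commands above then split those queues between workers.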
A few clarifying questions so I can be more specific:
- Which Prefect version are you on?
- Cloud or OSS Server?
- What worker type are you using (Process, Docker, Kubernetes)?
- Are you setting `--limit` on workers, or using pool/queue-level concurrency limits?
- When you say "stuck in PENDING", for how long do they stay there, and do you see any network timeout messages in worker logs at that time?
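To help answer the version and timeout questions, a quick check you can run in the same environment as the workers might look like this (a small sketch, assuming a recent Prefect 2.x/3.x install where these settings objects are importable):

```python
# Print the version and the client settings a worker in this environment
# would resolve at startup. Assumes a recent Prefect 2.x/3.x install.
import prefect
from prefect.settings import PREFECT_API_REQUEST_TIMEOUT, PREFECT_API_URL

print("Prefect version:    ", prefect.__version__)
print("API URL:            ", PREFECT_API_URL.value())
print("API request timeout:", PREFECT_API_REQUEST_TIMEOUT.value())
```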
Related references:
- Duplicate PENDING after client timeout: #10263
- Retries/worker replacement loop: #16126
- Local concurrency stuck after abort (Prefect 3.x): #15952
- Settings overview (for `PREFECT_API_REQUEST_TIMEOUT`): Prefect settings
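And if you need to clear runs that are already stuck right now, here is the recovery sketch mentioned above. It is a minimal example using the Prefect client API, assuming a recent 2.x/3.x install; the 15-minute age threshold is an arbitrary placeholder.

```python
# Sketch: find flow runs that have sat in PENDING for a while and force them
# back to SCHEDULED so a worker can claim them again.
import asyncio
from datetime import datetime, timedelta, timezone

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Scheduled

STUCK_AFTER = timedelta(minutes=15)  # placeholder; tune to your environment


async def reschedule_stuck_pending_runs() -> None:
    async with get_client() as client:
        # All flow runs currently in a PENDING state
        pending_runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.PENDING])
                )
            )
        )
        cutoff = datetime.now(timezone.utc) - STUCK_AFTER
        for run in pending_runs:
            # Only touch runs whose PENDING state is older than the cutoff
            if run.state and run.state.timestamp < cutoff:
                print(f"Rescheduling {run.name} ({run.id})")
                await client.set_flow_run_state(
                    flow_run_id=run.id, state=Scheduled(), force=True
                )


if __name__ == "__main__":
    asyncio.run(reschedule_stuck_pending_runs())
```

Note that `force=True` bypasses orchestration rules, so only point something like this at runs you're confident are orphaned, not runs a worker is actively submitting.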