# ask-marvin
j
@Marvin I've got a few workers set up to process the same pool, each of which has `--limit=1` to ensure that concurrent runs don't happen on the same machine; concurrency between workers is desirable, though, so no limit has been set on the pool. At some point, one of my workers hit the following:
`Aborted submission of flow run ... Server sent an abort signal: this run is in a PENDING state and cannot transition to a PENDING state`
Presumably this happens when multiple workers race to claim a run. That's normally fine, but the worker which aborted the submission thinks it is running that flow and is now stuck reporting
`Flow run limit reached; 1 flow runs in progress.`
Is this a bug? I'm using prefect 3.0.10 on Windows.
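For reference, each worker is started on its own machine with roughly the following command; the pool name `windows-pool` is just a placeholder:
```bash
# Start a worker that polls the shared pool but runs at most one flow at a time
# on this machine (pool name is a placeholder)
prefect worker start --pool "windows-pool" --limit 1
```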
m
It appears you're encountering a known issue where Prefect workers and flow runs get stuck after an abort signal. From your description, multiple workers raced to claim the run, the server sent an abort, and the worker that aborted the submission still believes it is executing the flow, which is why it reports `Flow run limit reached; 1 flow runs in progress`.

Here's what I've found regarding similar cases:
1. Flow runs can remain in a `Running` state after the agent is interrupted or crashes, leaving them stuck because the transition to a new state never completed (see GitHub Issue 7239).
2. An agent or worker crashing due to resource constraints can leave flow runs stuck in a `PENDING` state because the state transition failed (GitHub Issue 12875).

In your scenario, it looks like the worker that aborted, yet still thought it had claimed the flow run, did not transition properly out of the `PENDING` state and so never released its run slot. This can indeed resemble a bug, particularly in how state management is handled when multiple workers are involved. Although there isn't a direct fix mentioned in the documentation or issues, you might want to:
- Ensure you are on the latest version of Prefect 3.x, as newer releases may include bug fixes.
- Set up logging or monitoring on your workers to capture more detail when this occurs.
- Scale the resources of your workers or review the concurrency configuration to see whether any adjustments stabilize the environment (see the sketch below).

If the problem persists on your current version, consider reporting it as an issue to the Prefect team. If you're curious, you can also check out the related documentation on configuring concurrency limits here and the working mechanisms of workers here. Just think of it as an epic adventure in documentation... well, maybe not that epic, but hopefully useful!
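For example, here is a minimal sketch of checking the version and the pool's concurrency configuration from the CLI, assuming a pool named `windows-pool` (the name and limit are placeholders):
```bash
# Confirm which Prefect version each worker machine is running
prefect version

# Inspect the pool's current configuration, including any concurrency limit
prefect work-pool inspect "windows-pool"

# Optionally cap concurrency at the pool level as well (pool name and limit
# are placeholders; skip this if unlimited pool concurrency is intended)
prefect work-pool set-concurrency-limit "windows-pool" 10
```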
j