# prefect-community
a
👋 prefect 2.0. We're seeing an issue where a flow run never leaves the `Pending` state. We're running with `Process` infrastructure and GitHub storage. The agent runs in the container as PID 1, and I know that the agent process that picked up the flow is still alive (we have the process ping for all the flow runs it's submitted). Because we run many things with strict concurrency, this issue is blocking our flows from running unless we kill the agent or cancel the flow. Any thoughts?
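For context, the deployments look roughly like this (a simplified sketch; the flow, block, and queue names here are placeholders, not our real config):
```python
from prefect import flow
from prefect.deployments import Deployment
from prefect.filesystems import GitHub
from prefect.infrastructure import Process


@flow
def sync_job():
    ...


# Placeholder names; the real deployments are registered the same way:
# GitHub storage block + local Process infrastructure on a concurrency-limited queue.
deployment = Deployment.build_from_flow(
    flow=sync_job,
    name="sync-job",
    work_queue_name="sync-queue",       # queue concurrency limit is set in Cloud
    storage=GitHub.load("our-repo"),    # hypothetical GitHub block name
    infrastructure=Process(),           # flow runs execute as local subprocesses of the agent
)

if __name__ == "__main__":
    deployment.apply()
```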
c
If you're running the agent in a container, you can turn the logging level up to DEBUG to see if there are any flow runs available to it. This seems like either a queue or concurrency issue.
a
If it were a queue or concurrency issue, wouldn't the flow stay in Scheduled and never reach Pending?
Now we're also seeing jobs get stuck in Running even after all tasks have completed. This is causing a bunch of our business-critical sync jobs to not run.
I was able to "fix" the issue with pending runs by enabling an automation that cancels runs stuck in Pending for more than 15 minutes.
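Roughly, the automation does the equivalent of this client-side sketch (not the Cloud automation itself; field and import names are from a recent 2.x and may differ in older versions):
```python
import asyncio
from datetime import timedelta

import pendulum
from prefect import get_client
from prefect.states import Cancelled

STUCK_AFTER = timedelta(minutes=15)


async def cancel_stuck_pending_runs() -> None:
    """Cancel flow runs that have sat in Pending longer than STUCK_AFTER."""
    async with get_client() as client:
        # A FlowRunFilter on state name would narrow this server-side;
        # filtering client-side keeps the sketch version-agnostic.
        runs = await client.read_flow_runs(limit=200)
        now = pendulum.now("UTC")
        for run in runs:
            if run.state_name != "Pending" or run.state is None:
                continue
            if now - run.state.timestamp > STUCK_AFTER:
                # force=True so the orchestrator accepts the transition out of Pending
                await client.set_flow_run_state(run.id, state=Cancelled(), force=True)


if __name__ == "__main__":
    asyncio.run(cancel_stuck_pending_runs())
```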
seeing
```
prefect.engine Engine execution of flow run 'c2dbdc39-7f36-4e31-b3e3-f7275dab1257' aborted by orchestrator: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout")
```
around the time it looks like the flow "completed" but then continued to report its state as Running
c
Are you running in Prefect Cloud, or Prefect Orion?
a
Prefect Cloud
c
Do you have an example that you can use to reproduce this, since it's running in `Process`? When did this behavior start? What version of Prefect are you using, and how is the deployment registered? Once the flows run, do they actually run successfully? Are you able to enable PREFECT_LOGGING_LEVEL=DEBUG on the agent to see it as it's polling for scheduled flow runs?
a
I do not know how to repro it; it seems to happen somewhat randomly to a variety of our deployments' flow runs. Yes, they do run successfully.
We now have debug logging on.
This behavior started probably last week?
Let me see when I saw this log line first.
Okay, I think the Pending issue and the Running issue are different. I see this log line (https://prefect-community.slack.com/archives/CL09KU1K7/p1675438187032929?thread_ts=1675296054.694959&cid=CL09KU1K7) first appear Feb 1 at 8:43am MT.
e
cc: @Emilie Hester (so I get thread updates)
a
Saw this happen a couple more times today with debug logging on, but there wasn't really anything in the logs that shed any more light on what's going on. I see:
```
Feb 3 15:02:10 d559943086fb production-api DEBUG  prefect.task_runner.concurrent Shutting down task runner...
Feb 3 15:02:10 d559943086fb production-api INFO   prefect.engine Engine execution of flow run 'ccd83d61-086d-4dfc-8aaa-ce54e6459568' aborted by orchestrator: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout")
```
But I also don't really see that flow run mentioned after that at all.
I've basically set up a webhook in our logging product that triggers something in our system to manually cancel the run and unblock the work queue, but it's pretty gross and brittle.
Seems like this is probably impacting more people than us; we just happen to run a lot of flows and have pretty stringent concurrency requirements.
c
I haven't had a chance to look at these today, unfortunately; I won't be able to take a look until Monday.
a
This is a pretty bad bug for us, so if there's anything we can do to escalate this further, please let me know.
And I'm not confident that a webhook triggered by a log line will hold up very long.
@Christopher Boyd any updates? I was out last week, and these issues have caused major problems for our team.
c
a
Cool, so if I'm reading this correctly, that's a server-side change and we should not need to upgrade anything.
✅ 1
Great for the Running flows issue; we are still seeing the `Pending` flows issue I initially reported. I'll try to get more data on that.
Is there anything additional we should log on our side for the `Pending` issue? We are already accessing the flow run object, so we can log info about it. Debug logs aren't really showing anything that seems related.
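(For concreteness, this is the kind of thing we can already pull off the flow run via the client; a rough sketch, attribute and method names from a recent 2.x and may vary by version:)
```python
import asyncio
from uuid import UUID

from prefect import get_client


async def dump_flow_run_debug_info(flow_run_id: UUID) -> None:
    """Print the flow run fields and full state history for a run stuck in Pending."""
    async with get_client() as client:
        run = await client.read_flow_run(flow_run_id)
        print(
            f"run={run.id} state={run.state_name} "
            f"work_queue={run.work_queue_name} "
            f"infra_doc={run.infrastructure_document_id}"
        )
        # The state history shows whether the agent ever proposed Pending -> Running
        for state in await client.read_flow_run_states(flow_run_id):
            print(f"{state.timestamp} {state.type.value:<10} {state.name} {state.message or ''}")


if __name__ == "__main__":
    # hypothetical run ID; substitute one of the stuck runs
    asyncio.run(dump_flow_run_debug_info(UUID("c2dbdc39-7f36-4e31-b3e3-f7275dab1257")))
```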
c
Pending usually means it's waiting for execution infrastructure.
For example, with k8s jobs, it means the agent received the flow run, but there might not be enough resources to actually start it on the node.
a
We're using a local process, so k8s constraints or image pulling are not the issue. We are seeing things stay in Pending for hours if we don't intervene manually, and in that time there are periods where flows aren't running, so it seems odd to me that it would be a resource constraint issue. The 7th point in that post concerns me:
"Check if there is more than one agent polling for runs from the same work queue"
Yes, we frequently do this, because we have to run things in a local process right now and we needed to scale that horizontally. I was told this was a fine use case; is that no longer true? I'll try turning down the polling interval.
Additionally, whenever we cancel the pending flow run and restart it, the agent picks it up and runs the flow almost immediately, so it seems to me like resource contention isn't the issue?
e
(cc: @John Weispfenning)
c
Is it possible you’re hitting this issue? https://github.com/PrefectHQ/prefect/issues/8251
There’s a lot going on here, and it’s hard to evaluate without a thorough reproduction / MRE + config
a
We don't have any concurrency limits at the task level, only at the work queue level. I set up more logging yesterday, but I haven't seen the error since I redeployed. I'll update when I have more information. Obviously hoping it's magically fixed, but doubtful since we haven't changed anything about our agent setup. Thanks.
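(For anyone double-checking the same thing: task-tag concurrency limits can be listed via the client, roughly like this. Method and field names are from a recent 2.x and may differ; the `prefect concurrency-limit ls` CLI shows the same data.)
```python
import asyncio

from prefect import get_client


async def list_task_concurrency_limits() -> None:
    """Print every task-tag concurrency limit and how many slots are currently held."""
    async with get_client() as client:
        limits = await client.read_concurrency_limits(limit=100, offset=0)
        if not limits:
            print("no task-tag concurrency limits configured")
        for cl in limits:
            # active_slots holds the task run IDs currently occupying slots;
            # slots that never get released can leave runs stuck (see issue 8251)
            print(f"tag={cl.tag} limit={cl.concurrency_limit} active_slots={len(cl.active_slots)}")


if __name__ == "__main__":
    asyncio.run(list_task_concurrency_limits())
```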