# prefect-community
a
👋 prefect 2.0. We're seeing an issue where a flow run never leaves the `Pending` state. We're running with `Process` infrastructure and GitHub storage. The agent runs in the container as PID 1, and I know that the agent process that picked up the flow is still alive (we have the process ping for all the flow runs it's submitted). Because we run many things with strict concurrency, this issue is blocking our flows from running unless we kill the agent or cancel the flow. Any thoughts?
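For context, the deployments look roughly like this (a simplified sketch; the flow, block, and queue names here are placeholders, not our real config):
```python
from prefect import flow
from prefect.deployments import Deployment
from prefect.filesystems import GitHub
from prefect.infrastructure import Process


@flow
def sync_job():
    ...


# Placeholder names; the real deployments are registered the same way:
# GitHub storage block + local Process infrastructure on a concurrency-limited queue.
deployment = Deployment.build_from_flow(
    flow=sync_job,
    name="sync-job",
    work_queue_name="sync-queue",       # queue concurrency limit is set in Cloud
    storage=GitHub.load("our-repo"),    # hypothetical GitHub block name
    infrastructure=Process(),           # flow runs execute as local subprocesses of the agent
)

if __name__ == "__main__":
    deployment.apply()
```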
c
If you're running the agent in a container, you can turn the logging level up to DEBUG to see if there are any flow runs available to it. This seems like either a queue or concurrency issue.
a
If it were a queue or concurrency issue, wouldn't the flow stay in Scheduled and never reach Pending?
Now we're also seeing jobs get stuck in Running even after all tasks have completed. This is causing a bunch of our business-critical sync jobs to not run.
I was able to "fix" the issue with pending runs by enabling an automation that cancels runs stuck in Pending for more than 15 minutes.
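Roughly, the automation does the equivalent of this client-side sketch (not the Cloud automation itself; field and import names are from a recent 2.x and may differ in older versions):
```python
import asyncio
from datetime import timedelta

import pendulum
from prefect import get_client
from prefect.states import Cancelled

STUCK_AFTER = timedelta(minutes=15)


async def cancel_stuck_pending_runs() -> None:
    """Cancel flow runs that have sat in Pending longer than STUCK_AFTER."""
    async with get_client() as client:
        # A FlowRunFilter on state name would narrow this server-side;
        # filtering client-side keeps the sketch version-agnostic.
        runs = await client.read_flow_runs(limit=200)
        now = pendulum.now("UTC")
        for run in runs:
            if run.state_name != "Pending" or run.state is None:
                continue
            if now - run.state.timestamp > STUCK_AFTER:
                # force=True so the orchestrator accepts the transition out of Pending
                await client.set_flow_run_state(run.id, state=Cancelled(), force=True)


if __name__ == "__main__":
    asyncio.run(cancel_stuck_pending_runs())
```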
seeing
```
prefect.engine Engine execution of flow run 'c2dbdc39-7f36-4e31-b3e3-f7275dab1257' aborted by orchestrator: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout")
```
around the time it looks like the flow "completed" but then continued to report its state as Running
c
Are you running in Prefect Cloud, or Prefect Orion?
a
Prefect Cloud
c
Do you have an example that you can use to reproduce this, since it's running in `Process`? When did this behavior start? What version of Prefect are you using, and how is the deployment registered? Once the flows run, do they actually run successfully? Are you able to enable PREFECT_LOGGING_LEVEL=DEBUG on the agent to see it as it's polling for scheduled flow runs?
a
I do not know how to repro it; it seems to happen somewhat randomly to a variety of our deployments' flow runs. Yes, they do run successfully.
We now have debug logging on.
This behavior started probably last week?
Let me see when I saw this log line first.
Okay, I think the Pending issue and the Running issue are different. I see this log line (https://prefect-community.slack.com/archives/CL09KU1K7/p1675438187032929?thread_ts=1675296054.694959&cid=CL09KU1K7) first appear Feb 1 at 8:43am MT.
e
cc: @Emilie Hester (so I get thread updates)
a
Saw this happen a couple more times today with debug logging on, but there wasn't really anything in the logs that shed any more light on what's going on. I see:
```
Feb 3 15:02:10 d559943086fb production-api DEBUG  prefect.task_runner.concurrent Shutting down task runner...
Feb 3 15:02:10 d559943086fb production-api INFO   prefect.engine Engine execution of flow run 'ccd83d61-086d-4dfc-8aaa-ce54e6459568' aborted by orchestrator: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout")
```
But I also don't really see that flow run mentioned after that at all.
I've basically set up a webhook in our logging product that triggers something in our system to manually cancel the run and unblock the work queue, but it's pretty gross and brittle.
Seems like this is probably impacting more people than us; we just happen to run a lot of flows and have pretty stringent concurrency requirements.
c
I haven't had a chance to look at these today, unfortunately; I won't be able to take a look until Monday.
a
This is a pretty bad bug for us, so if there's anything we can do to escalate this further, please let me know.
And I'm not confident that a webhook triggered by a log line will hold up very long.
@Christopher Boyd any updates? I was out last week, and these issues have caused major problems for our team.
c
a
Cool, so if I'm reading this correctly, that's a server-side change and we should not need to upgrade anything.
✅ 1
Great for the Running flows issue; we are still seeing the `Pending` flows issue I initially reported. I'll try to get more data on that.
Is there anything additional we should log on our side for the `Pending` issue? We are already accessing the flow run object, so we can log info about it. Debug logs aren't really showing anything that seems related.
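(For concreteness, this is the kind of thing we can already pull off the flow run via the client; a rough sketch, attribute and method names from a recent 2.x and may vary by version:)
```python
import asyncio
from uuid import UUID

from prefect import get_client


async def dump_flow_run_debug_info(flow_run_id: UUID) -> None:
    """Print the flow run fields and full state history for a run stuck in Pending."""
    async with get_client() as client:
        run = await client.read_flow_run(flow_run_id)
        print(
            f"run={run.id} state={run.state_name} "
            f"work_queue={run.work_queue_name} "
            f"infra_doc={run.infrastructure_document_id}"
        )
        # The state history shows whether the agent ever proposed Pending -> Running
        for state in await client.read_flow_run_states(flow_run_id):
            print(f"{state.timestamp} {state.type.value:<10} {state.name} {state.message or ''}")


if __name__ == "__main__":
    # hypothetical run ID; substitute one of the stuck runs
    asyncio.run(dump_flow_run_debug_info(UUID("c2dbdc39-7f36-4e31-b3e3-f7275dab1257")))
```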
c
Pending usually means it's waiting for execution infrastructure.
For example, with k8s jobs, it means the agent received the flow run, but there might not be enough resources to actually start it on the node.
a
We're using a local process, so k8s constraints or image pulling are not the issue. We are seeing things stay in Pending for hours if we don't intervene manually, and in that time there are periods where flows aren't running, so it seems odd to me that it would be a resource constraint issue. The 7th point in that post concerns me:
"Check if there is more than one agent polling for runs from the same work queue"
Yes, we frequently do this, because we have to run things in a local process right now and we needed to scale that horizontally. I was told this was a fine use case; is that no longer true? I'll try turning down the polling interval.
Additionally, whenever we cancel the pending flow run and restart it, the agent picks it up and runs the flow almost immediately, so it seems to me like resource contention isn't the issue?
e
(cc: @John Weispfenning)
c
Is it possible you’re hitting this issue? https://github.com/PrefectHQ/prefect/issues/8251
There’s a lot going on here, and it’s hard to evaluate without a thorough reproduction / MRE + config
a
We don't have any concurrency limits at the task level, only at the work queue level. I set up more logging yesterday, but I haven't seen the error since I redeployed. I'll update when I have more information. Obviously hoping it's magically fixed, but doubtful since we haven't changed anything about our agent setup. Thanks.
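(For anyone double-checking the same thing: task-tag concurrency limits can be listed via the client, roughly like this. Method and field names are from a recent 2.x and may differ; the `prefect concurrency-limit ls` CLI shows the same data.)
```python
import asyncio

from prefect import get_client


async def list_task_concurrency_limits() -> None:
    """Print every task-tag concurrency limit and how many slots are currently held."""
    async with get_client() as client:
        limits = await client.read_concurrency_limits(limit=100, offset=0)
        if not limits:
            print("no task-tag concurrency limits configured")
        for cl in limits:
            # active_slots holds the task run IDs currently occupying slots;
            # slots that never get released can leave runs stuck (see issue 8251)
            print(f"tag={cl.tag} limit={cl.concurrency_limit} active_slots={len(cl.active_slots)}")


if __name__ == "__main__":
    asyncio.run(list_task_concurrency_limits())
```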