Andreas Nigg

11/29/2022, 8:19 AM
Hey 👋 Prefect 2.0 Cloud user here. For some time now I've been seeing individual flow runs get stuck in the Pending state. One flow runs 96 times per day, and about once or twice per week a run gets stuck in run state "Pending" and never leaves it (one entered Pending 2 days ago and was still there today). There are no logs in the agent about this flow run either. Without me changing anything, the next flow run, scheduled 15 minutes later, runs successfully. (I was able to identify multiple different flows for which this happened, so it does not seem to be related to a specific flow.) Any ideas what I can do to debug this issue? I fear it's not really in my hands, is it?
And the follow-up: unfortunately, these stuck "Pending" runs block a work queue if concurrency is set to 1. I have a use case where it's of utmost importance that only one flow runs at a time, so I use the concurrency setting on the queue. However, is there a setting to stop flow runs if they are pending for too long? (Like a pending timeout?) Or would this be a reasonable feature request?
Here are some of the flows stuck in Pending (not sure if this helps though 🙂). Does it help if I provide flow run IDs or things like that?
The flows use either Process-type infrastructure or k8s jobs. Both can get stuck in Pending.

Javier Ruere

11/29/2022, 9:25 AM
This also happened to me on a local run. Are you mixing async and sync tasks?

Andreas Nigg

11/29/2022, 10:01 AM
No. E.g. the "trigger_airbyte" flow you see in the above screenshot is almost 1:1 the prefect-airbyte sample code from Prefect's docs. It's only a very rare occurrence, so I'd assume the general flow structure is fine (?). Some of the flows run 96 times per day and only get stuck in Pending once per week.

Javier Ruere

11/29/2022, 10:04 AM
It's really super simple.

Stéphan Taljaard

12/05/2022, 5:12 AM
Hi, I also have random occurrences of flow runs never leaving the Pending state. I'm using Process infra.

Anna Geller

12/05/2022, 8:30 AM
Any ideas what I can do to debug this issue?
Runs stuck in a Pending state usually indicate an issue with the agent. You could increase the resource allocation of your agent (e.g. more memory), or you could decrease the polling frequency (the default should be 5 sec; you could make it 30 sec to poll less frequently and free up agent resources):
prefect config set PREFECT_AGENT_QUERY_INTERVAL='30.0'
Checking the agent logs may also help. If that doesn't help, can you open a GitHub issue with more details about your setup? Thanks a lot.
is there a setting to stop flow runs if they are pending for too long?
We are working on an SLA feature that will allow you to cancel such a flow run if it runs or is stuck in Pending for longer than X and reschedule it (trigger a new run). Not released yet, but 🔜

Andreas Nigg

12/05/2022, 8:45 AM
I see, thanks for that explanation. On a hunch I started 2 additional agents, and last week there were no pending jobs anymore. I also updated to 2.7.0, where it seems you increased the default polling interval to 10 seconds. So, all in all, mystery solved 😄 If you don't mind the question: Why is an under-resourced agent causing flow runs stuck in pending? The agent did not crash, etc. This SLA feature will be awesome 👍

Anna Geller

12/05/2022, 10:16 AM
Ahh, you're spot on. I've heard from another user that they had a similar issue, where some runs were stuck in Pending when two agents were polling from the same queue.
Why is an under-resourced agent causing flow runs stuck in pending?
Because it has no capacity to poll for new runs and deploy them to the infra. The runs get picked up by the agent, but the agent is "too busy" to actually create infra for them because it keeps switching tasks, so those runs stay in Pending and keep waiting until the agent "gets to it" and deploys the run to the infra. That's my understanding, could be wrong.

Stéphan Taljaard

12/05/2022, 11:10 AM
Your explanation of the "stuck pending" flows makes sense. The sad thing is, from what I can see, if the agent is too busy, it still does not "get to" those flows picked from the queue even after it gets freed up a while later. They always stay pending, even days later. That could be the check then: have a configurable setting for how long (x), such that if a flow is still "Pending" after x time has passed, it gets revived and rerun.

Anna Geller

12/05/2022, 12:41 PM
Makes sense. This will be possible very soon in Cloud using Automations. You'll be able to configure: if any flow run with tag prod stays in Pending state for longer than 10 min, cancel that run and trigger a new one. But in OSS you would need to do it with, for example, a custom script that queries the API for runs in a Pending state and how long they have been in that state, and cancels + recreates runs this way.
🙌 1
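For anyone who wants to try the custom-script route mentioned above, here is a minimal sketch, not an official Prefect tool. It assumes a self-hosted Prefect 2 API at a placeholder `API_URL`, and it assumes the REST endpoints `POST /flow_runs/filter`, `POST /flow_runs/{id}/set_state` and `POST /deployments/{id}/create_flow_run` accept the payloads shown, so please check them against the API reference for your version. The function names and the 10-minute threshold are made up for illustration. For Cloud you would additionally need your workspace URL and an authorization header.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: base URL of a self-hosted Prefect 2 API; adjust to your setup.
API_URL = "http://127.0.0.1:4200/api"


def parse_ts(value):
    """Parse an ISO 8601 timestamp, treating naive values as UTC."""
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def stale_pending_runs(runs, now, max_pending=timedelta(minutes=10)):
    """Return runs in PENDING whose current state is older than max_pending."""
    stale = []
    for run in runs:
        state = run.get("state") or {}
        if state.get("type") != "PENDING" or not state.get("timestamp"):
            continue
        if now - parse_ts(state["timestamp"]) > max_pending:
            stale.append(run)
    return stale


def post(path, payload):
    """POST a JSON payload to the API and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{API_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def cancel_and_recreate_stale_runs(max_pending=timedelta(minutes=10)):
    """Cancel stale Pending runs and trigger a new run of their deployment."""
    # Assumed filter payload: only flow runs whose state type is PENDING.
    pending = post(
        "/flow_runs/filter",
        {"flow_runs": {"state": {"type": {"any_": ["PENDING"]}}}},
    )
    now = datetime.now(timezone.utc)
    for run in stale_pending_runs(pending, now, max_pending):
        # force=True asks the server to skip orchestration rules (assumption).
        post(
            f"/flow_runs/{run['id']}/set_state",
            {"state": {"type": "CANCELLED"}, "force": True},
        )
        # Ad-hoc runs without a deployment cannot be recreated this way.
        if run.get("deployment_id"):
            post(f"/deployments/{run['deployment_id']}/create_flow_run", {})
```

Calling `cancel_and_recreate_stale_runs()` from a cron job or a small scheduled flow every few minutes would approximate the "pending timeout" discussed in this thread. Keeping the selection logic in `stale_pending_runs` separate from the HTTP calls makes it possible to test the threshold without a running server.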