# ask-community
Brian:
Hey all, I'm having some issues where flow runs will hang indefinitely. There seem to be two problems here:
1. The root cause of why a flow will sometimes just run indefinitely and stop logging after some amount of time.
2. The flow not timing out even though I've configured it with
@flow(timeout_seconds=16200)
for a 4.5-hour timeout.
I'm mostly trying to solve the second one here. I see that some flows do actually hit this timeout, while others will run for 9 hours and still not hit it. The documentation has a note saying "Flow execution may continue until the next task is called", and I'm wondering if that could be causing my issue. I'm not actually using tasks per se, but I do notice that the flows that do not time out have no logs for hours, whereas the ones that do time out have logs right up until the timeout. Ironically, that's the opposite of the behavior I want, which would be timing out only if there is no activity for a long time, but that's neither here nor there. Any help would be much appreciated!
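To put the docs note in code, here's roughly the difference as I understand it (task names and durations are made up): a flow that does all of its work in one long blocking call may give the timeout nothing to hook into until that call returns, whereas splitting the work into task calls gives Prefect regular boundaries where the timeout can be enforced.
```
from time import sleep

from prefect import flow, task


# Hypothetical stand-in for the real work; name and duration are made up.
@task
def process_chunk(i: int) -> None:
    sleep(60)  # pretend this is one unit of real work


# Variant A: one long blocking call in the flow body. Per the docs note
# ("Flow execution may continue until the next task is called"), the 4.5 h
# timeout may not interrupt this until the call returns.
@flow(timeout_seconds=16200)
def monolithic_flow() -> None:
    sleep(9 * 60 * 60)  # hours of work with no task boundaries


# Variant B: the same work broken into task calls, so there are regular
# points between tasks where the timeout can take effect.
@flow(timeout_seconds=16200)
def chunked_flow(n_chunks: int = 270) -> None:
    for i in range(n_chunks):
        process_chunk(i)


if __name__ == "__main__":
    chunked_flow(n_chunks=3)
```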
Bianca:
Hi Brian! What kind of infrastructure are you using for your flow runs? In some cases where this has been reported, the underlying infrastructure failed, which can result in flows being stuck in a
Running
state indefinitely. A feature called Runner Heartbeats was introduced in version 3.1.8 which could help. It requires a little bit of setup at first, which is outlined in the release notes.
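For reference, a rough sketch of what that setup might look like once you're on 3.1.8+ and deploying to a Kubernetes work pool. The deployment name, image, and work pool are placeholders, and it assumes the heartbeat frequency is controlled by the PREFECT_RUNNER_HEARTBEAT_FREQUENCY setting described in the release notes, so double-check the exact name and any server-side pieces the notes call out:
```
from prefect import flow


@flow(timeout_seconds=16200)
def my_long_flow() -> None:
    ...


if __name__ == "__main__":
    # Hypothetical deployment: the work pool, image, and 30 s frequency are
    # placeholders. The env var name assumes the setting introduced in 3.1.8;
    # confirm it against the release notes.
    my_long_flow.deploy(
        name="my-long-flow",
        work_pool_name="gke-pool",
        image="us-docker.pkg.dev/my-project/my-repo/my-flow:latest",
        job_variables={"env": {"PREFECT_RUNNER_HEARTBEAT_FREQUENCY": "30"}},
    )
```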
Brian:
Hey Bianca, we're running a Prefect worker (2.20.4) in our Google Kubernetes Engine cluster.
Bianca:
Gotcha. The switch from 2.0 to 3.0 should be pretty straightforward since you're already using a worker, and that way you can take advantage of the heartbeats feature. As for why the timeouts defined in the flow decorator aren't being enforced, my suspicion is that the pods could have been evicted or restarted during the flow run's execution.
Brian:
Yeah, I thought the same, but I don't see any restarts and the pod is still in a running state. I may try upgrading to 3.x though.
Bianca:
Ah, good thing you checked. Yup, try the 3.0 upgrade. Another thing you could try (at least while you're still running 2.0) is setting up an automation to enforce an SLA, i.e. flow runs that are in a
Running
state for longer than 5 hours get marked as failed. You can add an additional action on top that sends you a notification whenever this occurs.
That way, even if the proverbial infrastructure rug is pulled out from under your flow run, the server is able to monitor the flow run and handle the state.
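Automations themselves are usually configured in the UI, but as a rough sketch of the same SLA idea in code (not the Automations feature itself), something like the script below could run on a schedule and fail any run that has been Running past the threshold. The 5-hour cutoff mirrors the example above; the specific client filter classes are what I'd expect them to be, so verify them against the client docs for your Prefect version:
```
import asyncio
from datetime import datetime, timedelta, timezone

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterStartTime,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Failed

SLA = timedelta(hours=5)  # threshold from the example above


async def fail_stuck_runs() -> None:
    cutoff = datetime.now(timezone.utc) - SLA
    async with get_client() as client:
        # Runs still in Running that started before the cutoff.
        stuck = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                ),
                start_time=FlowRunFilterStartTime(before_=cutoff),
            )
        )
        for run in stuck:
            # force=True so the server accepts the transition even if the
            # (possibly dead) runner never reports back.
            await client.set_flow_run_state(
                flow_run_id=run.id, state=Failed(), force=True
            )


if __name__ == "__main__":
    asyncio.run(fail_stuck_runs())
```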
Brian:
Interestingly, I had tried setting up an automation before using the flow timeout, and it was just not being triggered consistently either. Screenshot of what I had.
To be fair, my previous long runs were running into an issue with 410 exceptions where it seemed like the Prefect flow was losing track of the associated pod.
So I was thinking perhaps that is why the automation wasn't working.
But now that I've upgraded to 2.20.4, I'm no longer seeing the 410s.
Bianca:
That could very well be why that automation wasn't working to begin with. It may be worth re-creating it. If it still doesn't work, a bug report would be appreciated 🙏