# ask-community
Brian:
Hey all, I'm having some issues where flow runs will hang indefinitely. There seem to be two problems here:
1. The root cause of why a flow will sometimes just run indefinitely and stop logging after some amount of time.
2. The flow not timing out even though I've configured it with
@flow(timeout_seconds=16200)
for a 4.5-hour timeout.
I'm mostly trying to solve the second one here. I see that some flows do actually hit this timeout, while others will run for 9 hours and still not hit it. The documentation has a note saying "Flow execution may continue until the next task is called", and I'm wondering if that could be causing my issue. I'm not actually using tasks per se, but I do notice that the flows that do not time out have no logs for hours, whereas the ones that do time out have logs right up until the timeout. Ironically, that's the opposite of the behavior I want, which would be timing out only if there is no activity for a long time, but that's neither here nor there. Any help would be much appreciated!
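To put the docs note in code, here's roughly the difference as I understand it (task names and durations are made up): a flow that does all of its work in one long blocking call may give the timeout nothing to hook into until that call returns, whereas splitting the work into task calls gives Prefect regular boundaries where the timeout can be enforced.
```
from time import sleep

from prefect import flow, task


# Hypothetical stand-in for the real work; name and duration are made up.
@task
def process_chunk(i: int) -> None:
    sleep(60)  # pretend this is one unit of real work


# Variant A: one long blocking call in the flow body. Per the docs note
# ("Flow execution may continue until the next task is called"), the 4.5 h
# timeout may not interrupt this until the call returns.
@flow(timeout_seconds=16200)
def monolithic_flow() -> None:
    sleep(9 * 60 * 60)  # hours of work with no task boundaries


# Variant B: the same work broken into task calls, so there are regular
# points between tasks where the timeout can take effect.
@flow(timeout_seconds=16200)
def chunked_flow(n_chunks: int = 270) -> None:
    for i in range(n_chunks):
        process_chunk(i)


if __name__ == "__main__":
    chunked_flow(n_chunks=3)
```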
Bianca:
Hi Brian! What kind of infrastructure are you using for your flow runs? In some cases where this has been reported, the underlying infrastructure failed, which can result in flows being stuck in a
Running
state indefinitely. A feature called Runner Heartbeats was introduced in version 3.1.8 which could help. It requires a little bit of setup at first, which is outlined in the release notes.
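For reference, a rough sketch of what that setup might look like once you're on 3.1.8+ and deploying to a Kubernetes work pool. The deployment name, image, and work pool are placeholders, and it assumes the heartbeat frequency is controlled by the PREFECT_RUNNER_HEARTBEAT_FREQUENCY setting described in the release notes, so double-check the exact name and any server-side pieces the notes call out:
```
from prefect import flow


@flow(timeout_seconds=16200)
def my_long_flow() -> None:
    ...


if __name__ == "__main__":
    # Hypothetical deployment: the work pool, image, and 30 s frequency are
    # placeholders. The env var name assumes the setting introduced in 3.1.8;
    # confirm it against the release notes.
    my_long_flow.deploy(
        name="my-long-flow",
        work_pool_name="gke-pool",
        image="us-docker.pkg.dev/my-project/my-repo/my-flow:latest",
        job_variables={"env": {"PREFECT_RUNNER_HEARTBEAT_FREQUENCY": "30"}},
    )
```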
Brian:
Hey Bianca, we're running a Prefect worker (2.20.4) in our Google Kubernetes Engine cluster.
Bianca:
Gotcha. The switch from 2.0 to 3.0 should be pretty straightforward since you're already using a worker, and that way you can take advantage of the heartbeats feature. As for why the timeouts defined in the flow decorator aren't being enforced, my suspicion is that the pods could have been evicted or restarted during the flow run's execution.
Brian:
Yeah, I thought the same, but I don't see any restarts and the pod is still in a running state. I may try upgrading to 3.x though.
Bianca:
Ah, good thing you checked. Yup, try the 3.0 upgrade. Another thing you could try (at least while you're still running 2.0) is setting up an automation to enforce an SLA, i.e. flow runs that are in a
Running
state for longer than 5 hours get marked as failed. You can add an additional action on top that sends you a notification whenever this occurs.
That way, even if the proverbial infrastructure rug is pulled out from under your flow run, the server is able to monitor the flow run and handle the state.
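Automations themselves are usually configured in the UI, but as a rough sketch of the same SLA idea in code (not the Automations feature itself), something like the script below could run on a schedule and fail any run that has been Running past the threshold. The 5-hour cutoff mirrors the example above; the specific client filter classes are what I'd expect them to be, so verify them against the client docs for your Prefect version:
```
import asyncio
from datetime import datetime, timedelta, timezone

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterStartTime,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Failed

SLA = timedelta(hours=5)  # threshold from the example above


async def fail_stuck_runs() -> None:
    cutoff = datetime.now(timezone.utc) - SLA
    async with get_client() as client:
        # Runs still in Running that started before the cutoff.
        stuck = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                ),
                start_time=FlowRunFilterStartTime(before_=cutoff),
            )
        )
        for run in stuck:
            # force=True so the server accepts the transition even if the
            # (possibly dead) runner never reports back.
            await client.set_flow_run_state(
                flow_run_id=run.id, state=Failed(), force=True
            )


if __name__ == "__main__":
    asyncio.run(fail_stuck_runs())
```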
Brian:
Interestingly, I had tried setting up an automation before using the flow timeout, and it was just not being triggered consistently either. Screenshot of what I had.
To be fair, my previous long runs were running into an issue with 410 exceptions where it seemed like the Prefect flow was losing track of the associated pod.
So I was thinking perhaps that is why the automation wasn't working.
But now that I've upgraded to 2.20.4, I'm no longer seeing the 410s.
Bianca:
That could very well be why that automation wasn't working to begin with. It may be worth re-creating it. If it still doesn't work, a bug report would be appreciated 🙏