# ask-community
f
Hi, we just had two flows become stuck without any apparent reason. All tasks had finished, but the flow remained in a Running state, therefore blocking others. Any idea how to debug what was going on? Manually setting the state resolved the problem.
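(For reference, a minimal sketch of how a stuck run's state can be set programmatically via the Client rather than through the UI; this assumes Prefect 1.x, and the flow run ID is a placeholder.)
```python
# Minimal sketch, assuming Prefect 1.x: mark a stuck flow run as finished
# programmatically instead of setting the state in the UI.
from prefect import Client
from prefect.engine.state import Success

client = Client()
client.set_flow_run_state(
    flow_run_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    state=Success(message="All tasks finished; state set manually"),
)
```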
a
It’s hard to tell the root cause without seeing the logs, but usually when a flow run is stuck in a Running state, it might be a flow heartbeat issue. This thread explains the issue and shows some possible solutions you may try.
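One workaround frequently suggested in that context is switching the flow heartbeat from subprocess to thread mode via the run config; a minimal sketch, assuming Prefect 1.x (flow and task names are made up):
```python
# Sketch, assuming Prefect 1.x: run heartbeats in a thread instead of the
# default subprocess, a common mitigation for runs stuck in Running.
from prefect import Flow, task
from prefect.run_configs import UniversalRun

@task
def do_work():
    return 1

with Flow("example-flow") as flow:
    do_work()

# The env var below switches the heartbeat mode for this flow's runs.
flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
```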
What do you mean by blocking others? Generally speaking, even if one flow run takes longer than usual or is stuck, it’s not blocking future scheduled runs since flow runs are independent of each other.
f
We have limits for some tasks that access certain databases etc., that's why it was blocking.
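(To illustrate, a rough sketch of how such task concurrency limits are typically attached, assuming Prefect 1.x; the tag name is made up, and the limit itself is configured on that tag in Prefect Cloud.)
```python
# Sketch, assuming Prefect 1.x: tasks that hit a shared database carry a tag,
# and the concurrency limit is configured for that tag in Prefect Cloud.
from prefect import task

@task(tags=["shared-db"])  # "shared-db" is a hypothetical tag name
def query_database():
    ...
```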
a
Gotcha, that makes sense.
f
Looking at the docs, the Zombie Killer kicks in after 2 minutes. But as you can see in the screenshot, the process was still running for much longer after being finished. Am I overlooking something?
I think Lazarus should also have come by at least once?
a
That’s roughly correct: Prefect Core sends heartbeats for registered flows to the API every 30 seconds. These heartbeats are used to confirm that the flow run and its task runs are healthy, and runs missing 4 heartbeats in a row (i.e. roughly the 2 minutes you mentioned) are marked as failed by the Zombie Killer. Lazarus is more relevant for flow runs stuck in a Submitted state, e.g. when a Kubernetes pod for a flow run cannot be spun up.
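If it helps, the last received heartbeat can also be inspected directly; a hedged sketch, assuming Prefect 1.x and that the GraphQL schema exposes a heartbeat field on flow_run (the flow run ID is a placeholder):
```python
# Hedged sketch, assuming Prefect 1.x and a `heartbeat` field on flow_run
# in the GraphQL schema: check when a run last sent a heartbeat.
from prefect import Client

client = Client()
result = client.graphql(
    """
    query {
      flow_run(where: {id: {_eq: "00000000-0000-0000-0000-000000000000"}}) {
        id
        state
        heartbeat
      }
    }
    """
)
print(result)
```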
f
So does that mean we can rule out a heartbeat issue?
a
On the contrary, it’s the most likely issue here. Did you check the logs? Is there any mention of the flow's heartbeat in the flow run logs?
I would encourage you to try the steps shared in the thread above
f
No, there is nothing about any problems in the logs. We will try what you suggested should we see this again, but to me this looks like a different problem so far.
a
If you don’t find a solution and it happens again, feel free to share your flow definition. Sometimes when DB connections or HTTP clients are used in the wrong way, it may cause a similar issue too, especially with mapping.
f
I just saw we also have an automation in place to cancel runs of that flow that take longer than 600 s to finish. But apparently this also did not do anything here. To me this looks more and more like something going on in Prefect Cloud.
a
Can you DM me your flow run ID and the flow definition? I could then look at it in more detail
Thanks for sharing the flow run ID. Can you additionally share the run config and executor to check if this may be somehow related to a specific infrastructure issue?
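(For anyone following along, a minimal sketch of the kind of run config and executor information being asked for here, assuming Prefect 1.x; the specific classes and arguments are placeholders.)
```python
# Sketch, assuming Prefect 1.x: the run config and executor attached to a flow
# determine where and how its runs execute; these choices are placeholders.
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

@task
def do_work():
    return 1

with Flow("example-flow") as flow:
    do_work()

flow.run_config = KubernetesRun(image="prefecthq/prefect:latest")
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)
```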