# ask-community
f
Hi, we just had two flows become stuck without any apparent reason. All tasks had finished, but the flow remained in a Running state, therefore blocking others. Any idea how to debug what was going on? Manually setting the state resolved the problem.
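(For reference, a minimal sketch of how a stuck run's state can be set programmatically via the Client rather than through the UI; this assumes Prefect 1.x, and the flow run ID is a placeholder.)
```python
# Minimal sketch, assuming Prefect 1.x: mark a stuck flow run as finished
# programmatically instead of setting the state in the UI.
from prefect import Client
from prefect.engine.state import Success

client = Client()
client.set_flow_run_state(
    flow_run_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    state=Success(message="All tasks finished; state set manually"),
)
```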
a
It’s hard to tell the root cause without seeing the logs, but usually when a flow run is stuck in a Running state, it might be a flow heartbeat issue. This thread explains the issue and shows some possible solutions you may try.
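One workaround frequently suggested in that context is switching the flow heartbeat from subprocess to thread mode via the run config; a minimal sketch, assuming Prefect 1.x (flow and task names are made up):
```python
# Sketch, assuming Prefect 1.x: run heartbeats in a thread instead of the
# default subprocess, a common mitigation for runs stuck in Running.
from prefect import Flow, task
from prefect.run_configs import UniversalRun

@task
def do_work():
    return 1

with Flow("example-flow") as flow:
    do_work()

# The env var below switches the heartbeat mode for this flow's runs.
flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
```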
What do you mean by blocking others? Generally speaking, even if one flow run takes longer than usual or is stuck, it’s not blocking future scheduled runs since flow runs are independent of each other.
f
We have limits for some tasks that access certain databases etc., that's why it was blocking.
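(To illustrate, a rough sketch of how such task concurrency limits are typically attached, assuming Prefect 1.x; the tag name is made up, and the limit itself is configured on that tag in Prefect Cloud.)
```python
# Sketch, assuming Prefect 1.x: tasks that hit a shared database carry a tag,
# and the concurrency limit is configured for that tag in Prefect Cloud.
from prefect import task

@task(tags=["shared-db"])  # "shared-db" is a hypothetical tag name
def query_database():
    ...
```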
a
Gotcha, that makes sense.
f
Looking at the docs, the Zombie Killer kicks in after 2 minutes. But as you can see in the screenshot, the process was still running for much longer after being finished. Am I overlooking something?
I think Lazarus should also have come by at least once?
a
That’s roughly correct: Prefect Core sends heartbeats for registered flows to the API every 30 seconds. These heartbeats are used to confirm that the flow run and its task runs are healthy, and runs missing 4 heartbeats in a row (i.e. roughly the 2 minutes you mentioned) are marked as failed by the Zombie Killer. Lazarus is more relevant for flow runs stuck in a Submitted state, e.g. when a Kubernetes pod for a flow run cannot be spun up.
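If it helps, the last received heartbeat can also be inspected directly; a hedged sketch, assuming Prefect 1.x and that the GraphQL schema exposes a heartbeat field on flow_run (the flow run ID is a placeholder):
```python
# Hedged sketch, assuming Prefect 1.x and a `heartbeat` field on flow_run
# in the GraphQL schema: check when a run last sent a heartbeat.
from prefect import Client

client = Client()
result = client.graphql(
    """
    query {
      flow_run(where: {id: {_eq: "00000000-0000-0000-0000-000000000000"}}) {
        id
        state
        heartbeat
      }
    }
    """
)
print(result)
```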
f
So does that mean we can rule out a heartbeat issue?
a
On the contrary, it’s the most likely issue here. Did you check the logs? Is there any mention of the flow's heartbeat in the flow run logs?
I would encourage you to try the steps shared in the thread above
f
No, there is nothing about any problems in the logs. We will try what you suggested should we see this again, but to me this looks like a different problem so far.
a
If you don’t find a solution and it happens again, feel free to share your flow definition. Sometimes when DB connections or HTTP clients are used in the wrong way, it may cause a similar issue too, especially with mapping.
f
I just saw we also have an automation in place to cancel runs of that flow that take longer than 600 s to finish. But apparently this also did not do anything here. To me this looks more and more like something going on in Prefect Cloud.
a
Can you DM me your flow run ID and the flow definition? I could then look at it in more detail
Thanks for sharing the flow run ID. Can you additionally share the run config and executor to check if this may be somehow related to a specific infrastructure issue?
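(For anyone following along, a minimal sketch of the kind of run config and executor information being asked for here, assuming Prefect 1.x; the specific classes and arguments are placeholders.)
```python
# Sketch, assuming Prefect 1.x: the run config and executor attached to a flow
# determine where and how its runs execute; these choices are placeholders.
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

@task
def do_work():
    return 1

with Flow("example-flow") as flow:
    do_work()

flow.run_config = KubernetesRun(image="prefecthq/prefect:latest")
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)
```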