Prefect Community #ask-community

Marie

07/13/2021, 2:55 PM

Hi! Whenever the instance running Prefect tasks crashes, heartbeats are no longer detected and the run should fail. Instead of this expected behavior, only the task currently running fails; the next ones stay Pending and the run shows as In progress until I manually cancel it. Has anyone already run into this issue?


No heartbeat detected from the remote task; marking the run as failed.

Kevin Kho

07/13/2021, 2:58 PM

Hi @Marie, what agent are you using? Docker agent/ Kubernetes agent?

Marie

07/13/2021, 2:58 PM

A docker agent

Kevin Kho

07/13/2021, 3:16 PM

Sorry, I’m a bit confused. What do you mean by the server being disconnected? Like Prefect Server stops running? How would you see the state of the tasks without it?

Marie

07/16/2021, 10:11 AM

Sorry, I was referring to the server running the Prefect Docker agent, not Prefect Server. I've updated my message above to reflect that.

Marie

07/16/2021, 10:15 AM

So whenever the process running the flow misses 4 heartbeats in a row, the current task fails but the flow does not. All of the downstream tasks stay in Pending until someone manually cancels the flow.

Marie

07/16/2021, 2:38 PM

Hi @Kevin Kho, I added more details above; hopefully they help explain the issue.

Kevin Kho

07/16/2021, 2:57 PM

The Zombie Killer is the process that marks tasks as failed based on the heartbeat. The flow run won’t be failed by the Zombie Killer, only the task run. The Lazarus process is the one responsible for rescheduling or restarting that work, but if the agent dies, then the work can’t be restarted or monitored. Our best advice here is that you might need to give your agent more resources to prevent it from dying, so that it can start the work again.

Marie

07/16/2021, 5:50 PM

Do you have plans to allow the Zombie Killer to fail the entire flow? Unfortunately, the server on which the agent runs can be restarted at random times. Since the flows still show as ongoing and there is a maximum number of flows that can run at the same time, this prevents new flows from starting when the server comes back online.

Kevin Kho

07/16/2021, 5:55 PM

Are you on Prefect Server or Cloud?

Marie

07/19/2021, 10:23 AM

The docker agents are on our own servers

Kevin Kho

07/19/2021, 1:00 PM

Oh, I meant the actual Prefect orchestration, not the agents. Do you view the UI at cloud.prefect.io, or do you use something like port 8080 of a machine?

Marie

07/19/2021, 1:12 PM

Oh, sorry, I use the UI in the cloud

Kevin Kho

07/19/2021, 1:27 PM

So you can set Automations that mark flow runs as Failed if they have gone beyond a certain amount of time (this depends on what tier you are on). Since you mentioned you have flow concurrency, I think you should have Automations in your tier. Have you seen this before? You could also use a startup script on that server that hits the GraphQL API to get the flow runs that are left hanging and fail them that way.
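
A minimal sketch of the startup-script idea, not an official recipe. It assumes the GraphQL API exposes a `flow_run` query that can be filtered on `state` (Hasura-style); that query shape, and the helper names `fail_hanging_flow_runs`, `run_query`, and `fail_run`, are assumptions made for illustration and should be checked against the actual schema. The transport and the state change are passed in as callables so the logic stays independent of any particular client. With the Prefect client of that time, `run_query` could presumably wrap `prefect.Client().graphql` and `fail_run` could wrap `Client.set_flow_run_state` with a `Failed` state, but that wiring is untested here.

```python
from typing import Any, Callable, Dict, List

# Assumed query shape: a Hasura-style filter on flow_run.state.
# Verify the field and filter names against your API's schema.
RUNNING_FLOW_RUNS_QUERY = """
query {
  flow_run(where: {state: {_eq: "Running"}}) {
    id
    name
  }
}
"""


def fail_hanging_flow_runs(
    run_query: Callable[[str], Dict[str, Any]],
    fail_run: Callable[[str], None],
) -> List[str]:
    """Fail every flow run the API still reports as Running.

    run_query: hypothetical transport; takes a GraphQL query string and
               returns the decoded response as a dict with a "data" key.
    fail_run:  hypothetical callback that marks one flow run as Failed,
               given its id.
    Returns the ids of the flow runs that were passed to fail_run.
    """
    response = run_query(RUNNING_FLOW_RUNS_QUERY)
    hanging = response["data"]["flow_run"]

    failed_ids = []
    for flow_run in hanging:
        fail_run(flow_run["id"])
        failed_ids.append(flow_run["id"])
    return failed_ids
```

Run right after the server reboots, before the agent is started again, anything still reported as Running on that machine is by definition left over from before the crash. Note that this sketch fails every Running flow run it finds; if other agents on other servers share the same account, the query would need an extra filter so that their healthy runs are left alone.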

Marie

07/19/2021, 1:57 PM

Yes, I found the Automations! Thank you! I'll look into the GraphQL API when I have more time. It looks like that option would be cleaner and wouldn't depend at all on flow processing time.

👍 1
