# ask-community
m
Hi! Whenever the instance running Prefect tasks crashes, heartbeats are no longer detected and the run should fail. Instead of this expected behavior, only the task currently running fails; the next ones are still `Pending` and the run shows as `In progress` until I manually cancel it. Has anyone run into this issue before?
No heartbeat detected from the remote task; marking the run as failed.
k
Hi @Marie, what agent are you using? Docker agent/ Kubernetes agent?
m
A docker agent
k
Sorry, I’m a bit confused. What do you mean by the server being disconnected? Like Prefect Server stops running? How would you see the state of the tasks without it?
m
Sorry, I was referring to the server running the Prefect Docker agent, not Prefect Server. I've updated my message above to reflect that.
So whenever the process running the flow misses 4 heartbeats in a row, the current task fails but the flow does not. All of the downstream tasks stay in `Pending` until someone manually cancels the flow.
Hi @Kevin Kho, I added more details above, hopefully it helps in understanding the issue.
k
The Zombie Killer is the process that marks task runs as failed when their heartbeat is missing. It only fails the task run, not the flow run. The Lazarus process is the one responsible for rescheduling or restarting that work, but if the agent dies, the work can’t be restarted or monitored. Our best advice here is that you might need to give your agent more resources to prevent it from dying, so it can start the work again.
m
Do you have plans to allow the Zombie Killer to fail the entire flow? Unfortunately, the server on which the agent runs can be restarted at random times. Since those flows still count as ongoing and there is a maximum number of flows that can run at the same time, this prevents new flows from starting when the server comes back online.
k
Are you on Prefect Server or Cloud?
m
The docker agents are on our own servers
k
Oh, I meant the actual Prefect orchestration layer, not the agents. Do you view the UI at cloud.prefect.io, or do you use something like port 8080 of a machine?
m
Oh, sorry, I use the UI in the cloud
k
So you can set Automations that mark flow runs as Failed if they have run beyond a certain amount of time (this depends on what tier you are on). Since you mentioned you have flow concurrency, I think you should have Automations in your tier. Have you seen this before? You could also use a startup script on that server that hits the GraphQL API to get the flow runs that are left hanging and fail them that way.
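A rough sketch of what that startup script could look like, assuming the Prefect 1.x Python client (`Client.graphql` and `Client.set_flow_run_state`) and assuming this agent is the only thing executing flow runs for the tenant. The helper names `fail_hanging_flow_runs` and `main` are made up for illustration, and the `where` filter will likely need narrowing (for example by project or labels) so it doesn't fail runs that are healthy elsewhere.

```python
# Sketch only: fail every flow run still reported as Running after the
# agent host has rebooted. Names below are illustrative, not Prefect APIs,
# except for client.graphql() and client.set_flow_run_state().

RUNNING_FLOW_RUNS_QUERY = """
query {
  flow_run(where: {state: {_eq: "Running"}}) {
    id
    name
  }
}
"""


def fail_hanging_flow_runs(client, make_failed_state):
    """Mark every flow run returned by the query as failed.

    `client` needs `graphql(query)` and `set_flow_run_state(...)`;
    `make_failed_state` turns a message into a Failed state object.
    Returns the ids of the flow runs that were updated.
    """
    result = client.graphql(RUNNING_FLOW_RUNS_QUERY)
    failed_ids = []
    for run in result["data"]["flow_run"]:
        state = make_failed_state("Agent host restarted; flow run was left hanging.")
        client.set_flow_run_state(flow_run_id=run["id"], state=state)
        failed_ids.append(run["id"])
    return failed_ids


def main():
    # Imported lazily so the helper above can be exercised without Prefect.
    from prefect import Client
    from prefect.engine.state import Failed

    ids = fail_hanging_flow_runs(Client(), lambda msg: Failed(message=msg))
    print(f"Marked {len(ids)} flow run(s) as Failed")
```

Calling `main()` from the server's boot sequence, before the agent starts, should free up the concurrency slots held by the dead runs. The client picks up the API key from the usual Prefect configuration on that machine.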
m
Yes, I found the Automations! Thank you! I'll look into the GraphQL API when I have more time. It looks like this option would be cleaner and wouldn't depend at all on flow processing time.