Hi! I have a flow running on preemptible nodes on ...
# ask-community
c
Hi! I have a flow running on preemptible nodes on GKE, which looks to have been preempted and subsequently caught by the Zombie Killer. My flow was then marked as failed but I expected the Lazarus process to pick it up and restart the flow after 10 minutes. As seen in the screenshot the Zombie killer fails the flow at 13:48 and then nothing until I manually restart the flow via the Prefect Cloud UI at 14:45 about an hour later. I confirmed that Lazarus is enabled for this flow. Is there anything else I need to do to have Lazarus pick this up?
Here are the same logs at the flow level rather than the task level in case useful
k
Hi @Cab Maddux! I’ll look into this for you. Can you tell me what Agent you are using?
c
Thanks Kevin, we're using the k8s agent
k
I may have to get back to you on this on Monday, but I won’t forget
Hi @Cab Maddux, Lazarus will not pick up failed flows. Lazarus picks up a flow run for two reasons: 1. it's in a submitted state for a long time, or 2. it's in a running state and stops sending a heartbeat for some period of time
c
Thanks @Kevin Kho yeah, my thought is that this falls under case #2, right? (I've attached where in the docs it looks like in this case the Lazarus process should be restarting this flow - due to a No Heartbeat error)
Unless I've misunderstood something here
k
Yeah, I’ll need to get some clarification from some team members that are already out. Will get back on Monday for sure.
c
No worries, thanks!
n
Hi @Cab Maddux - can you confirm that your flow was in a `Running` state when the zombie task was marked as `Failed`? The Lazarus process will only restart zombie task runs if the flow is still running
(I'm having a little trouble deciphering that from the logs you posted, apologies if you've already checked this!)
c
Hi @nicholas yeah I think the flow was in a running state when the zombie task was marked as failed. If you look at the first screenshot in this thread, it looks like the `basecall_fast5_file` task finished in a running state (I'm assuming when the underlying node was preempted) and then the zombie killer catches it 2 minutes, 25 seconds later (I think heartbeats are every 30 seconds, right? So it would sort of make sense that that's 4 heartbeats missed)
Everything I can tell seems to indicate the flow is in a running state when the zombie task comes in
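As a quick check of the heartbeat arithmetic above, here is a minimal sketch. The 30-second interval is only my assumption from this thread, not a confirmed setting, and the variable names are made up for illustration:

```python
# Assumed values taken from the message above, not from Prefect itself.
HEARTBEAT_INTERVAL_S = 30          # assumption: one heartbeat every 30 seconds
elapsed_s = 2 * 60 + 25            # 2 minutes, 25 seconds between last activity and the zombie killer

# Number of full heartbeat intervals that elapsed with no heartbeat received.
missed_heartbeats = elapsed_s // HEARTBEAT_INTERVAL_S

print(missed_heartbeats)  # 4
```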
n
Interesting, let me see if I can recreate this
Ah I apologize @Cab Maddux - I misspoke earlier and the behavior you saw was indeed correct. The Lazarus service doesn't re-submit failed task runs; instead, it handles re-scheduling (or failing, after 10 tries) flow runs that are in running or submitted states and have no pending or running tasks. Without a fuller picture of the run I can't tell exactly why your flow run was marked as failed, but if you had other task runs pending/running at the time, Lazarus wouldn't pick up your run, and the run could be failed for other reasons (including because your task was failed by the zombie killer)
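To make the rule in this message concrete, here is a rough sketch of the decision as described. This is not Prefect's actual implementation: `FlowRun`, `lazarus_action`, and `MAX_LAZARUS_ATTEMPTS` are hypothetical names, and treating "after 10 tries" as "10 or more prior attempts" is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: "failing, after 10 tries" means fail once 10 attempts have been made.
MAX_LAZARUS_ATTEMPTS = 10


@dataclass
class FlowRun:
    """Hypothetical stand-in for a flow run and the states of its task runs."""
    state: str
    task_states: List[str] = field(default_factory=list)
    lazarus_attempts: int = 0


def lazarus_action(run: FlowRun) -> str:
    """Return what Lazarus would do with a run, per the description above."""
    # Lazarus only considers flow runs that are Running or Submitted.
    if run.state not in {"Running", "Submitted"}:
        return "ignore"
    # If any task run is still Pending or Running, the flow run is left alone.
    if any(s in {"Pending", "Running"} for s in run.task_states):
        return "ignore"
    # Otherwise the run is rescheduled, or failed once the retry budget is spent.
    if run.lazarus_attempts >= MAX_LAZARUS_ATTEMPTS:
        return "fail"
    return "reschedule"


print(lazarus_action(FlowRun("Failed", ["Failed"])))              # ignore
print(lazarus_action(FlowRun("Running", ["Failed", "Running"])))  # ignore
print(lazarus_action(FlowRun("Running", ["Failed", "Success"])))  # reschedule
print(lazarus_action(FlowRun("Submitted", [], 10)))               # fail
```

Under this reading, a flow run that has already been moved to a failed state is never a Lazarus candidate, which would match what was observed here.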
c
Hi @nicholas thanks so much, I'll do some additional digging on my end. Just to clarify, if we have a flow running on k8s fine and the underlying node gets preempted, we should expect that Lazarus would get that flow run restarted, right? Seems like that's a primary use case for Lazarus. If that's correct, sounds like we should expect to find something other than just preemption leading to what we're seeing
n
That sounds correct to me @Cab Maddux - let us know what you find and we can look to confirm further 👍