Thread
#prefect-community
    Cab Maddux

    1 year ago
    Hi! I have a flow running on preemptible nodes on GKE, which looks to have been preempted and subsequently caught by the Zombie Killer. My flow was then marked as failed but I expected the Lazarus process to pick it up and restart the flow after 10 minutes. As seen in the screenshot, the Zombie Killer fails the flow at 13:48 and then nothing happens until I manually restart the flow via the Prefect Cloud UI at 14:45, about an hour later. I confirmed that Lazarus is enabled for this flow. Is there anything else I need to do to have Lazarus pick this up?
    Here are the same logs at the flow level rather than the task level in case useful
    Kevin Kho

    1 year ago
    Hi @Cab Maddux! I’ll look into this for you. Can you tell me what Agent you are using?
    Cab Maddux

    1 year ago
    Thanks Kevin, we're using the k8s agent
    Kevin Kho

    1 year ago
    I may have to get back to you on this on Monday, but I won’t forget
    Hi @Cab Maddux, Lazarus will not pick up failed flows. Lazarus picks up a flow run for two reasons: 1. it's in a submitted state for a long time, or 2. it's in a running state and stops sending a heartbeat for some period of time.
    Cab Maddux

    1 year ago
    Thanks @Kevin Kho yeah, my thought is that this falls under case #2, right? (I've attached where in the docs it looks like in this case the Lazarus process should be restarting this flow - due to a No Heartbeat error)
    Unless I've misunderstood something here
    Kevin Kho

    1 year ago
    Yeah, I’ll need to get some clarification from some team members that are already out. Will get back on Monday for sure.
    Cab Maddux

    1 year ago
    No worries, thanks!
    nicholas

    1 year ago
    Hi @Cab Maddux - can you confirm that your flow was in a Running state when the zombie task was marked as Failed? The Lazarus process will only restart zombie task runs if the flow is still running
    (I'm having a little trouble deciphering that from the logs you posted, apologies if you've already checked this!)
    Cab Maddux

    1 year ago
    Hi @nicholas yeah I think the flow was in a running state when the zombie task was marked as failed. If you look at the first screenshot in this thread it looks like the basecall_fast5_file task finished in a running state (I'm assuming when the underlying node was preempted) and then the zombie killer catches it 2 minutes, 25 seconds later (I think heartbeats are every 30 seconds, right? So it would sort of make sense that that's 4 heartbeats missed)
    Everything I can tell seems to indicate the flow is in a running state when the zombie task comes in
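The heartbeat arithmetic in the message above can be checked with a quick sketch. The 30-second interval is the poster's own guess, not a confirmed Prefect setting, and the variable names are only illustrative.

```python
# Assumption: heartbeats are sent every 30 seconds (the poster's guess,
# not a value confirmed anywhere in this thread).
HEARTBEAT_INTERVAL_S = 30

# Gap between the task's last Running log entry and the Zombie Killer
# marking it failed, as read off the screenshot: 2 minutes, 25 seconds.
gap_s = 2 * 60 + 25

# Number of whole heartbeat intervals that elapsed in that gap.
missed_heartbeats = gap_s // HEARTBEAT_INTERVAL_S
print(missed_heartbeats)
```

With a 145-second gap, four full 30-second intervals pass, which matches the "4 heartbeats missed" estimate.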
    nicholas

    1 year ago
    Interesting, let me see if I can recreate this
    Ah I apologize @Cab Maddux - I misspoke earlier and the behavior you saw was indeed correct. The Lazarus service doesn't re-submit failed task runs; instead, it handles re-scheduling (or failing, after 10 tries) flow runs that are in running or submitted states and have no pending or running tasks. Without a fuller picture of the run I can't tell exactly why your flow run was marked as failed, but if you had other task runs pending/running at the time, Lazarus wouldn't pick up your run, and the run could be failed for other reasons (including because your task was failed by the zombie killer)
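The rule described in the message above can be written down as a small decision function. This is only a sketch of the behavior as stated in this thread, not Prefect's actual implementation: `FlowRun`, `lazarus_action`, `MAX_RESCHEDULES`, and the state strings are hypothetical names chosen for illustration, and the exact point at which the 10-try limit triggers is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# "after 10 tries" per the message above; the exact boundary is an assumption.
MAX_RESCHEDULES = 10


@dataclass
class FlowRun:
    """Hypothetical, simplified view of a flow run (not a Prefect class)."""
    state: str                                   # e.g. "Running", "Submitted", "Failed"
    task_states: List[str] = field(default_factory=list)
    reschedule_count: int = 0


def lazarus_action(run: FlowRun) -> str:
    """Return what Lazarus would do with this run, per the description above."""
    # Lazarus only considers flow runs that are still Running or Submitted.
    if run.state not in ("Running", "Submitted"):
        return "ignore"
    # A run with any pending or running task is treated as still alive.
    if any(s in ("Pending", "Running") for s in run.task_states):
        return "ignore"
    # Too many earlier reschedule attempts: fail the run instead.
    if run.reschedule_count >= MAX_RESCHEDULES:
        return "fail"
    return "reschedule"
```

Under this reading, a flow run that has already been marked Failed (for example after the zombie killer fails one of its tasks) returns "ignore", which is consistent with the run in this thread not being restarted.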
    Cab Maddux

    1 year ago
    Hi @nicholas thanks so much, I'll do some additional digging on my end. Just to clarify, if we have a flow running on k8s fine and the underlying node gets preempted, we should expect that Lazarus would get that flow run restarted, right? Seems like that's a primary use case for Lazarus. If that's correct, sounds like we should expect to find something other than just preemption leading to what we're seeing
    nicholas

    1 year ago
    That sounds correct to me @Cab Maddux - let us know what you find and we can look to confirm further 👍