    Cab Maddux

    2 years ago
    Hi Prefect, I'm finding flows that have tasks that fail with 'No Heartbeat Detected' while the flow itself continues running (you can see in the attached screenshots that the heartbeat was lost around 1:15 AM, but the flow continued running until it was manually marked as failed ~7 hours later). I believe that under the previous Zombie Killer behavior the flow would have been marked as failed immediately. Is this an expected change in behavior?

    Alex Cano

    2 years ago
    Are you running with Prefect Cloud or the Server option? The Server option does not yet have a Zombie Killer implemented, so this is expected behavior for that. If this is Cloud, someone else will need to speak to that!

    Cab Maddux

    2 years ago
    @Alex Cano we are running with Prefect Cloud, so screenshots are from prefect.io

    Chris White

    2 years ago
    Hi @Cab Maddux - the zombie killer no longer marks Flow runs as failed, only task runs. I’m very surprised that Lazarus didn’t pick this up and mark it as failed automatically - do you happen to have the Lazarus process turned off? (You can check on the Flow Page > Settings)

    Cab Maddux

    2 years ago
    Thanks @Chris White, we had turned the Lazarus process off. Since we were still having issues with the Zombie Killer, we were trying to move towards simply having the flow fail (without any Prefect-driven retries via the Lazarus process) if a task is zombie killed, and we have implemented flow-level retries on our end based on that expected behavior. We'll give turning the Lazarus process on another try. That said, with the zombie killer issue we've seen here, it feels a little strange that the task was marked as failed but no other tasks even attempted to run (if they had, they would all have hit a TriggerFailed situation and the flow would have failed). It seems that some flow-level state other than Running would make sense if no other tasks are going to run.
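
The flow-level retries mentioned in this message could look roughly like the sketch below. It is only an illustration in plain Python: `run_flow` is a hypothetical callable standing in for whatever triggers a flow run and waits for its final state, and the state strings and `max_attempts` value are assumptions, not Prefect's API.

```python
def run_with_retries(run_flow, max_attempts=3):
    """Call run_flow until it reports "Success" or attempts run out.

    run_flow is a hypothetical callable returning a final state string.
    Returns (final_state, attempts_used).
    """
    state = None
    for attempt in range(1, max_attempts + 1):
        state = run_flow()
        if state == "Success":
            return state, attempt
    return state, max_attempts


# Stand-in for a flow that fails twice (e.g. a zombie-killed task) and then succeeds.
outcomes = iter(["Failed", "Failed", "Success"])
final_state, attempts = run_with_retries(lambda: next(outcomes))
```

This kind of wrapper only works if a failed task run actually ends the flow run, which is the behavior being asked about here.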

    Chris White

    2 years ago
    Gotcha - so remember that the zombie killer identifies tasks that have stopped reporting back; in order for their downstream dependencies to be triggered, the process running your tasks has to be “resurrected”, which is what the Lazarus service takes care of.
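
As a rough illustration of the behavior described in this message, the toy model below marks task runs as failed when their heartbeat goes stale and leaves the flow-run state alone. It is a sketch under stated assumptions, not Prefect's implementation: the timeout value, the dictionary layout, the task names, and the calendar date are all made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical threshold chosen for the example; not Prefect's actual setting.
HEARTBEAT_TIMEOUT = timedelta(minutes=2)


def find_zombies(task_runs, now, timeout=HEARTBEAT_TIMEOUT):
    """Return names of Running task runs whose last heartbeat is older than timeout."""
    return [
        name
        for name, run in task_runs.items()
        if run["state"] == "Running" and now - run["last_heartbeat"] > timeout
    ]


def kill_zombies(task_runs, now, timeout=HEARTBEAT_TIMEOUT):
    """Mark stale task runs as Failed; the flow-run state is deliberately not touched."""
    zombies = find_zombies(task_runs, now, timeout)
    for name in zombies:
        task_runs[name]["state"] = "Failed"
    return zombies


# The date is arbitrary; the 1:15 AM heartbeat mirrors the report in this thread.
now = datetime(2020, 1, 1, 1, 20)
task_runs = {
    "extract": {"state": "Running", "last_heartbeat": datetime(2020, 1, 1, 1, 15)},
    "load": {"state": "Pending", "last_heartbeat": None},
}
flow_run_state = "Running"
killed = kill_zombies(task_runs, now)
```

In this model the downstream `load` task stays Pending and the flow run stays Running after the upstream task is killed, which matches the situation reported above; something else (the Lazarus service, per this message) has to restart the process before downstream tasks are evaluated.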

    Dylan

    2 years ago
    Hey @Cab Maddux, wanted to follow up on this. Did turning the Lazarus process back on resolve your issues?

    Cab Maddux

    2 years ago
    Hey Dylan, thanks for checking in and sorry for the long delay. Turning the Lazarus process on did allow these runs to fail rather than continuing to run. The flow now attempts to rerun (I think previously we could have runs marked as failed without any attempt to rerun by turning the zombie killer on and Lazarus off, which gave us a bit more control), but we can work with this.

    Dylan

    2 years ago
    @Cab Maddux What’s your preferred functionality? Would you like flows to fail and be re-run?