# ask-community
m
Hi folks, a question about Lazarus processes and running a dask-kubernetes executor via run configs …
I get that the purpose of the Lazarus process is to restart “distressed” flow runs at a constant interval of ~10 minutes. In our use case we are running dask-kubernetes on AWS EKS and deploying the scheduler in “local” mode, which doesn’t seem to play well with Lazarus (I started testing this today and I can’t get Lazarus to run, even after waiting 20 minutes on a distressed flow run).
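For reference, the setup looks roughly like this - a minimal sketch rather than our exact config (the image, pod spec, and adapt limits are placeholders), assuming Prefect 1.x with the classic dask-kubernetes KubeCluster:
```python
# Rough sketch of the setup (Prefect 1.x + classic dask-kubernetes); the image,
# pod spec, and adapt limits below are placeholders, not our production values.
from prefect import Flow
from prefect.run_configs import KubernetesRun
from prefect.executors import DaskExecutor
from dask_kubernetes import KubeCluster, make_pod_spec

with Flow("my-flow") as flow:
    ...  # tasks go here

# The job pod is created from the run config; with deploy_mode="local" the
# Dask scheduler runs inside that same job pod.
flow.run_config = KubernetesRun(image="prefecthq/prefect:latest")

# Worker pods are created on demand by KubeCluster; these are the pods that
# sometimes fail to schedule (compute/volumes/secrets not provisioned in time).
flow.executor = DaskExecutor(
    cluster_class=lambda: KubeCluster(
        make_pod_spec(image="prefecthq/prefect:latest"),
        deploy_mode="local",
    ),
    adapt_kwargs={"minimum": 2, "maximum": 10},
)
```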
So for a given flow run, two main pod types/specs are used: a job pod, where the scheduler also runs, and worker pods, where the Dask workers run. Our most common infrastructure-related issues occur when the cluster fails to schedule the worker pods, not the job pod (due to a failure to provision resources: compute/volumes/secrets …). Unfortunately, in this setup the Lazarus process is never triggered, since from what I can tell the flow run hangs in a “Running” state instead of a “Submitted” state.
I am wondering if someone else has experienced this - or if I am missing something here
j
Hi @Marwan Sarieddine - do you have heartbeats enabled on your flow?
m
Hi Jenny, I do, but I don’t see heartbeat failures being triggered
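For context, we haven’t disabled heartbeats anywhere; this is a minimal sketch of how the heartbeat mode would be set explicitly, assuming Prefect 1.x env-var configuration via the run config (the mode value shown is illustrative):
```python
# Minimal sketch (assuming Prefect 1.x): heartbeats are enabled by default;
# the mode can be set explicitly through the run config's environment variables.
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(
    env={
        # "process" is the default; "thread" is a lighter-weight alternative,
        # and "off" disables heartbeats entirely.
        "PREFECT__CLOUD__HEARTBEAT_MODE": "thread",
    },
)
```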
j
That was going to be my next question! So to double check - you're not seeing anything from Zombie Killer either? Do the tasks get started (enter a running state)?
m
The flow is stuck in a “Running” state - all the tasks are pending
j
Ah - that would explain why Zombie Killer doesn't get them.
Let me check if the team has any ideas on this one.
👍 1
m
hmm - as an update, this same behavior seems to take place when deploying the scheduler in remote mode
will verify this in a bit
👍 1
j
Hi Marwan, checked with the team and it sounds like Lazarus is behaving as expected here. If your flow runner is sending heartbeats, Lazarus has no way to know to step in. If you have a Cloud Flow SLA, you could use that for a run that does not finish after a certain time, e.g. in an automation that cancels the run.
m
thanks for taking the time to check with them. That’s unfortunate but explains things, given the CloudFlowRunner is connecting to Cloud because it starts running on the job pod …
just one thing: in the old way of doing things with Prefect Environments, the pod that runs the CloudFlowRunner is the second pod to start, and as such has a better experience with Lazarus
so the docs on Lazarus state:
"Where necessary, flow runs without submitted or running task runs will be rescheduled by the Lazarus process up to 10 times."
<https://docs.prefect.io/orchestration/concepts/services.html#how-does-it-work>
And sadly, in our case of no running tasks (i.e. all tasks remain in a Pending state) we can’t really rely on Lazarus …
So we do have access to flow-level SLAs via automations - it is far from ideal though, given that with Lazarus one would know something is up / get notified after ~10 minutes, whereas with a long-running flow one won’t know there is an issue until the full expected run time has elapsed (i.e. up to 60-120 minutes in our case)
j
Hmm... you're right about that section in the docs. Let me double check that as there is some mismatch of expectations.