# ask-community
m
Hi folks, a question about Lazarus processes and running a dask-kubernetes executor via run configs …
I get that the purpose of the Lazarus process is to restart “distressed” flow runs at a constant interval of ~10 minutes. In our use case we are running dask-kubernetes on AWS EKS and deploying the scheduler in “local” mode, which doesn’t seem to play well with Lazarus (I started testing this today and I can’t get Lazarus to run, even after waiting 20 minutes on a distressed flow run).
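For reference, the setup looks roughly like this - a minimal sketch rather than our exact config (the image, pod spec, and adapt limits are placeholders), assuming Prefect 1.x with the classic dask-kubernetes KubeCluster:
```python
# Rough sketch of the setup (Prefect 1.x + classic dask-kubernetes); the image,
# pod spec, and adapt limits below are placeholders, not our production values.
from prefect import Flow
from prefect.run_configs import KubernetesRun
from prefect.executors import DaskExecutor
from dask_kubernetes import KubeCluster, make_pod_spec

with Flow("my-flow") as flow:
    ...  # tasks go here

# The job pod is created from the run config; with deploy_mode="local" the
# Dask scheduler runs inside that same job pod.
flow.run_config = KubernetesRun(image="prefecthq/prefect:latest")

# Worker pods are created on demand by KubeCluster; these are the pods that
# sometimes fail to schedule (compute/volumes/secrets not provisioned in time).
flow.executor = DaskExecutor(
    cluster_class=lambda: KubeCluster(
        make_pod_spec(image="prefecthq/prefect:latest"),
        deploy_mode="local",
    ),
    adapt_kwargs={"minimum": 2, "maximum": 10},
)
```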
So for a given flow run, two main pod types/specs are used: a job pod, where the scheduler also runs, and worker pods, where the Dask workers run. Our most common infrastructure-related issues occur when the cluster fails to schedule the worker pods, not the job pod (due to a failure to provision resources: compute/volumes/secrets …). Unfortunately, in this setup the Lazarus process is never triggered, since from what I can tell the flow run hangs in a “Running” state instead of a “Submitted” state.
I am wondering if someone else has experienced this - or if I am missing something here
j
Hi @Marwan Sarieddine - do you have heartbeats enabled on your flow?
m
Hi Jenny, I do, but I don’t see heartbeat failures being triggered
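For context, we haven’t disabled heartbeats anywhere; this is a minimal sketch of how the heartbeat mode would be set explicitly, assuming Prefect 1.x env-var configuration via the run config (the mode value shown is illustrative):
```python
# Minimal sketch (assuming Prefect 1.x): heartbeats are enabled by default;
# the mode can be set explicitly through the run config's environment variables.
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(
    env={
        # "process" is the default; "thread" is a lighter-weight alternative,
        # and "off" disables heartbeats entirely.
        "PREFECT__CLOUD__HEARTBEAT_MODE": "thread",
    },
)
```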
j
That was going to be my next question! So to double check - you're not seeing anything from Zombie Killer either? Do the tasks get started (enter a running state)?
m
The flow is stuck in a “Running” state - all the tasks are pending
j
Ah - that would explain why Zombie Killer doesn't get them.
Let me check if the team has any ideas on this one.
👍 1
m
hmm - as an update, this same behavior seems to take place when deploying the scheduler in remote mode
will verify this in a bit
👍 1
j
Hi Marwan, checked with the team and it sounds like Lazarus is behaving as expected here. If your flow runner is sending heartbeats, Lazarus has no way to know to step in. If you have a Cloud Flow SLA, you could use that for a run that does not finish after a certain time, e.g. in an automation that cancels the run.
m
thanks for taking the time to check with them. That’s unfortunate but explains things, given the CloudFlowRunner is connecting to Cloud because it starts running on the job pod …
just one thing: in the old way of doing things with Prefect Environments, the pod that runs the CloudFlowRunner is the second pod to start, and as such has a better experience with Lazarus
so the docs on Lazarus state:
"Where necessary, flow runs without submitted or running task runs will be rescheduled by the Lazarus process up to 10 times."
<https://docs.prefect.io/orchestration/concepts/services.html#how-does-it-work>
And sadly, in our case of no running tasks (i.e. all tasks remain in a Pending state) we can’t really rely on Lazarus …
So we do have access to flow-level SLAs via automations - it is far from ideal though, given that with Lazarus one would know something is up / get notified after ~10 minutes, whereas with a long-running flow one won’t know there is an issue until the full expected run time has elapsed (i.e. up to 60-120 minutes in our case)
j
Hmm... you're right about that section in the docs. Let me double check that as there is some mismatch of expectations.