james.lamb

04/05/2021, 7:40 PM
👋 hello from Chicago! At https://docs.prefect.io/orchestration/concepts/services.html#lazarus, I see this description of the Lazarus process
The Lazarus process is meant to gracefully retry failures caused by factors outside of Prefect's control. The most common situations requiring Lazarus intervention are infrastructure issues, such as Kubernetes pods not spinning up or being deleted before they're able to complete a run.
Once every 10 minutes, the Lazarus process searches for distressed flow runs. Each flow run found in this manner is rescheduled; this intervention by Lazarus is reflected in the flow run's logs.
Where can I find more specifics on how a flow could enter this "distressed" state? For example, if I have a KubernetesAgent up, Prefect Cloud triggers a flow run, and then the agent isn't able to start up a k8s job because I misconfigured its RBAC stuff, would that be something Lazarus retries? Will give more context in thread.
👋 1
There are several situations I can think of where I want to know "would this cause a failed run or would Lazarus try to restart this?". I could try to simulate these situations myself to test, but it's kind of tedious because Lazarus only runs once every 10 minutes and because I might make mistakes that make my manually-triggered issues different from real situations flows will encounter. I'm happy to be told "go to this GitHub link for the code and RTFM".
blegh sorry for the noise, I found it and I think this is probably enough for me to figure it out: https://github.com/PrefectHQ/server/blob/master/src/prefect_server/services/towel/lazarus.py Was looking in PrefectHQ/prefect, not in server.

Dylan

04/05/2021, 7:47 PM
No need to apologize @james.lamb!

Kevin Kho

04/05/2021, 7:47 PM
Hey @james.lamb! I was about to point you to this link also.
ā¤ļø 1

james.lamb

04/05/2021, 7:47 PM
thanks friends
ok @Kevin Kho I actually do have a question about this. For the instance of Lazarus that runs with Prefect Cloud, specifically, is the interval "once every 10 minutes"? I've had a flow stuck in "submitted for execution" for 13 minutes now. The flow does not have a schedule (I just manually trigger it with prefect run flow). I expected that by now, Lazarus would have tried to re-run it.
oooo it was just picked up! But it looks more like a 16-minute difference between submission time and when it was retriggered. I understand that the "every 10 minutes" is probably based on the clock and not "Lazarus checks on each flow 10 minutes after it was first submitted", but I still would have expected that the absolute worst case would be waiting 9:59.99999, you know?

Kevin Kho

04/05/2021, 9:31 PM
What is your setup? AWS and EKS or ECS?
Was this flow running before, and is this behavior a new thing?

james.lamb

04/05/2021, 9:31 PM
• Prefect Cloud
• Agent is a KubernetesAgent on EKS
oh no sorry, let me clarify. I'm trying to test what happens on a Lazarus-retried flow run because I think it might be different in a meaningful way from a "normal" flow run. So I triggered a flow run with prefect run flow, then manually killed the flow run job in kubernetes as soon as it started (before any tasks could start). I wanted that to trigger Lazarus to re-submit the flow run, and it did! I was just surprised that it took 16 minutes to restart the flow run. I expected it to be at WORST 10 minutes, since the docs say that service checks once every 10 minutes for distressed flows and since I only have a single flow in my tenant. Is Lazarus a multi-tenant service? I guess if the load from all tenants can impact it, then it makes sense to me that I could see it take that long. Like if the code in Lazarus is like this pseudocode:
distressed_flow_runs = get_all_distressed_flow_runs_for_all_tenants()
for flow_run in distressed_flow_runs:
    resubmit(flow_run)
Like if there is a queue of resubmissions that have to be worked through, I get how there could be a noticeable delay for my tenant if my flow run isn't first in the queue.
The Lazarus-resubmitted flow run works perfectly, I'm just confused by how long it took to be triggered and think that might mean I don't understand the instance of this service running in Prefect Cloud.

Dylan

04/05/2021, 9:39 PM
Hi @james.lamb! Lazarus is a multi-tenant service that checks for Flow Runs with a heartbeat that is 10 or more minutes stale. The Lazarus process runs every 10 minutes, meaning it may take up to 20 minutes for a Flow Run to be recognized, depending on when you kick off the Flow Run in question
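The timing described here can be sketched as a small model. This is not the prefect_server code, only an illustration that assumes exactly two things from the message above: a check that fires every 10 minutes on a fixed clock, and a run counting as distressed once its heartbeat is 10 or more minutes old. The function name and the minute-based clock are hypothetical, and the model ignores processing time and any queueing across tenants.

```python
import math

LAZARUS_INTERVAL_MIN = 10  # how often the check runs (per the thread)
STALE_THRESHOLD_MIN = 10   # heartbeat age that marks a run as distressed (per the thread)


def minutes_until_noticed(last_heartbeat_min: float, first_check_min: float = 0.0) -> float:
    """Minutes from the last heartbeat to the first check that sees the run as stale.

    Times are minutes on a shared clock. Checks are assumed to fire at
    first_check_min, first_check_min + 10, first_check_min + 20, and so on.
    """
    # Earliest moment at which the heartbeat is "10 or more minutes stale".
    stale_at = last_heartbeat_min + STALE_THRESHOLD_MIN
    # Index of the first scheduled check at or after that moment.
    ticks = max(0, math.ceil((stale_at - first_check_min) / LAZARUS_INTERVAL_MIN))
    check_time = first_check_min + ticks * LAZARUS_INTERVAL_MIN
    return check_time - last_heartbeat_min


if __name__ == "__main__":
    for heartbeat in (0.0, 4.0, 9.5):
        print(heartbeat, minutes_until_noticed(heartbeat))
```

Under these assumptions the wait falls between 10 and just under 20 minutes. A last heartbeat 4 minutes after a check gives a 16-minute wait, which is consistent with the delay reported earlier in the thread.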

james.lamb

04/05/2021, 9:40 PM
oooo ok, that was exactly the correction to my expectations I needed. Thanks very much!

Dylan

04/05/2021, 9:43 PM
Anytime!