james.lamb
04/05/2021, 7:40 PM
The Lazarus process is meant to gracefully retry failures caused by factors outside of Prefect's control. The most common situations requiring Lazarus intervention are infrastructure issues, such as Kubernetes pods not spinning up or being deleted before they're able to complete a run.
Once every 10 minutes, the Lazarus process searches for distressed flow runs. Each flow run found in this manner is rescheduled; this intervention by Lazarus is reflected in the flow run's logs.
Where can I find more specifics on how a flow could enter this "distressed" state? For example, if I have a KubernetesAgent up, Prefect Cloud triggers a flow run, and then the agent isn't able to start up a k8s job because I misconfigured its RBAC stuff, would that be something Lazarus retries? Will give more context in thread.
PrefectHQ/prefect, not in server.
Dylan
04/05/2021, 7:47 PM
Kevin Kho
04/05/2021, 7:47 PM
james.lamb
04/05/2021, 7:47 PM
prefect run flow).
I expected that by now, Lazarus would have tried to re-run it.
Kevin Kho
04/05/2021, 9:31 PM
james.lamb
04/05/2021, 9:31 PM
KubernetesAgent on EKS
prefect run flow, then manually killed the flow run job in Kubernetes as soon as it started (before any tasks could start). I wanted that to trigger Lazarus to re-submit the flow run, and it did!
I was just surprised that it took 16 minutes to restart the flow run. I expected it to be at WORST 10 minutes, since the docs say that the service checks once every 10 minutes for distressed flow runs and since I only have a single flow in my tenant.
Is Lazarus a multi-tenant service? I guess if the load from all tenants can impact it, then it makes sense to me that I could see it take that long. For example, if the code in Lazarus looks like this pseudocode:
distressed_flow_runs = get_all_distressed_flow_runs_for_all_tenants()
for flow_run in distressed_flow_runs:
    resubmit(flow_run)
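To make the queue idea concrete, here is a minimal Python sketch of that guess. It is a toy model only, not Lazarus's actual code: `DistressedRun`, `simulate_sweeps`, the 0.5-minute cost per resubmission, and the assumption that a single sweep works through all tenants' runs one at a time are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DistressedRun:
    run_id: str
    distressed_at: float  # minutes on a shared clock (hypothetical)


def simulate_sweeps(
    runs: List[DistressedRun],
    interval: float = 10.0,          # sweep period from the docs quote
    cost_per_resubmit: float = 0.5,  # assumed; not a real Lazarus number
) -> Dict[str, float]:
    """Return how many minutes each run waits before being resubmitted.

    Assumed model: a sweep starts every `interval` minutes, collects every
    run that is already distressed, and resubmits them serially.
    """
    waits: Dict[str, float] = {}
    pending = sorted(runs, key=lambda r: r.distressed_at)
    clock = 0.0
    sweep_start = interval
    while pending:
        # A sweep cannot begin before the previous one has finished.
        clock = max(clock, sweep_start)
        batch = [r for r in pending if r.distressed_at <= clock]
        for run in batch:
            clock += cost_per_resubmit
            waits[run.run_id] = clock - run.distressed_at
        pending = [r for r in pending if r.run_id not in waits]
        sweep_start += interval
    return waits


mine = DistressedRun("mine", 0.5)
others = [DistressedRun(f"other-{i}", 0.25) for i in range(12)]

print(simulate_sweeps([mine])["mine"])           # 10.0
print(simulate_sweeps(others + [mine])["mine"])  # 16.0
```

In this toy model a lone run waits about one interval, while the same run queued behind 12 others waits 16 minutes. So a delay longer than 10 minutes would be consistent with a shared queue, though it doesn't show that this is what Lazarus actually does.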
Like if there is a queue of resubmissions that have to be worked through, I get how there could be a noticeable delay for my tenant if my flow run isn't first in the queue.
Dylan
04/05/2021, 9:39 PM
james.lamb
04/05/2021, 9:40 PM
Dylan
04/05/2021, 9:43 PM