Prefect Community #ask-community

John Muehlhausen

09/27/2021, 6:27 PM

Sorry about the noob question: If an agent dies/stops being responsive in the middle of running a task (not the same as a confirmed failure) will the scheduler schedule it to run again? And, is this behavior configurable?

Kevin Kho

09/27/2021, 6:29 PM

Hey @John Muehlhausen, not a noob question, and this is dependent on the agent type. Flows run by a local agent will die with it, but flows started by the other agent types will continue. Are you using Dask also?

John Muehlhausen

09/27/2021, 7:03 PM

For reasons of configuration complexity, if a local agent goes away I need the work to be run by a different agent with the same label. Due to the complexity of our authentication/authorization system only local agent seems feasible.

John Muehlhausen

09/27/2021, 7:04 PM

By "goes away" I mean the agent dies unexpectedly, like the hardware it is running on starts smoking.

Kevin Kho

09/27/2021, 7:13 PM

Does your hardware smoke often? 😅 The flow retry behavior varies by run config so I’m not 100% sure. I am pretty positive that the behavior is this:
1. The heartbeat is lost
2. The flow is identified as a Zombie and the Zombie Killer kills it
3. Lazarus redeploys the flow
4. A new agent with similar labels will be able to pick up the flow retry

This is different from a Flow raising an explicit error. That will not be retried by Lazarus.
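The four steps above can be sketched as a toy simulation. This is not Prefect's API: every class, function, and the timeout value below are hypothetical, and the label check assumes a run is eligible when its labels are a subset of the agent's labels.

```python
from dataclasses import dataclass

# Assumed value for illustration only, not a documented Prefect setting.
HEARTBEAT_TIMEOUT = 120.0  # seconds


@dataclass
class FlowRun:
    name: str
    labels: frozenset
    state: str = "Running"
    last_heartbeat: float = 0.0
    raised_error: bool = False  # True if the flow itself raised an explicit error


@dataclass
class Agent:
    name: str
    labels: frozenset
    alive: bool = True


def zombie_killer(run, now):
    """Steps 1-2: fail a Running flow run whose heartbeat is older than the timeout."""
    if run.state == "Running" and now - run.last_heartbeat > HEARTBEAT_TIMEOUT:
        run.state = "Failed"
        return True
    return False


def lazarus(run, killed_as_zombie):
    """Step 3: reschedule only runs that were killed as zombies, not explicit errors."""
    if killed_as_zombie and not run.raised_error:
        run.state = "Scheduled"


def pick_agent(run, agents):
    """Step 4: return the first live agent whose labels cover the run's labels."""
    for agent in agents:
        if agent.alive and run.labels <= agent.labels:
            return agent
    return None
```

In this sketch, a run whose agent vanished goes Running, then Failed, then Scheduled, and a second agent with matching labels can pick it up. A run that failed with an explicit error is never touched by `lazarus`.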

Kevin Kho

09/27/2021, 7:14 PM

Check this for Lazarus and Zombie Killer

John Muehlhausen

09/27/2021, 7:35 PM

Yes I'm only looking for automatic retry if the agent disappears mysteriously

John Muehlhausen

09/27/2021, 7:37 PM

To answer your question, in an emergency scenario our system may reallocate some local agent hardware to other purposes.

Kevin Kho

09/27/2021, 7:38 PM

Wow, on an on-prem system? That’s pretty cool. My last job tried to do that but didn’t really

John Muehlhausen

09/27/2021, 8:57 PM

It would be nice if Lazarus and Zombie Killer were configurable from flow.register(). It is impossible to guess what the user wants done when heartbeats are lost in various states. For some long-running task that would be expensive to retry, it is worth waiting longer to see if a transient networking issue between agent and cloud resolves (this would assume, of course, that agents try to reconnect once connectivity is restored). For a short-running task it might be better to retry pretty quickly on a different agent with compatible labels, and if the other agent ever comes back, tell it immediately to kill the abandoned task if it is still running. For that matter, in the second case the user may want to configure the agent itself to kill all tasks once the heartbeat to cloud is lost from the agent's perspective. All of this needs to be configurable for Prefect to reach production status for us. Are there tickets for this?

John Muehlhausen

09/27/2021, 9:00 PM

configure the agent itself to kill all tasks

In the affected flow, I mean
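To make the request concrete, here is a minimal sketch of what a per-flow heartbeat-loss policy could look like. Nothing here exists in Prefect: `HeartbeatLossPolicy`, `decide`, `on_agent_reconnect`, and all field names and numbers are hypothetical and only illustrate the long-running versus short-running cases described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatLossPolicy:
    """Hypothetical per-flow settings for what to do when heartbeats stop."""
    grace_seconds: float                   # how long to wait for the agent to reconnect
    max_retries: int                       # 0 means never retry on another agent
    kill_orphan_on_reconnect: bool = True  # kill the abandoned run if the old agent returns


def decide(policy, seconds_silent, retries_so_far):
    """Return 'wait', 'retry', or 'give_up' for a run whose heartbeat went silent."""
    if seconds_silent < policy.grace_seconds:
        return "wait"
    if retries_so_far >= policy.max_retries:
        return "give_up"
    return "retry"


def on_agent_reconnect(policy, run_was_retried):
    """Decide what the returning agent should do with its abandoned run."""
    if run_was_retried and policy.kill_orphan_on_reconnect:
        return "kill_orphan"
    return "resume"


# Expensive long-running task: wait out transient network issues before retrying.
LONG_TASK = HeartbeatLossPolicy(grace_seconds=1800, max_retries=1)
# Cheap short-running task: retry quickly on another agent with compatible labels.
SHORT_TASK = HeartbeatLossPolicy(grace_seconds=60, max_retries=3)
```

After five minutes of silence, `decide(LONG_TASK, 300, 0)` keeps waiting while `decide(SHORT_TASK, 300, 0)` asks for a retry.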

Kevin Kho

09/27/2021, 9:06 PM

It is on the roadmap from what I know. So from what I see here, all of this is possible if the number of Lazarus retry attempts is configurable, right? If you don’t want it to retry, you can configure the number to 0, and if you want it to retry, you could set any number. I think this has been in discussion, but there is no immediate timeline for it. I believe this topic is tied to this announcement. But I am honestly not sure it is reliable to depend on these services like that either. For example, the flow could raise an OOM error. That is different from something like a system exit.

Anna Geller

10/13/2021, 11:18 AM

@John Muehlhausen just wanted to chime in and mention some other ideas that might be helpful to solve your problem.
1. If you use KubernetesAgent or ECSAgent, then it’s much easier to achieve high availability, because the agent process can run as a service (i.e. Kubernetes service or ECS service) that can be automatically restarted if the agent becomes unhealthy
2. Prefect Cloud has Automations that allow you to take action if some agent becomes unhealthy. This could be either sending a message to the Ops team, or even triggering an automated procedure via WebhookAction to set up a new agent or restart the existing one automatically

Option #1 is probably easier to configure.
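For a local agent, a rough stand-in for the "restart it if it becomes unhealthy" idea in option 1 is a small supervisor loop. This is only a sketch under assumptions: `supervise` is a hypothetical helper, it only detects the process exiting (not a hung agent), and it cannot help if the whole machine is lost.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_seconds=5.0):
    """Run cmd and restart it each time it exits, up to max_restarts restarts.

    Returns the total number of times the process was started.
    """
    starts = 0
    while starts <= max_restarts:
        starts += 1
        exit_code = subprocess.call(cmd)  # blocks until the process exits
        print(f"process exited with code {exit_code} (start {starts})", file=sys.stderr)
        if starts <= max_restarts:
            time.sleep(backoff_seconds)  # brief pause before restarting
    return starts
```

A call might look like `supervise(["prefect", "agent", "local", "start", "--label", "my-label"])`, assuming the Prefect 1.x CLI is on the path; the label name is made up.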
