How should we think about job resiliency during agent stoppage and restart? Prefect Community #ask-community

Kevin Wang

10/17/2022, 5:32 PM

How should we think about job resiliency during agent stoppage and restart? When an agent or a flow run executor stops, how can we ensure the flow run is properly rerun and NOT stuck in the Running state (with nothing executing it)? I'm using a local execution block, but this GitHub issue suggests remote runs on Kubernetes (and maybe ECS) are also a concern, with 'missing' jobs that never get rerun. https://prefect-community.slack.com/archives/C03D12VV4NN/p1664554177863929?thread_ts=1664525816.785439&cid=C03D12VV4NN

Christopher Boyd

10/17/2022, 5:39 PM

Hi Kevin, I’m not 100% sure I follow. The agent itself is continuously polling for work that is scheduled and submitted. If the agent is stopped for some reason, it can’t pick up and execute new flow_runs, but the flow would still be ‘scheduled’. I don’t quite understand what you mean by the flow re-running or the Running state: if the agent is paused or stopped, it won’t be able to submit the job for execution until it is running again.

Kevin Wang

10/17/2022, 6:29 PM

I observed that for local execution, when the agent (and the local thread performing the flow) gets interrupted, the flow run that was Running is 'lost'. Nothing is actually performing the flow run anymore because it was terminated. That flow run is no longer on the queue, and no agent picks it up again. It can never leave the 'Running' state, but what we really wanted was for some agent to attempt it again. @Christopher Boyd I think that's what the other Prefect user on GitHub issue 7116 is also seeing, but with Kubernetes.
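To make the failure mode described above concrete, here is a minimal sketch in plain Python. It is only a toy model and does not use any Prefect API: `FlowRun`, `agent_poll`, `reschedule_stale`, and the `last_heartbeat` field are all hypothetical names invented for illustration. It assumes an agent that only picks up Scheduled runs, and shows one possible mitigation (a watchdog that returns stale Running runs to Scheduled), which is an assumption and not something the thread confirms Prefect does.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical in-memory model; none of these names are Prefect APIs.


@dataclass
class FlowRun:
    run_id: str
    state: str = "Scheduled"  # "Scheduled" or "Running" in this toy model
    last_heartbeat: Optional[float] = None  # last sign of life from the executor


def agent_poll(runs: Dict[str, FlowRun], now: float) -> List[str]:
    """Mimic an agent: pick up only Scheduled runs and mark them Running."""
    picked = []
    for run in runs.values():
        if run.state == "Scheduled":
            run.state = "Running"
            run.last_heartbeat = now
            picked.append(run.run_id)
    return picked


def reschedule_stale(runs: Dict[str, FlowRun], now: float, timeout: float) -> List[str]:
    """Hypothetical watchdog: move Running runs with no recent heartbeat back to Scheduled."""
    rescued = []
    for run in runs.values():
        if (
            run.state == "Running"
            and run.last_heartbeat is not None
            and now - run.last_heartbeat > timeout
        ):
            run.state = "Scheduled"
            run.last_heartbeat = None
            rescued.append(run.run_id)
    return rescued


runs = {"run-1": FlowRun("run-1")}

# The agent submits the run, which moves it to Running.
first_pick = agent_poll(runs, now=0.0)

# The agent and its local executor are killed here; nothing updates the run.

# A restarted agent finds nothing Scheduled, so the run stays stuck in Running.
second_pick = agent_poll(runs, now=100.0)

# The assumed watchdog notices the stale heartbeat and reschedules the run.
rescued = reschedule_stale(runs, now=100.0, timeout=30.0)

# Now an agent can pick the run up again.
third_pick = agent_poll(runs, now=101.0)
```

In this model the second poll returns nothing because the interrupted run is still marked Running, which matches the 'lost' run described above. Whether rescheduling is safe in practice depends on the flow being idempotent, since the original executor may have completed part of the work before it was terminated.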
