Nimesh Kumar
02/05/2025, 12:02 PM
Bianca Hoch
02/05/2025, 10:19 PM
Crashed. Here's some docs to help you with that!
Bianca Hoch
02/05/2025, 10:20 PM
Running for longer than X amount of time is also a possibility.
Bianca Hoch
02/05/2025, 10:22 PM
Nimesh Kumar
02/06/2025, 5:01 AM
Anibal Rivero
02/06/2025, 11:09 AM
Bianca Hoch
02/06/2025, 9:07 PM
"Since we are using spot instances in production, my expectation is that when a worker goes down, another available worker retries the flow. Is this doable with Prefect?"
Hi Anibal! My understanding is that it was a design decision to decouple flow runs from worker health in order to minimize failures. That way, the flow run submitted by the worker can continue to execute independently of the worker's state. With other types of workers (docker, k8s, ECS, etc.), this distinction makes sense since the flow run is running on separate infrastructure from the worker that submitted it. With process workers, however, this design choice doesn't apply as nicely (since the work and worker are running in the same process).
Another user actually opened an issue here for us to explore a way of handling subprocess cancellation a bit better whenever a process worker goes down. You're welcome to follow along with that issue.
FWIW, I tested an automation that looks for flows submitted by a process worker that are stuck in a Running state for longer than X amount of time and marks them as Crashed, and it worked. It should work for you as well!
Bianca Hoch
02/06/2025, 9:10 PM
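For anyone following along: the automation described above (find flow runs submitted by a process worker that have sat in Running for longer than X, then mark them Crashed) comes down to a time-threshold check. Here is a rough sketch of that logic in plain Python. It does not use Prefect's API. `FlowRun`, `find_stuck_runs`, `mark_crashed`, and the field names are all hypothetical stand-ins, and in Prefect itself this rule would presumably be set up as an automation rather than hand-written.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FlowRun:
    # Hypothetical record for illustration; this is NOT Prefect's FlowRun model.
    id: str
    state: str            # e.g. "Running", "Completed", "Crashed"
    worker_type: str      # e.g. "process", "docker", "kubernetes"
    started_at: datetime  # when the run entered its current state


def find_stuck_runs(runs, max_running, now=None):
    """Return process-worker runs that have been Running longer than max_running."""
    now = now or datetime.now(timezone.utc)
    return [
        run
        for run in runs
        if run.worker_type == "process"
        and run.state == "Running"
        and now - run.started_at > max_running
    ]


def mark_crashed(runs):
    """Set the given runs to Crashed in place (stand-in for the automation's action)."""
    for run in runs:
        run.state = "Crashed"
    return runs


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    runs = [
        FlowRun("old-process-run", "Running", "process", now - timedelta(hours=3)),
        FlowRun("fresh-process-run", "Running", "process", now - timedelta(minutes=5)),
        FlowRun("old-docker-run", "Running", "docker", now - timedelta(hours=3)),
    ]
    # X = 1 hour here; pick whatever threshold fits your longest legitimate run.
    mark_crashed(find_stuck_runs(runs, timedelta(hours=1), now=now))
    for run in runs:
        print(run.id, run.state)
```

The threshold X should sit comfortably above the longest legitimate run time, otherwise healthy long-running flows would be marked Crashed too.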