Hi Everyone, I am encountering an issue where bec...
# ask-community
n
Hi everyone, I'm encountering an issue: because of an infra problem, my agent's Docker container keeps restarting, and whenever that happens, the flows that were running get stuck in the Running state. Is it possible to apply a flow-level or flow-run-level timeout, so that the flow run is automatically marked as failed after a certain time?
b
Hi Nimesh! An automation that enforces an SLA on your flow runs should help you out here. If you're using a version of Prefect that is >=3.1.8, you can use runner heartbeats and an automation that marks zombie flows as Crashed. Here are some docs to help you with that!
If you're using an older version, setting up an automation that cancels a flow run when it has been stuck in Running for longer than X amount of time is also a possibility.
^IMO, these are better alternatives than setting a timeout at the flow level. For infrastructure failures, flow-level timeouts aren't guaranteed to work, since the process that would enforce the timeout may itself be gone.
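To make the idea concrete, here is a rough plain-Python sketch of the rule such an automation enforces: any run still in Running that has not been heard from within some window gets marked Crashed. This is not Prefect's API. `FlowRun`, `mark_zombies`, `last_seen`, and the 90-second window are all hypothetical names and values chosen for illustration. In a real setup the check runs server-side in the automation, not in your own code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FlowRun:
    # Hypothetical stand-in for a flow run record; not Prefect's own model.
    name: str
    state: str
    last_seen: datetime  # last heartbeat, or the time the run entered Running


def mark_zombies(runs, now, max_silence):
    """Mark Running runs that have been silent longer than max_silence as Crashed.

    Returns the names of the runs whose state was changed.
    """
    crashed = []
    for run in runs:
        if run.state == "Running" and now - run.last_seen > max_silence:
            run.state = "Crashed"
            crashed.append(run.name)
    return crashed


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    FlowRun("healthy", "Running", now - timedelta(seconds=20)),
    FlowRun("zombie", "Running", now - timedelta(minutes=10)),
    FlowRun("done", "Completed", now - timedelta(hours=1)),
]

# Assumed window of 90 seconds; pick whatever fits your flows.
crashed_names = mark_zombies(runs, now, max_silence=timedelta(seconds=90))
print(crashed_names)  # → ['zombie']
```

With heartbeats, `last_seen` would be the time of the last heartbeat. On older versions, it would be the time the run entered Running, and the window would be your "X amount of time".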
n
Hi Bianca, thanks for the response. 1. Can we set up the automation without building a custom scheduler? We don't want to build a scheduler for this, as it would increase the load on the local system. 2. Also, I'm curious why Prefect doesn't continue or restart the running flow when the agent comes back up, which brings me to the question: are Prefect flow runs not persistent?
a
Hi! I am having the same problem. I tried the suggested trigger and action for marking zombie flows as Crashed, but it is not working for me. I am using the "process" worker type. Since we are using spot instances in production, my expectation is that when a worker goes down, another available worker retries the flow. Is this doable with Prefect? This is currently working as expected with Celery.
b
> Since we are using spot instances in production, my expectation is that when a worker goes down, another available worker retries the flow. Is this doable with Prefect?
Hi Anibal! My understanding is that it was a design decision to decouple flow runs from worker health in order to minimize failures. That way, the flow run submitted by the worker can continue to execute independently of the worker's state. With other types of workers (Docker, k8s, ECS, etc.), this distinction makes sense, since the flow run is running on separate infrastructure from the worker that submitted it. With process workers, however, this design choice doesn't apply as nicely (since the work and worker are running in the same process). Another user actually opened an issue here for us to explore a way of handling subprocess cancellation a bit better whenever a process worker goes down. You're welcome to follow along with that issue. FWIW, I tested an automation that looks for flows submitted by a process worker that are stuck in a Running state for longer than X amount of time and marks them as Crashed, and it worked. It should work for you as well!
> Can we do automation without making any custom scheduler? We don't want to build a scheduler for this, as this will increase the load on the local system.
Nimesh, I'm not sure I follow what you mean by creating an automation without making a custom scheduler. The automation is created server-side, and only needs to be made once. It shouldn't add too much load on your system. 🤔
> Also just curious about why Prefect doesn't continue the running flow and restart it when the agent comes up, which brings me to the question "are Prefect flow runs not persistent"?
I think my explanation to Anibal about flow runs being decoupled from agent/worker health may help answer this question. If not, let me know!