Is there a doc on Prefect's fault tolerance and recovery? I ran the following simple experiment whic...

Srini

03/24/2023, 1:21 PM

Is there a doc on Prefect's fault tolerance and recovery? I ran the following simple experiment, which seems to suggest that Prefect cannot recover from server failures, so I feel I'm missing something. The experiment:
• Run the server
• Run a flow (an infinite loop that does nothing) in a terminal -> flow run state changes to `Running`
• Kill the server
• Kill the flow run in the terminal
• Restart the server -> flow run state is still `Running`

My question is: will the server notice that the flow has been running longer than the specified `timeout_seconds` and recover from it (move it to a state that will be retried)? Deployments show the same behaviour. If my observation is correct, I can open a GitHub issue to track this.

redsquare

03/24/2023, 1:31 PM

the jepsen test 🙂

Srini

03/24/2023, 2:14 PM

yeah, haha! This also seems problematic when the agent cannot shut down gracefully

Srini

03/27/2023, 6:26 PM

cc @Bianca Hoch @Sahil Rangwala this was the experiment that I tried to explain 😅

Bianca Hoch

03/28/2023, 9:25 PM

Hey @Srini! Would you mind creating an issue? A continued investigation into this would be interesting. It'd help with tracking purposes as well.

Kevin Wang

04/17/2023, 2:17 PM

Did you create an issue, @Srini? I believe it's still a problem even in 2.10 (I've only tried an agent restart, not the new worker system). There's this open issue, which comes down to the lack of a heartbeat in Prefect 2: https://github.com/PrefectHQ/prefect/issues/7239
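
For readers following the thread, here is a rough sketch of the general idea behind heartbeat-based recovery, which is what the linked issue says is missing. This is not Prefect's implementation or API: `RunRecord`, `mark_stale_runs`, the `max_silence` threshold, and the state names are all hypothetical and chosen only for illustration. It assumes each run periodically reports a heartbeat timestamp to the server, and that the server sweeps for runs that have gone silent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class RunRecord:
    """Hypothetical server-side record of a flow run (not a Prefect type)."""
    state: str
    last_heartbeat: datetime


def mark_stale_runs(
    runs: Dict[str, RunRecord], now: datetime, max_silence: timedelta
) -> List[str]:
    """Flip RUNNING runs with no recent heartbeat to CRASHED; return their ids."""
    stale = []
    for run_id, run in runs.items():
        # Only runs the server still believes are active can be stale.
        if run.state == "RUNNING" and now - run.last_heartbeat > max_silence:
            run.state = "CRASHED"
            stale.append(run_id)
    return stale
```

In the experiment above, the killed flow run would stop sending heartbeats, so a sweep like this after the server restart could move it out of `Running` and make it eligible for a retry policy. Without any heartbeat, the server has no signal to tell a dead run from a slow one.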
