
Srini

03/24/2023, 1:21 PM
Is there a doc on Prefect's fault tolerance and recovery? I ran the following simple experiment, which seems to suggest that Prefect cannot recover from server failures, so I feel I'm missing something. The experiment:
• Run the server
• Run a flow (an infinite loop that does nothing) in a terminal -> Flow run state changes to `Running`
• Kill the server
• Kill the flow run in the terminal
• Restart the server -> Flow run state is still `Running`
My question is: Will the server notice that the flow has been running longer than the specified `timeout_seconds` and recover from it (move it to a state that will be retried)? Deployments show the same behaviour too. If my observation is correct, I can open a GitHub issue to track this.

redsquare

03/24/2023, 1:31 PM
the jepsen test 🙂

Srini

03/24/2023, 2:14 PM
yeah, haha! This also seems problematic when the agent cannot shut down gracefully
cc @Bianca Hoch @Sahil Rangwala this was the experiment that I tried to explain 😅

Bianca Hoch

03/28/2023, 9:25 PM
Hey @Srini! Would you mind creating an issue? A continued investigation into this would be interesting. It'd help with tracking purposes as well.

Kevin Wang

04/17/2023, 2:17 PM
Did you create an issue, @Srini? I believe it's still a problem even in 2.10 (I've only tried an agent restart, not the new worker system). There's this open issue, caused by the lack of heartbeats in Prefect 2: https://github.com/PrefectHQ/prefect/issues/7239
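
For readers unfamiliar with the heartbeat idea mentioned above, here is a minimal sketch of how a server could detect the stuck run from the experiment. It is not Prefect's API or implementation: `RunRecord`, `mark_stale_runs`, the state names, and the `heartbeat_timeout` parameter are all hypothetical names chosen for illustration. The assumption is that each running flow periodically reports a timestamp, and the server treats a run as crashed once that timestamp is too old.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    """Hypothetical server-side record of a flow run (not a Prefect type)."""
    state: str             # e.g. "RUNNING", "COMPLETED", "CRASHED"
    last_heartbeat: float  # time of the last heartbeat, in seconds


def mark_stale_runs(
    runs: Dict[str, RunRecord], now: float, heartbeat_timeout: float
) -> List[str]:
    """Mark RUNNING runs with no recent heartbeat as CRASHED.

    Returns the ids of the runs whose state was changed.
    """
    crashed = []
    for run_id, record in runs.items():
        is_running = record.state == "RUNNING"
        is_stale = now - record.last_heartbeat > heartbeat_timeout
        if is_running and is_stale:
            # The process that owned this run stopped reporting, so the
            # server can no longer assume it is alive.
            record.state = "CRASHED"
            crashed.append(run_id)
    return crashed


if __name__ == "__main__":
    runs = {
        "killed-flow": RunRecord(state="RUNNING", last_heartbeat=0.0),
        "healthy-flow": RunRecord(state="RUNNING", last_heartbeat=95.0),
    }
    # After a server restart, a periodic sweep like this would catch the
    # run that was killed while the server was down.
    print(mark_stale_runs(runs, now=100.0, heartbeat_timeout=30.0))
```

Without such a signal, a run whose process was killed while the server was down has no way to report its own failure, which matches the stuck `Running` state observed in the experiment.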