Is there a doc on Prefect's fault tolerance and recovery? I ran the following simple experiment, which seems to suggest that Prefect cannot recover from server failures. So I feel I'm missing something.
The experiment:
• Run the server
• Run a flow (infinite loop that does nothing) in a terminal -> Flow run state changes to Running
• Kill the server
• Kill the flow run in the terminal
• Restart the server -> Flow run state is still Running
My question is: Will the server notice that the flow has been running longer than the specified timeout_seconds and recover from it (that is, move it to a state that will be retried)?
Deployments show the same behaviour. If my observation is correct, I can open a GitHub issue to track this.
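For illustration, the kind of recovery check the question asks about could be sketched as below. This is a minimal sketch under stated assumptions, not Prefect's implementation or API: `FlowRunRecord`, `mark_stale_runs`, and the `AwaitingRetry` label are hypothetical names, and the only idea taken from the thread is comparing a run's elapsed time against its `timeout_seconds` once the server is back up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FlowRunRecord:
    """Hypothetical stand-in for a stored flow run; not Prefect's schema."""
    name: str
    state: str
    start_time: datetime
    timeout_seconds: Optional[float] = None


def mark_stale_runs(runs: List[FlowRunRecord], now: datetime) -> List[str]:
    """Move runs that are still Running past their timeout to a retryable state.

    Returns the names of the runs whose state was changed.
    """
    changed = []
    for run in runs:
        # Runs without a timeout cannot be judged stale by this check.
        if run.state != "Running" or run.timeout_seconds is None:
            continue
        if now - run.start_time > timedelta(seconds=run.timeout_seconds):
            run.state = "AwaitingRetry"  # placeholder label for "will be retried"
            changed.append(run.name)
    return changed
```

A check like this only covers runs that declare a timeout. A run with no timeout would need some other liveness signal, such as a heartbeat, to be told apart from a healthy long-running flow.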
redsquare
03/24/2023, 1:31 PM
the jepsen test 🙂
Srini
03/24/2023, 2:14 PM
yeah, haha! This also seems problematic when the agent cannot shut down gracefully
Srini
03/27/2023, 6:26 PM
cc @Bianca Hoch @Sahil Rangwala this was the experiment that I tried to explain 😅
Bianca Hoch
03/28/2023, 9:25 PM
Hey @Srini! Would you mind creating an issue? A continued investigation into this would be interesting. It'd help with tracking purposes as well.
Kevin Wang
04/17/2023, 2:17 PM
Did you create an issue @Srini? I believe it's still a problem even in 2.10 (I've only tried an agent restart, not the new worker system)
There's this open issue, due to lack of Heartbeat in Prefect 2 https://github.com/PrefectHQ/prefect/issues/7239