Morning, I am testing a crash recovery scenario with Prefect 2. My workflow has a few steps, all of which do lengthy sleeps. I have an agent running on a Compute Engine instance (won't be our final deployment architecture, but it is convenient for testing). I suspended the Compute Engine instance mid flow run. The log in the Prefect UI indicates "Crash detected!", but the TaskRun still shows as Running. After restarting the agent, it looks like there is no automatic crash recovery, so in this scenario would we need to set up a flow timeout? Is there any way to resubmit a TaskRun, and do all the agents operate in this way?
Just double-checked my workflow and I did have a timeout set on the flow, which has elapsed. So is the timeout checked by the agent?
03/31/2022, 10:10 AM
The Crashed state and the handling of such recovery scenarios are on the roadmap, but they are not fully implemented and tested yet. Thanks for testing that already!
03/31/2022, 10:23 AM
Thanks Anna. Is this already implemented in Prefect 1? I'm thinking we would end up using the KubernetesAgent.
03/31/2022, 12:41 PM
You can’t resubmit a Task Run individually, but in Prefect 1 you can click the restart button in the UI to spin up all Failed tasks of a Flow.
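To make the restart semantics described above concrete, here is a minimal plain-Python sketch. It does not use Prefect's API: `State`, `tasks_to_restart`, and the task names are hypothetical and only model the idea that a restart picks up every task in a Failed state rather than a single chosen task run.

```python
from enum import Enum
from typing import Dict, List


class State(Enum):
    # Hypothetical state names for illustration; not Prefect's state classes.
    SUCCESS = "Success"
    FAILED = "Failed"
    RUNNING = "Running"


def tasks_to_restart(task_states: Dict[str, State]) -> List[str]:
    """Return the names of all tasks currently Failed, in insertion order."""
    return [name for name, state in task_states.items() if state is State.FAILED]


# Hypothetical flow run: one task finished, two failed.
run = {
    "extract": State.SUCCESS,
    "transform": State.FAILED,
    "load": State.FAILED,
}
print(tasks_to_restart(run))
```

Under this simplified model, a task that is still marked Running (as in the suspended-instance scenario at the top of the thread) would not be selected, since only Failed tasks are picked up. Whether the real UI behaves that way for a crashed run is an assumption to verify.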