hi! is there a feature for restarting crashed flow...
# ask-community
l
hi! is there a feature for restarting crashed flows? I have infra that sometimes causes crashes and I want to immediately start the flow again
some more details: i'm self-hosting on kubernetes, and once in a while my pods get a SIGTERM and are restarted somewhere else. if this happens during a flow run I want to make sure the flow is re-queued so another agent can run it
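For illustration only, the behaviour being asked for (start the work again when the process running it is killed) can be sketched outside Prefect with the standard library. This is not a Prefect API: `run_until_not_killed` and `max_attempts` are hypothetical names, and the sketch assumes a POSIX system, where `subprocess` reports a child killed by a signal as a negative return code.

```python
import subprocess
import sys
from typing import List


def run_until_not_killed(cmd: List[str], max_attempts: int = 3) -> int:
    """Re-run cmd while it is killed by a signal; return the last return code.

    Hypothetical helper, not part of Prefect. On POSIX, subprocess reports
    a child terminated by signal N as return code -N.
    """
    returncode = 0
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(cmd).returncode
        if returncode >= 0:
            # Normal exit (success or an ordinary failure): do not restart.
            break
        print(f"attempt {attempt} was killed by signal {-returncode}", file=sys.stderr)
    return returncode
```

The key design point is that the supervisor lives outside the process that gets killed, which is why a retry that runs inside the dying agent cannot help here.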
t
hey @Lior Barak -- have you attempted to use the retry parameter? Attaching documentation link here
l
@Tess Dicker yeah tried the retry, but it looks like it's implemented at the agent level? in my case the agent no longer exists, so the retry never happens. maybe it could be done at the queue level?
t
You can try a hook when a crash occurs.
l
ah interesting! tried the on_crashed hook but it doesn't seem to run. some logs:
Received SIGTERM. Sending SIGINT to the Prefect agent (PID 7)...
Received SIGINT. Sending SIGINT to the Prefect agent (PID 7)...
13:05:03.823 | ERROR   | prefect.infrastructure.process - Process 'russet-bandicoot' exited with status code: -15; This indicates that the process exited due to a SIGTERM signal. Typically, this is caused by manual cancellation.
13:05:03.877 | INFO    | prefect.agent - Reported flow run 'da025a15-b6f5-4537-8a33-f7c3dcc9023c' as crashed: Flow run infrastructure exited with non-zero status code -15.
Agent stopped!

Aborted
looks like the on_crashed hook is never called
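As a side note on the log above: the status code -15 follows the convention used by Python's `subprocess` module, where a negative return code means the child was terminated by that signal number. A minimal sketch of decoding it, where `describe_exit_code` is a hypothetical helper name and not something Prefect provides:

```python
import signal


def describe_exit_code(returncode: int) -> str:
    """Describe a subprocess-style return code (negative means killed by a signal)."""
    if returncode < 0:
        try:
            # Map the signal number back to its symbolic name, e.g. 15 -> SIGTERM.
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"unknown signal {-returncode}"
        return f"terminated by {name}"
    if returncode == 0:
        return "exited cleanly"
    return f"exited with error code {returncode}"


print(describe_exit_code(-15))  # terminated by SIGTERM
```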
t
Can you send over some of the code for the hook?
l
def crash_test_dummy(flow: FlowSchema, flow_run: FlowRun, state: State):
    print("crashed")
    context = FlowRunContext.get()
    logger = get_run_logger(context=context)
    logger.info("crashed - test")
t
and you also have
@flow(on_crashed=[crash_test_dummy])
Hey @Lior Barak -- seems like the on_crashed hook is being updated at the moment and a fix will be deployed soon. That should fix the problem of the hook not being called. https://github.com/PrefectHQ/prefect/pull/11026
l
amazing! I'll keep an eye out for the fix 🙂
(also yes i'm using
@flow(on_crashed=[crash_test_dummy])
)