Is there a way to do retries on entire/scheduled flows?
# ask-community
h
Is there a way to do retries on entire/scheduled flows? Retries on tasks work well, but I have had a flow fail because the executor is a remote cluster, so it doesn't even reach the task stage. In my case, a Coiled cluster failed to spin up for some reason (actually the first time that has happened).
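For reference, this is roughly how the task-level retries are set up (assuming Prefect 1.x; the task body and delay here are illustrative):
```python
from datetime import timedelta

from prefect import task, Flow


# Task-level retries: Prefect re-runs a failed task up to max_retries times,
# but only once the run has actually reached the task stage.
@task(max_retries=3, retry_delay=timedelta(minutes=1))
def fetch_data():
    ...


with Flow("example-flow") as flow:
    fetch_data()
```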
k
Hey @Henning Holgersen, the issue we’ve had so far is that flow-level retries mean different things to different people. Did the Lazarus process kick in, judging by your logs? In cases where you don’t get infrastructure, I think the Lazarus process should attempt to retry.
h
No trace of Lazarus; the logs note a `coiled.errors.ServerError: Could not launch scheduler for dask cluster` error. The Coiled dashboard shows signs of a cluster at that time, but not actually running. So it’s consistent like that…
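For context, the flow’s executor points at a Coiled cluster roughly like this (the cluster kwargs are illustrative, not my exact config):
```python
import coiled
from prefect import Flow
from prefect.executors import DaskExecutor

with Flow("coiled-flow") as flow:
    ...  # tasks go here

# The Dask cluster is created when the flow run starts; if Coiled cannot
# launch the scheduler, the run fails before any task runs, so task-level
# retries never get a chance to apply.
flow.executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={"n_workers": 4, "software": "my-coiled-env"},  # illustrative
)
```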
k
Gotcha. Actually I bumped into this ServerError myself yesterday. Will ask the team what ideas they have.
Ok so this situation is a bit tricky to get to restart automatically, because doing a blind restart with a state handler upon failure would also apply when the Flow fails due to data errors, which could cause an infinite loop. You would need something like this:
1. Create a record in the KV Store and set it to true.
2. Make the first task of the flow set the KV Store flag to false. This is your indication that the executor came up successfully.
3. If the flow fails and the KV Store flag is still true, that shows the executor didn’t start, and you can either use create_flow_run to kick it off again or set_flow_state to change it from Failed to Scheduled to run again. (A sketch of this pattern is below.)
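A minimal sketch of that pattern, assuming Prefect 1.x with the Prefect Cloud KV Store. The key name executor-startup-flag is made up, and this version uses Client().create_flow_run rather than a state change to kick off the new run; resetting the flag on the Running transition covers step 1, since that transition happens before the executor is created:
```python
import prefect
from prefect import task, Flow
from prefect.backend import get_key_value, set_key_value
from prefect.client import Client

FLAG_KEY = "executor-startup-flag"  # hypothetical KV Store key


@task
def clear_startup_flag():
    # Step 2: first task of the flow. It only runs once the executor
    # (the Coiled/Dask cluster) has actually come up.
    set_key_value(key=FLAG_KEY, value="false")


def restart_if_executor_failed(flow, old_state, new_state):
    # Step 1: reset the flag when the run starts, before the executor exists.
    if new_state.is_running():
        set_key_value(key=FLAG_KEY, value="true")
    # Step 3: on failure, restart only if the flag is still "true", i.e. the
    # first task never ran because the cluster never started. Data errors
    # leave the flag at "false", so they are not retried (no infinite loop).
    if new_state.is_failed() and get_key_value(FLAG_KEY) == "true":
        flow_id = prefect.context.get("flow_id")
        if flow_id:
            Client().create_flow_run(flow_id=flow_id)
    return new_state


with Flow("coiled-flow", state_handlers=[restart_if_executor_failed]) as flow:
    flag_cleared = clear_startup_flag()
    # downstream tasks should depend on flag_cleared so it always runs first
```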