Is there a way to do retries on entire/scheduled flows? Retries on tasks work well, but I have had a flow fail because the executor is a remote cluster, so it didn't even reach the task stage. In my case, a Coiled cluster failed to spin up for some reason (actually the first time that has happened).
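For reference, task-level retries are configured on the task itself; a minimal Prefect 1.x sketch (the task name is illustrative):

from datetime import timedelta
from prefect import task

# Retry this task up to 3 times, one minute apart, if it raises.
@task(max_retries=3, retry_delay=timedelta(minutes=1))
def fetch_data():
    ...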
Kevin Kho
09/26/2021, 8:35 PM
Hey @Henning Holgersen, the issue we've had so far is that flow-level retries mean different things to different people. Did the Lazarus process kick in, according to your logs? In cases where you don't get infrastructure, I think the Lazarus process should attempt a retry.
Henning Holgersen
09/28/2021, 7:07 AM
No trace of Lazarus. The logs note a
coiled.errors.ServerError: Could not launch scheduler for dask cluster
error. The Coiled dashboard shows signs of a cluster at that time, but not one actually running. So it's consistent like that…
Kevin Kho
09/28/2021, 2:04 PM
Gotcha. Actually I bumped into this ServerError myself yesterday. Will ask the team what ideas they have.
Kevin Kho
09/28/2021, 3:13 PM
Ok, so this situation is a bit tricky to restart automatically, because a blind restart with a state handler upon failure would also fire when the flow fails due to data errors, which could cause an infinite loop. You would need something like this (see the sketch after the list):
1. Create a record in the KV Store and set it to true.
2. Make the first task of the flow set the flag in the KV Store to false. This is your indication that the executor came up successfully.
3. If the flow fails and the KV Store flag is still true, this shows the executor didn't start, and then you can either create_flow_run to kick it off again or set_flow_state to change it from Failed to Scheduled to run again.
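A minimal sketch of those three steps, assuming Prefect 1.x with a Cloud/Server backend (the KV Store is a Cloud feature); the key name, retry delay, and flow structure are illustrative, not a drop-in recipe:

import pendulum
import prefect
from prefect import Flow, task
from prefect.backend import get_key_value, set_key_value
from prefect.client import Client
from prefect.engine.state import Scheduled

FLAG_KEY = "executor-came-up"  # hypothetical KV Store key


@task
def mark_executor_up():
    # Step 2: first task of the flow. If this runs at all, the cluster
    # must have started, so flip the flag to "false".
    set_key_value(key=FLAG_KEY, value="false")


def retry_if_no_executor(flow, old_state, new_state):
    # Flow-level state handler covering steps 1 and 3.
    if new_state.is_running():
        # Step 1: arm the flag at the start of every run.
        set_key_value(key=FLAG_KEY, value="true")
    elif new_state.is_failed() and get_key_value(FLAG_KEY) == "true":
        # Step 3: the first task never ran, so the executor never came up.
        # Reschedule this same flow run instead of leaving it Failed.
        Client().set_flow_run_state(
            flow_run_id=prefect.context.get("flow_run_id"),
            state=Scheduled(start_time=pendulum.now("UTC").add(minutes=5)),
        )
    return new_state


with Flow("coiled-flow", state_handlers=[retry_if_no_executor]) as flow:
    started = mark_executor_up()
    # downstream tasks would take `started` as an upstream dependency

The alternative branch of step 3 would start a fresh run instead of rescheduling the same one, e.g. via Client().create_flow_run(flow_id=...) or the create_flow_run task from prefect.tasks.prefect.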