# ask-community
b
Hey, just confirming because the docs aren't super clear. For the flow decorator's `retries` argument: 'An optional number of times to retry on flow run failure.' Can I confirm that this does NOT include flows that fail due to a Crashed state?
c
Crash is an infrastructure event
Fail is a flow event
That is correct
We don’t retry crashes
b
Thanks
z
(Or: a CRASH is a failure of the Prefect orchestration engine, which means retrying isn't possible)
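(For reference, a minimal sketch of the `retries` setting being discussed, based on Prefect 2's flow decorator; the retry count and delay below are just illustrative placeholders.)

```python
from prefect import flow


# retries only applies when the flow run ends in a Failed state,
# e.g. an exception raised inside the flow body. Crashed runs,
# where the infrastructure itself went away, are not retried by this setting.
@flow(retries=2, retry_delay_seconds=30)
def my_flow():
    ...
```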
b
Ahh, I think that's what's happening to me. Is a retry possible through automations?
t
I was always missing this viewpoint (failure of infrastructure vs. failure of task). I have a setup where a flow submits many tasks to a `DaskTaskRunner`, which uses `dask_jobqueue.SLURMCluster` to obtain compute resources. The nice thing about this setup is the dynamic scaling of resources, i.e. Dask can request more compute nodes dynamically. However, SLURM also limits the wall-time for every provided resource, which can lead to the following scenario:
• task is transferred to a SLURM-backed worker and starts running
• SLURM takes back the resources and shuts down the worker
• task run crashes due to infrastructure failure (which now makes a lot of sense to me)
When some tasks that were never started are still left, the `DaskTaskRunner` will automatically spin up a new worker on a fresh SLURM node and submit them there. However, the single crashed task will not be re-submitted and fails the whole workflow. The one work-around I've found so far uses sub-flows and groups the tasks into batches that should fit within the time limit. But this feels a bit counter to Prefect, since I have to "think" about the correct orchestration. I would be curious to hear your insights on this and what the recommended pattern would be.
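(For concreteness, a minimal sketch of the kind of setup described above, assuming `prefect-dask`'s `DaskTaskRunner` with `cluster_class`, `cluster_kwargs`, and `adapt_kwargs`; the queue name, resources, and walltime are placeholder values, not details from this thread.)

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner

# Hypothetical SLURM-backed task runner: Dask requests workers through
# dask_jobqueue.SLURMCluster, and SLURM enforces a walltime on each worker.
slurm_runner = DaskTaskRunner(
    cluster_class="dask_jobqueue.SLURMCluster",
    cluster_kwargs={
        "queue": "normal",        # placeholder partition name
        "cores": 8,
        "memory": "16GB",
        "walltime": "01:00:00",   # tasks running past this get cut off
    },
    adapt_kwargs={"minimum": 1, "maximum": 10},  # dynamic scaling
)


@task
def process(item):
    ...  # long-running work; may be interrupted if the worker hits its walltime


@flow(task_runner=slurm_runner)
def main(items):
    for item in items:
        process.submit(item)
```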
p
Hi Ben, yes, a retry option is possible through automations. We are currently using it. You can "infer" which flow has crashed, then rerun it.
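(A rough sketch of the "infer and rerun" idea using the Prefect 2 client API rather than a UI-configured automation; the exact import paths and filter classes vary between Prefect 2 versions, and the deployment-based rerun is an assumption, not necessarily the poster's exact setup.)

```python
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType


async def rerun_crashed_flow_runs():
    async with get_client() as client:
        # Find flow runs that ended in a Crashed state.
        crashed = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.CRASHED])
                )
            )
        )
        for run in crashed:
            if run.deployment_id is None:
                continue  # can only re-run flows that came from a deployment
            # Kick off a fresh run of the same deployment with the same parameters.
            await client.create_flow_run_from_deployment(
                deployment_id=run.deployment_id,
                parameters=run.parameters,
            )


if __name__ == "__main__":
    asyncio.run(rerun_crashed_flow_runs())
```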