
Ben Muller

02/07/2023, 6:59 PM
Hey, just confirming because the docs aren't super clear. For the flow decorator's retries argument, the docs say: 'An optional number of times to retry on flow run failure.' Can I confirm that this does NOT include flow runs that end in a Crashed state?

Christopher Boyd

02/07/2023, 7:00 PM
Crash is an infrastructure event
Fail is a flow event
that is correct
We don’t retry crashes
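To make the Failed vs. Crashed distinction above concrete, here is a toy model in plain Python. It is not Prefect's engine: `State`, `InfrastructureLost`, and `run_with_retries` are hypothetical names made up for this sketch. It only assumes what is said in the thread, namely that an error raised by the flow's own code counts as a failure and is retried, while an interruption from outside counts as a crash and is not.

```python
from enum import Enum
from typing import Callable, Tuple


class State(Enum):
    COMPLETED = "Completed"
    FAILED = "Failed"
    CRASHED = "Crashed"


class InfrastructureLost(BaseException):
    """Hypothetical stand-in for the process or node disappearing."""


def run_with_retries(fn: Callable[[], object], retries: int) -> Tuple[State, int]:
    """Run fn and return (final state, number of attempts made)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            fn()
            return State.COMPLETED, attempts
        except Exception:
            # Error raised by the flow's own code: eligible for a retry.
            if attempts > retries:
                return State.FAILED, attempts
        except InfrastructureLost:
            # Interrupted from outside: the retry budget is never consulted.
            return State.CRASHED, attempts


def always_fails() -> None:
    raise ValueError("bad input")


def gets_killed() -> None:
    raise InfrastructureLost()


failed_result = run_with_retries(always_fails, retries=2)
crashed_result = run_with_retries(gets_killed, retries=2)
print(failed_result, crashed_result)
```

With `retries=2`, the failing function is attempted three times before ending as Failed, while the crashing one stops after a single attempt.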

Ben Muller

02/07/2023, 7:01 PM
Thanks

Zanie

02/07/2023, 8:23 PM
(Or a CRASH is a failure of the Prefect orchestration engine, which means retrying isn't possible.)

Ben Muller

02/07/2023, 9:09 PM
Ahh I think that's what's happening to me. Is the retry possible through automations?

Tim-Oliver

02/08/2023, 8:07 AM
I was always missing this viewpoint (failure of infrastructure vs. failure of task). I have a setup where a flow submits many tasks to a `DaskTaskRunner`, which uses `dask_jobqueue.SLURMCluster` to obtain compute resources. The nice thing about this setup is the dynamic scaling of resources, i.e. Dask can request more compute nodes dynamically. However, SLURM also limits the wall time of every resource it provides, which can lead to the following scenario:
• a task is transferred to a SLURM-backed worker and starts running
• SLURM takes back the resources and shuts down the worker
• the task run crashes due to infrastructure failure (which now makes a lot of sense to me)
If some tasks that have not yet started are still left, the `DaskTaskRunner` will automatically spin up a new worker on a fresh SLURM node and submit the tasks there. However, the single crashed task will not be resubmitted and fails the whole workflow. I found one workaround so far, which uses sub-flows and groups the tasks into batches that should fit within the time limit. But this feels a bit counter-Prefect, since I have to "think" about the correct orchestration. I would be curious to hear your insights on this and what the recommended pattern would be.
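The batching workaround described above could be sketched roughly like this. It is a plain-Python helper only: `batch_for_walltime`, the per-task time estimate, and the safety factor are all assumptions made for illustration, not part of Prefect, Dask, or SLURM. Each returned batch would then be handed to its own sub-flow so that it finishes before the worker's wall time runs out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def batch_for_walltime(
    items: Sequence[T],
    est_seconds_per_item: float,
    walltime_seconds: float,
    n_workers: int = 1,
    safety_factor: float = 0.8,
) -> List[List[T]]:
    """Split items into batches expected to finish within the wall time.

    Assumes every item takes roughly est_seconds_per_item and that the
    items in a batch are spread evenly over n_workers workers.
    """
    # Leave headroom so a slow item does not push a batch over the limit.
    budget = walltime_seconds * safety_factor
    per_worker = int(budget // est_seconds_per_item)
    if per_worker < 1:
        raise ValueError("a single item does not fit within the wall time")
    batch_size = per_worker * n_workers
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


# Example: 10 items of about 50 s each, workers limited to 300 s.
example_batches = batch_for_walltime(list(range(10)), 50, 300)
print(example_batches)
```

The drawback raised in the message still applies: the estimate per item has to come from the user, so the orchestration is only as good as that guess.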

Parwez Noori

02/08/2023, 10:52 AM
Hi Ben, yes, a retry is possible through automations. We are currently using it. You can "infer" which flow has crashed, then rerun it.