Ben Muller
02/07/2023, 6:59 PM
Christopher Boyd
02/07/2023, 7:00 PM
Ben Muller
02/07/2023, 7:01 PM
Zanie
02/07/2023, 8:23 PM
Ben Muller
02/07/2023, 9:09 PM
Tim-Oliver
02/08/2023, 8:07 AM
My setup uses the DaskTaskRunner, which uses dask_jobqueue.SLURMCluster to obtain compute resources (a rough configuration sketch is further below). The nice thing about this setup is the dynamic scaling of resources, i.e. Dask can request more compute nodes dynamically. However, SLURM also limits the wall-time of every provided resource, which can lead to the following scenario:
• task is transferred to a SLURM-backed worker and starts running
• SLURM takes back the resources and shuts down the worker
• the task run crashes due to an infrastructure failure (which now makes a lot of sense to me)
When some tasks that have not started yet are still left, the DaskTaskRunner will automatically spin up a new worker on a fresh SLURM node and submit those tasks there. However, the single crashed task will not be re-submitted, and it fails the whole workflow.
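For context, this is roughly what the setup looks like. It is only a sketch assuming Prefect 2.x with the prefect-dask and dask-jobqueue packages; the queue name, resources, and wall-time are placeholder values, not my real configuration:
```python
# Rough sketch of the setup: Prefect 2.x with prefect-dask and dask-jobqueue.
# Queue name, resources, and wall-time are placeholder values.
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task
def process(item: int) -> int:
    # Stand-in for the real work that runs on a SLURM-backed Dask worker.
    return item * 2


@flow(
    task_runner=DaskTaskRunner(
        cluster_class="dask_jobqueue.SLURMCluster",
        cluster_kwargs={
            "queue": "normal",       # placeholder partition name
            "cores": 4,
            "memory": "16GB",
            "walltime": "01:00:00",  # SLURM reclaims the node after this
        },
        # adaptive scaling: Dask requests more SLURM jobs as tasks queue up
        adapt_kwargs={"minimum": 1, "maximum": 10},
    )
)
def my_flow(items: list[int]) -> list[int]:
    futures = [process.submit(i) for i in items]
    return [f.result() for f in futures]
```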
The one work-around I have found so far uses sub-flows to split the tasks into batches that should fit within the time limit. But this feels a bit counter-prefect, since I have to "think" about the correct orchestration myself.
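Roughly, that work-around looks like the sketch below; the batch size and cluster resources are placeholders, chosen so a single batch finishes within the wall-time:
```python
# Rough sketch of the sub-flow batching work-around. batch_size and the
# cluster resources are placeholders sized to fit the SLURM wall-time.
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task
def process(item: int) -> int:
    return item * 2


@flow(
    task_runner=DaskTaskRunner(
        cluster_class="dask_jobqueue.SLURMCluster",
        cluster_kwargs={"cores": 4, "memory": "16GB", "walltime": "01:00:00"},
    )
)
def process_batch(batch: list[int]) -> list[int]:
    # All tasks in one batch run on this sub-flow's own Dask/SLURM cluster
    # and should finish before SLURM reclaims the nodes.
    futures = [process.submit(i) for i in batch]
    return [f.result() for f in futures]


@flow
def parent_flow(items: list[int], batch_size: int = 50) -> list[int]:
    results: list[int] = []
    for start in range(0, len(items), batch_size):
        results.extend(process_batch(items[start : start + batch_size]))
    return results
```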
I would be curious to hear your insights on this and what the recommended pattern would be.
Parwez Noori
02/08/2023, 10:52 AM