Ben Muller
02/07/2023, 6:59 PM
Christopher Boyd
02/07/2023, 7:00 PM
Ben Muller
02/07/2023, 7:01 PM
Zanie
02/07/2023, 8:23 PM
Ben Muller
02/07/2023, 9:09 PM
Tim-Oliver
02/08/2023, 8:07 AM
My setup uses the DaskTaskRunner, which uses dask_jobqueue.SLURMCluster to obtain compute resources (a rough configuration sketch is further below). The nice thing about this setup is the dynamic scaling of resources, i.e. Dask can request more compute nodes dynamically. However, SLURM also limits the wall-time of every provided resource, which can lead to the following scenario:
• task is transferred to a SLURM-backed worker and starts running
• SLURM takes back the resources and shuts down the worker
• the task run crashes due to an infrastructure failure (which now makes a lot of sense to me)
When some tasks that have not started yet are still left, the DaskTaskRunner will automatically spin up a new worker on a fresh SLURM node and submit those tasks there. However, the single crashed task will not be re-submitted, and it fails the whole workflow.
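For context, this is roughly what the setup looks like. It is only a sketch assuming Prefect 2.x with the prefect-dask and dask-jobqueue packages; the queue name, resources, and wall-time are placeholder values, not my real configuration:
```python
# Rough sketch of the setup: Prefect 2.x with prefect-dask and dask-jobqueue.
# Queue name, resources, and wall-time are placeholder values.
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task
def process(item: int) -> int:
    # Stand-in for the real work that runs on a SLURM-backed Dask worker.
    return item * 2


@flow(
    task_runner=DaskTaskRunner(
        cluster_class="dask_jobqueue.SLURMCluster",
        cluster_kwargs={
            "queue": "normal",       # placeholder partition name
            "cores": 4,
            "memory": "16GB",
            "walltime": "01:00:00",  # SLURM reclaims the node after this
        },
        # adaptive scaling: Dask requests more SLURM jobs as tasks queue up
        adapt_kwargs={"minimum": 1, "maximum": 10},
    )
)
def my_flow(items: list[int]) -> list[int]:
    futures = [process.submit(i) for i in items]
    return [f.result() for f in futures]
```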
The one work-around I have found so far uses sub-flows to split the tasks into batches that should fit within the time limit. But this feels a bit counter-prefect, since I have to "think" about the correct orchestration myself.
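Roughly, that work-around looks like the sketch below; the batch size and cluster resources are placeholders, chosen so a single batch finishes within the wall-time:
```python
# Rough sketch of the sub-flow batching work-around. batch_size and the
# cluster resources are placeholders sized to fit the SLURM wall-time.
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task
def process(item: int) -> int:
    return item * 2


@flow(
    task_runner=DaskTaskRunner(
        cluster_class="dask_jobqueue.SLURMCluster",
        cluster_kwargs={"cores": 4, "memory": "16GB", "walltime": "01:00:00"},
    )
)
def process_batch(batch: list[int]) -> list[int]:
    # All tasks in one batch run on this sub-flow's own Dask/SLURM cluster
    # and should finish before SLURM reclaims the nodes.
    futures = [process.submit(i) for i in batch]
    return [f.result() for f in futures]


@flow
def parent_flow(items: list[int], batch_size: int = 50) -> list[int]:
    results: list[int] = []
    for start in range(0, len(items), batch_size):
        results.extend(process_batch(items[start : start + batch_size]))
    return results
```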
I would be curious to hear your insights on this and what the recommended pattern would be.
Parwez Noori
02/08/2023, 10:52 AM