What is the best practice to restart tasks which c...
# best-practices
What is the best practice for restarting tasks that crashed due to resource timeouts? I am using the `DaskTaskRunner` with `dask_jobqueue.SLURMCluster`. In this setup, Dask requests compute resources that have a run-time limit. When the limit is reached, the resources are taken away and new ones are acquired. If a task is running when its resource goes down, it crashes. What I would like to do is resubmit the crashed task run so that it executes on the newly acquired compute resources.
Adding `try-except` works, but does not feel like the prettiest solution. 😇
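
For context, the try-except workaround looks roughly like the sketch below. It is plain Python and does not use any Prefect or Dask API; the names `run_with_resubmit` and `flaky_task` are hypothetical, and the simulated `RuntimeError` is only a stand-in for whatever error a lost worker actually raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_resubmit(
    fn: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
    retry_on: tuple = (Exception,),
) -> T:
    """Call fn, calling it again if it raises one of the retry_on exceptions."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait before trying again, e.g. while new resources are acquired.
            time.sleep(delay_seconds)
    raise ValueError("max_attempts must be at least 1")


# Hypothetical stand-in for a task whose worker goes down on the first two runs.
calls = {"count": 0}


def flaky_task() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("worker lost")  # simulated crash, not a real Dask error
    return "done"


result = run_with_resubmit(flaky_task, max_attempts=5)
```

Here `flaky_task` fails twice and succeeds on the third call. In practice `retry_on` would probably need to be narrowed to the specific exception raised when a worker disappears, so that genuine bugs in the task are not retried as well.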