# ask-community
g
Hi everyone, I have a question regarding crashed tasks and task retries on Prefect 2.10.10. I noticed that task retries only work for tasks that enter the FAILED state, not for tasks that end up CRASHED. I was wondering why that is, and whether I have any option to make a task retry after a crash. Thanks!
z
CRASHES indicate a failure of the infrastructure or the Prefect engine itself — so we can’t retry those.
g
Thank you for clarifying. Just one more comment on this: the task is part of a flow executed by an agent using Process infrastructure. Although the task has crashed, the flow doesn't crash; it continues executing normally and handles the fact that the task did not complete. What is happening exactly? Is the task executed in a different process than the flow it belongs to, so that the flow can continue regardless of the task's crash? Thanks again!
z
What’s the message / error for the crashed task?
Which task runner are you using?
g
I'm using the ConcurrentTaskRunner and this is the error I'm getting:
z
Interesting that it’s getting cancelled — are you using async with timeouts?
g
The task is a "normal" function - no async, with a very long timeout - which is not reached. Something I'm still investigating, and which is quite confusing, is that the Prefect task opens a connection to a Dask cluster and executes a graph. One of the Dask tasks in the graph is the store-map you see in the log. So it seems that the Dask task is cancelled at the Dask level, and the cancellation propagates to the Prefect task, which crashes as a result :(
z
Ah that makes more sense
I’d recommend catching the cancelled error that Dask is throwing
We special-case cancellation as a CRASH instead of a FAILURE because often the expectation is that you should not retry on cancellation.
If you want to retry, you can catch the cancelled error and re-raise it as a different exception type, or return a failed state manually.
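A minimal sketch of the "re-raise as a different type" suggestion, using only the standard library so it runs without Prefect or Dask. The names `DaskComputationCancelled`, `cancelled_as_failure`, and `run_graph` are hypothetical, and which `CancelledError` class Dask actually surfaces here is an assumption (the thread doesn't show the traceback), so the wrapper catches both the `concurrent.futures` and `asyncio` variants.

```python
import asyncio
import concurrent.futures
import functools


class DaskComputationCancelled(RuntimeError):
    """Hypothetical exception type: turns a cancellation into an ordinary error."""


def cancelled_as_failure(fn):
    """Re-raise cancellation errors from `fn` as DaskComputationCancelled."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except (concurrent.futures.CancelledError, asyncio.CancelledError) as exc:
            # Assumption: the cancellation coming out of Dask is one of these two
            # classes. Raising a regular exception instead should let the task
            # end up FAILED rather than CRASHED, so retries can apply.
            raise DaskComputationCancelled(f"{fn.__name__} was cancelled") from exc

    return wrapper


# With Prefect installed, the wrapper would sit underneath the task decorator:
#
#     from prefect import task
#
#     @task(retries=2, retry_delay_seconds=30)
#     @cancelled_as_failure
#     def run_graph(): ...


@cancelled_as_failure
def run_graph():
    # Stand-in for submitting the graph to the Dask cluster; this simulates
    # the cancellation instead of contacting a real cluster.
    raise concurrent.futures.CancelledError("store-map")
```

Calling `run_graph()` here raises `DaskComputationCancelled`, with the original cancellation kept as `__cause__` for debugging. The other option mentioned above, returning a failed state manually, is not shown.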
g
At the same time, I cannot find that exception in the dask scheduler/workers logs, and that's pretty weird too. Regardless, thank you for taking the time and explaining the rationale behind the assignment of the crashed state :)
z
Was there a worker eviction or something?
The cancelled error is kind of an internal Dask error that’s being thrown when the task is moved to another worker, I think.
g
I cannot see anything suspicious in the Dask logs unfortunately. I'm not sure this happens in the situation you are describing, but I'll keep investigating, thank you again