Hi everyone I have a question regarding crashed tasks and ta Prefect Community #ask-community

Giorgio Basile

05/24/2023, 2:12 PM

Hi everyone, I have a question regarding crashed tasks and task retries on Prefect 2.10.10. I noticed that task retries only work for tasks that end up in the FAILED state, not for tasks in the CRASHED state. Why is that, and is there any option to make a task retry after a crash? Thanks!

Zanie

05/24/2023, 3:51 PM

CRASHES indicate a failure of the infrastructure or the Prefect engine itself — so we can’t retry those.

Giorgio Basile

05/25/2023, 7:11 AM

Thank you for clarifying. Just one more comment on this: the task is part of a flow executed by an agent using a Process infrastructure. Although the task crashes, the flow itself doesn't crash. It continues executing normally and handles the fact that the task did not complete. What is happening exactly? Is the task executed in a different process than the flow it belongs to, so that the flow can continue regardless of the task's crash? Thanks again!

Zanie

05/25/2023, 2:20 PM

What’s the message / error for the crashed task?

Zanie

05/25/2023, 2:20 PM

Which task runner are you using?

Giorgio Basile

05/25/2023, 2:22 PM

I'm using the ConcurrentTaskRunner and this is the error I'm getting:

Zanie

05/25/2023, 2:24 PM

Interesting that it’s getting cancelled — are you using async with timeouts?

Giorgio Basile

05/25/2023, 2:31 PM

The task is a "normal" function: no async, and a very long timeout that is never reached. What I'm still investigating, and what makes this quite misleading, is that the Prefect task opens a connection to a Dask cluster and executes a graph. One of the Dask tasks in the graph is the store-map you see in the log. So it seems that the Dask task is cancelled at the Dask level, and the cancellation propagates to the Prefect task, which crashes as a result :(

Zanie

05/25/2023, 2:33 PM

Ah that makes more sense

Zanie

05/25/2023, 2:34 PM

I’d recommend catching the cancelled error that Dask is throwing

Zanie

05/25/2023, 2:35 PM

We special case cancellation as a CRASH instead of a FAILURE because often the expectation is that you should not retry on cancellation.

Zanie

05/25/2023, 2:35 PM

If you want to retry, you can just raise an exception again as a different type or return a failed state manually.
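
The first option above (re-raising the cancellation as a different exception type) could be sketched as follows. This is a minimal illustration, not Prefect's or Dask's own code: `RetriableCancellation`, `reraise_cancellation`, and `run_graph` are hypothetical names, and it assumes the cancellation surfaces as `concurrent.futures.CancelledError` or `asyncio.CancelledError`, which the thread does not confirm because the actual error message is not shown.

```python
import asyncio
import concurrent.futures
import functools


class RetriableCancellation(Exception):
    """Hypothetical wrapper type: an ordinary Exception rather than a cancellation."""


# Assumption: the cancellation reaches the task as one of these two types.
CANCELLATION_TYPES = (concurrent.futures.CancelledError, asyncio.CancelledError)


def reraise_cancellation(fn):
    """Convert a cancellation raised inside fn into RetriableCancellation."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except CANCELLATION_TYPES as exc:
            # Chain the original error so it still appears in the traceback.
            raise RetriableCancellation(f"{fn.__name__} was cancelled") from exc

    return wrapper


# With Prefect installed, the task decorator would be stacked on top, e.g.:
#   @task(retries=3, retry_delay_seconds=10)
@reraise_cancellation
def run_graph(compute):
    # `compute` stands in for the call that executes the Dask graph.
    return compute()
```

Because the re-raised error is a regular exception, the task should end up FAILED rather than CRASHED, so the configured retries can apply. For the second option, Prefect 2 provides `prefect.states.Failed`, which a task can return instead of raising.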

Giorgio Basile

05/25/2023, 2:40 PM

At the same time, I cannot find that exception in the Dask scheduler or worker logs, which is pretty weird too. Regardless, thank you for taking the time to explain the rationale behind assigning the crashed state :)

Zanie

05/25/2023, 2:41 PM

Was there a worker eviction or something?

Zanie

05/25/2023, 2:41 PM

The cancelled error is kind of an internal Dask error that’s being thrown when the task is moved to another worker, I think.

Giorgio Basile

05/25/2023, 2:49 PM

I cannot see anything suspicious in the Dask logs unfortunately. I'm not sure this happens in the situation you are describing, but I'll keep investigating, thank you again
