We have some Prefect flows that sometimes need to run for many hours. Once one of these flows has been running for more than 12 hours, we often see it fail before it completes; the Prefect UI shows its last state message as “`Unexpected error: CancelledError()`”. This doesn’t happen as a result of the code we’ve written to launch or monitor flows. It appears to be the result of an action that Prefect (or Dask?) takes to automatically cancel long-running flows. However, I don’t see anything in the Prefect or Dask docs indicating that this is expected behavior, or how it could be controlled (e.g., disabling it or increasing the allowable duration).
Can anybody provide any guidance on how to deal with flows failing with this error? Any clues on how we can configure Prefect or Dask to allow flows to run past the 12-hour mark?
08/03/2021, 7:50 PM
Hey @Matt Klein, maybe this is due to a long running Dask task without output. See this . Maybe you can try increasing the timeout or occasionally logging? Does your flow have no output?
Could be this also where something is indeed failing?
It looks like the error itself is vague and can mean a few different things. What is your flow doing? Is it resource-intensive enough to cause memory issues? Does the failure always happen after the same amount of time? Can it be reproduced with a shorter run?
08/03/2021, 8:04 PM
Thanks much for the tips @Kevin Kho. We do have long periods where this task isn’t writing any output, so logging something periodically and bumping up the Dask idle timeout both sound like good leads. Will try those!
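For anyone landing here later, the "occasionally logging" idea could look roughly like the sketch below. It uses only the Python standard library; `heartbeat` and `long_running_step` are hypothetical names made up for illustration, not Prefect or Dask APIs. Whether a periodic log line actually prevents the `CancelledError` depends on what is cancelling the run, so treat this as something to try rather than a confirmed fix.

```python
import logging
import threading
import time
from contextlib import contextmanager


@contextmanager
def heartbeat(logger, interval_seconds=60.0, message="still running"):
    """Log `message` every `interval_seconds` while the with-block runs.

    Hypothetical helper for illustration; not part of Prefect or Dask.
    """
    stop = threading.Event()

    def _beat():
        # Event.wait() returns False on timeout and True once stop is set,
        # so this loop emits one log line per interval until we are told to stop.
        while not stop.wait(interval_seconds):
            logger.info(message)

    thread = threading.Thread(target=_beat, daemon=True)
    thread.start()
    try:
        yield
    finally:
        # Always stop the background thread, even if the work raised.
        stop.set()
        thread.join()


def long_running_step(duration_seconds):
    """Stand-in for a task that runs for a long time without producing output."""
    time.sleep(duration_seconds)
    return "done"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("heartbeat-demo")
    # Short intervals here only so the demo finishes quickly.
    with heartbeat(log, interval_seconds=0.1):
        print(long_running_step(0.35))
```

Inside a Prefect task you would presumably pass the task's own logger instead of the demo logger, and pick an interval of minutes rather than fractions of a second. The idle timeout mentioned above is, as far as I know, the `distributed.scheduler.idle-timeout` key in the Dask configuration, but check the Dask docs for your version before relying on that.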