Lukas Brower

12/06/2021, 3:06 PM
Hey everyone, my team is seeing issues with executing long-running flows, and we aren’t sure whether the issue lies in Prefect or the underlying Dask scheduler. We have some flow runs that at times take upwards of 12 hours to complete. These runs reliably fail somewhere between the 11:30 and 12 hour mark with a top level
We are trying to determine if there is something in Prefect or dask which kills the dask scheduler when execution nears the 12 hour mark. I don’t see any major resource constraints in workers at the time of the flow cancellations, so I figured I would start here and ask if there is anything special about the 12 hour mark in Prefect, or if this is likely a dask-specific issue.

Kevin Kho

12/06/2021, 3:10 PM
Hi @Lukas Brower, Prefect has no default timeouts built in, and this error really looks like a Dask one. Are you mapping a large number of tasks? That can kill the scheduler because there is some memory bloat with the DaskExecutor that we are working on.
By a large number I mean maybe 50k tasks and above. If you just have long tasks, then I’m not seeing anything on the Prefect side.

Lukas Brower

12/06/2021, 3:14 PM
Got it, thanks for the info Kevin. We usually only have ~10 tasks max running at a time, and I have seen this fail even when just a single worker is executing a long-running task and all others are idle. I will focus on looking into the Dask side of things.

Anna Geller

12/06/2021, 3:16 PM
@Lukas Brower not sure if this will help, but Dask performance reports were recently added to Prefect. Sharing in case it’s useful for debugging.
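For reference, a minimal sketch of how such a report might be enabled. It assumes Prefect 1.x, where `DaskExecutor` accepts a `performance_report_path` argument; the report file name and the `build_executor` helper are hypothetical, and the thread itself does not show any configuration.

```python
# Sketch only: assumes Prefect 1.x, where DaskExecutor accepts
# performance_report_path. The file name below is a hypothetical placeholder.
REPORT_PATH = "dask-performance-report.html"


def build_executor(report_path: str = REPORT_PATH):
    """Return a DaskExecutor configured to write a Dask performance report."""
    # Imported lazily so this snippet loads even where Prefect is not installed.
    from prefect.executors import DaskExecutor

    return DaskExecutor(performance_report_path=report_path)


# Usage, given an existing Prefect flow object named `flow` (hypothetical):
#   flow.executor = build_executor()
#   flow.run()
```

The report is an HTML file that can be opened in a browser after the run to inspect scheduler and worker activity over time.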

Lukas Brower

12/06/2021, 3:46 PM
Awesome I’ll try that, thanks Anna