Hey everyone, my team is seeing issues with executing long-running flows, and we aren’t sure if the issue lies in Prefect or the underlying dask scheduler. We have some flow runs that take upwards of 12 hours to complete at times. These runs reliably fail somewhere between the 11:30 and 12 hour mark with a top-level
concurrent.futures._base.CancelledError
We are trying to determine if there is something in Prefect or dask which kills the dask scheduler when execution nears the 12 hour mark. I don’t see any major resource constraints in workers at the time of the flow cancellations, so I figured I would start here and ask if there is anything special about the 12 hour mark in Prefect, or if this is likely a dask-specific issue.
Kevin Kho
12/06/2021, 3:10 PM
Hi @Lukas Brower, Prefect has no default timeouts built in and this error really looks like a Dask one. Are you mapping a large number of tasks? That can kill the scheduler because there is some memory bloat with the DaskExecutor that we are working on.
Kevin Kho
12/06/2021, 3:11 PM
By large I mean maybe 50k tasks and above. If you just have long tasks, then I’m not seeing anything on the Prefect side.
Lukas Brower
12/06/2021, 3:14 PM
Got it, thanks for the info Kevin. We usually only have ~10 tasks max running at a time, and I have seen this fail even when just a single worker is executing a long-running task and all others are idle. I will focus on looking into the dask side of things.
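For anyone following along with the same symptom, here is a rough sketch of one way to confirm whether cancellations really cluster near a fixed elapsed time. It assumes the error surfaces as the `concurrent.futures` `CancelledError` shown in the traceback above. The helper name `wait_and_report` is hypothetical, and a plain standard-library `Future` stands in for whatever future the executor actually hands back.

```python
import concurrent.futures as cf
import time


def wait_and_report(future, started_at, clock=time.monotonic):
    """Wait on a future; if it was cancelled, report elapsed hours instead.

    Hypothetical helper for illustration only, not part of Prefect or Dask.
    """
    try:
        return ("ok", future.result())
    except cf.CancelledError:
        # Record how long the run had been going when the cancellation hit.
        elapsed_hours = (clock() - started_at) / 3600
        return ("cancelled", elapsed_hours)


# Stand-in demonstration: a stdlib Future that we cancel ourselves.
started_at = time.monotonic()
future = cf.Future()
future.cancel()  # simulates whatever cancels the work in the real run

status, detail = wait_and_report(future, started_at)
print(status, f"after {detail:.4f} hours")
```

Logging the elapsed time across several failed runs would show whether the failures sit at one consistent duration, which would point toward a timeout or lifetime setting somewhere in the stack rather than a resource problem.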