Lukas Brower

    9 months ago
    Hey everyone, my team is seeing issues with executing long-running flows, and we aren’t sure whether the issue lies in Prefect or in the underlying dask scheduler. Some of our flow runs take upwards of 12 hours to complete. These runs reliably fail somewhere between the 11.5 and 12 hour mark with a top-level
    concurrent.futures._base.CancelledError
    We are trying to determine if there is something in Prefect or dask which kills the dask scheduler when execution nears the 12 hour mark. I don’t see any major resource constraints in workers at the time of the flow cancellations, so I figured I would start here and ask if there is anything special about the 12 hour mark in Prefect, or if this is likely a dask-specific issue.
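For context on what the error class itself means, here is a minimal standard-library sketch (not Dask's or Prefect's own code). It only illustrates that `CancelledError` is raised when a result is requested from a future that was cancelled, as opposed to a future whose task raised its own exception. The helper name `describe_outcome` is hypothetical, and the sketch says nothing about what cancels the futures in this thread.

```python
from concurrent.futures import CancelledError, Future


def describe_outcome(future: Future) -> str:
    """Report whether a completed future finished, failed, or was cancelled."""
    try:
        return f"finished: {future.result(timeout=0)}"
    except CancelledError:
        # Raised because the future was cancelled, not because the task errored.
        return "cancelled"
    except Exception as exc:
        # Any exception raised by the task itself is re-raised by result().
        return f"failed: {type(exc).__name__}"


# A future that completed normally.
done = Future()
done.set_result(42)

# A future whose task raised an exception.
failed = Future()
failed.set_exception(RuntimeError("boom"))

# A pending future that was cancelled before it ever ran.
cancelled = Future()
cancelled.cancel()

for fut in (done, failed, cancelled):
    print(describe_outcome(fut))
```

A top-level `CancelledError` therefore suggests that something cancelled the pending work, rather than the task code raising an error of its own.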
    Kevin Kho

    9 months ago
    Hi @Lukas Brower, Prefect has no default timeouts built in, and this error really looks like a Dask one. Are you mapping a large number of tasks? That can kill the scheduler, because there is some memory bloat with the DaskExecutor that we are working on.
    By a large number I mean roughly 50k tasks or more. If you just have long tasks, then I’m not seeing anything on the Prefect side.
    Lukas Brower

    9 months ago
    Got it, thanks for the info Kevin. We only have ~10 tasks max running at a time usually, and I have seen this fail even when just a single worker is executing a long running task and all others are idle. I will focus on looking into the dask side of things.
    Anna Geller

    9 months ago
    @Lukas Brower not sure if this helps, but Dask performance reports were recently added to Prefect. Sharing in case it helps with debugging: https://docs.prefect.io/orchestration/flow_config/executors.html#performance-reports
    Lukas Brower

    9 months ago
    Awesome, I’ll try that. Thanks, Anna!