Prefect Community #ask-community

Florian Kühnlenz

02/24/2022, 2:45 PM

Hi, we are using Prefect together with Dask to schedule ML training runs on a temporary cluster. For big grid searches, we see that the Dask workers accumulate memory over time and then stop accepting tasks. I assume this is similar to this thread. However, none of the recommended solutions helped. Has anyone had a similar experience? Could this be related to Prefect?

Kevin Kho

02/24/2022, 2:47 PM

It could be related to Prefect if you map over something on the order of 100k items. Actually, are you training models like XGBoost or LightGBM? Anything with n_jobs?

Kevin Kho

02/24/2022, 2:48 PM

Also, results are held in memory, so what do the tasks output?

Kevin Kho

02/24/2022, 2:49 PM

XGBoost and LightGBM try to use all of the available cores on the machine they run on. If you send, say, two training jobs to one worker and they compete for the CPUs, you can get resource contention and the jobs just deadlock.
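To make the CPU competition point concrete, here is a minimal sketch of splitting a worker's cores evenly between jobs that run on it at the same time. The helper name `threads_per_job` and the numbers are hypothetical; the assumption is that the result would be passed as `n_jobs` to the estimator, which this thread does not confirm was ever needed.

```python
import os


def threads_per_job(concurrent_jobs, total_cores=None):
    """Split the machine's cores evenly across jobs sharing one worker."""
    if concurrent_jobs < 1:
        raise ValueError("concurrent_jobs must be at least 1")
    if total_cores is None:
        # os.cpu_count() can return None, so fall back to a single core.
        total_cores = os.cpu_count() or 1
    # Never go below one thread, even if jobs outnumber cores.
    return max(1, total_cores // concurrent_jobs)


# Hypothetical setup: two training jobs share a worker with 8 cores.
n_jobs = threads_per_job(concurrent_jobs=2, total_cores=8)
print(n_jobs)  # 4
```

The value would then replace the library default, for example as the `n_jobs` argument of an XGBoost or LightGBM scikit-learn style estimator, so that two jobs on one worker do not each try to claim every core.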

Florian Kühnlenz

02/25/2022, 1:01 PM

Sorry for the lag. The tasks do not have any output, in fact. They just log to MLflow after training and then are done. There should be no competition for compute resources. The only problem is memory buildup.

Florian Kühnlenz

02/25/2022, 1:04 PM

But I have been wondering if there is some element of the flow that is present on all workers and holds on to some data, and that's why we see no reduction in memory pressure when we run gc.collect().

Kevin Kho

02/25/2022, 4:00 PM

Have you seen this video? There is also this, which was recently merged, and I think it might affect this.

Florian Kühnlenz

02/26/2022, 7:50 AM

Yes, I watched it, but none of the suggestions helped us debug our case. If there is nothing Prefect-specific that you know about, then we will have to dig into it ourselves.

Florian Kühnlenz

02/26/2022, 10:42 AM

I found this issue which sounds highly related: https://github.com/PrefectHQ/prefect/issues/3238

Kevin Kho

02/26/2022, 2:02 PM

Ah, OK. I thought this would only apply to mapping over a very large number of items. We're looking into it for Orion, though, and then maybe we will backport a solution.

Florian Kühnlenz

02/26/2022, 2:43 PM

No, I think the number of mapped items is rather low, but the tasks are long-running.
