Hi, we are using Prefect together with Dask to schedule ML training runs. We use a temporary cluster. For big grid searches we see that the Dask workers accumulate memory over time and then stop accepting tasks. I assume this is similar to this thread.
However, none of the recommended solutions helped. Has anyone had a similar experience? Could this be related to Prefect?
02/24/2022, 2:47 PM
It could be related to Prefect if you map over something on the order of 100k items. Actually, are you training models like XGBoost or LightGBM? Anything with
Also, results are held in memory, so what do the tasks output?
XGBoost and LightGBM try to use all of the available cores on the machine they run on. If you send, say, two training jobs to one worker and they compete for the CPUs, you can get resource contention and the jobs just deadlock.
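One way to rule out that kind of CPU contention is to cap the threads each training job may use. The snippet below is only a sketch under assumptions: the helper names and the 8-core / 2-task numbers are hypothetical, and it relies on `OMP_NUM_THREADS` being read by the OpenMP runtime that these libraries typically build on.

```python
import os


def threads_per_task(total_cores: int, concurrent_tasks: int) -> int:
    """Split the cores of one worker evenly among the tasks that run at once."""
    if total_cores < 1 or concurrent_tasks < 1:
        raise ValueError("total_cores and concurrent_tasks must be >= 1")
    # Never go below one thread, even if tasks outnumber cores.
    return max(1, total_cores // concurrent_tasks)


def limit_native_threads(n_threads: int) -> None:
    """Cap OpenMP-based libraries.

    This has to run before the library starts its thread pool,
    so it is best called at the very start of the task or worker.
    """
    os.environ["OMP_NUM_THREADS"] = str(n_threads)


# Hypothetical numbers: an 8-core worker that runs 2 training tasks at once.
n = threads_per_task(total_cores=8, concurrent_tasks=2)
limit_native_threads(n)

# The same cap can be passed explicitly to the scikit-learn style estimators,
# e.g. XGBClassifier(n_jobs=n) or LGBMClassifier(n_jobs=n).
model_kwargs = {"n_jobs": n}
```

Passing `n_jobs` explicitly is the more direct option; the environment variable is a fallback for code paths where the estimator arguments are not under your control.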
02/25/2022, 1:01 PM
Sorry for the lag. The tasks do not have any output, in fact. They just log to MLflow after training and are then done. There should be no competition for compute resources. The only problem is memory buildup.
But I have been wondering whether some element of the flow is present on all workers and holds on to some data, and that's why we see no reduction in memory pressure when we run gc.collect().
02/25/2022, 4:00 PM
Have you seen this video? There is also this, which I think was recently merged and might affect this.
02/26/2022, 7:50 AM
Yes, I watched it, but none of the suggestions helped us debug our case. If there is nothing Prefect-specific that you know about, we will have to dig into it ourselves.