yair friedman

05/18/2022, 12:44 PM
Hi, first, thanks for the wonderful package! I have created a basic scikit-learn pipeline flow, and I am using the DaskExecutor as the executor. I wonder whether the scikit-learn algorithms really use Dask to run in a parallelised, distributed manner, or whether only the wrapper tasks are distributed while the actual ML work inside them runs locally. Is it the same as running Dask-ML algorithms?

Anna Geller

05/18/2022, 12:47 PM
Hi, welcome to Prefect! I can only say for sure that your function task gets submitted to a Dask cluster for execution - if there is some specific Dask integration within your package you may need to use Dask with a resource manager e.g.
If you could share your flow code, it might be easier to help.

yair friedman

05/18/2022, 12:51 PM
I have some variation on that uses the DaskExecutor. I wonder if that is enough to distribute scikit-learn's internal computation.

Kevin Kho

05/18/2022, 2:09 PM
That will distribute over Dask, but specifically it is the “compute-bound” portion of dask-ml, not the “memory-bound” one, where you train a model on a Dask DataFrame that is too big for one machine.
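For reference, a minimal sketch of what the compute-bound case can look like outside of Prefect, assuming scikit-learn, joblib, and dask.distributed are installed. The in-process cluster, the toy dataset, and the parameter grid are all made up for illustration, and this is not necessarily the code discussed above. The idea is that the data fits on one machine and only the many independent model fits are fanned out to Dask workers through joblib's Dask backend.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumption: a throwaway in-process cluster for illustration only.
# In practice this would point at the Dask cluster the flow runs on.
client = Client(processes=False, n_workers=2, threads_per_worker=1)

# Small synthetic dataset that fits in memory on a single machine
# (this is what makes the problem compute-bound rather than memory-bound).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hypothetical parameter grid: 4 candidates x 3 folds = 12 independent fits.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,
    n_jobs=-1,
)

# Inside this context, joblib sends the individual fits to the Dask workers
# instead of running them in local processes or threads.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

best_c = search.best_params_["C"]
print("best C:", best_c)

client.close()
```

Without the `joblib.parallel_backend("dask")` context, the same `fit` call would run entirely on whichever single worker executes the task. The memory-bound case is a different setup: there the data itself would be a Dask DataFrame or array and the estimator would come from dask-ml, which this sketch does not cover.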

yair friedman

05/18/2022, 2:50 PM
Got it, thanks!