https://prefect.io logo
Title
z

Zach

01/15/2020, 7:50 PM
I am having trouble understanding exactly at what level Prefect is able to parallelize tasks? I was looking at the Map/Reduce part of the docs for prefect, and saw that you could map a task over a number of inputs in parallel, but what if each of those tasks take up a lot of processing power, then things will be very slow. I want to use prefect as my workflow manager, but I am confused how things get parallelized. If I run prefect hooked up to my kubernetes cluster, do all the workflow steps happen inside of a single container in a single Pod? If I want to make tasks not just only have access to the resources of the Pod that the Prefect flow is running in, do I have to make each task kick off its own kubernetes job? Sorry for all the questions, it just isn't clear to me how to run a workflow on 50 inputs at once, or if that even makes sense (let me know if I should really just be running 50 separate workflows).
d

Dylan

01/15/2020, 8:31 PM
Hi Zach! This is definitely a common use case (computationally intensive tasks in parallel). We have a close integration with Dask for just this reason. If you’re also on Kubernetes, might I suggest taking a look at the DaskKubernetes environment (which I use for our internal ETL): https://docs.prefect.io/cloud/execution/dask_k8s_environment.html You may want to dig a little into Dask itself to understand the philosophy behind the tool and how you might best configure it for your needs: https://docs.dask.org/en/latest/ https://distributed.dask.org/en/latest/
Using kubernetes (and a Prefect Cloud Kubernetes agent) you’ll be able to spin up a Dask cluster of
n
workers, so the cluster can handle
n
tasks in parallel. Dask handles distributing tasks to each worker as they become available
Let me know if you have any questions, happy to talk more specifically about your use case and how mapping + dask might help
j

Joe Schmid

01/16/2020, 3:27 PM
Hi @Zach, while it's not obvious for people less familar with Dask, we chose Prefect in large part because it makes it very easy to run tasks in parallel on a cluster of machines. We run a lot of data science experiments in parallel and might have a Dask cluster with 100 worker nodes in AWS during a Prefect Flow run. There are a lot of good resources for running Dask clusters, especially the dask-kubernetes project that can help get you up and running. If I can answer any questions about our setup, definitely let me know.