
Jerry Thomas

09/20/2019, 5:06 AM
I would like to try running my prefect flow on a dask-kubernetes cluster. I saw that it is possible but I am not able to grasp how. Can you share an example of running a prefect flow on a dask-kubernetes cluster? Should I create a custom worker (pod/docker) for the cluster with the appropriate libraries? How does my prefect flow end up executing on the dask-worker if the worker pod/docker does not contain my code?

Zachary Hughes

09/20/2019, 6:53 PM
Hi Jerry! There are a few questions here, and I'll try to take a stab at each of them. We don't currently have an example of running a Prefect flow on dask-kubernetes, but we'll get that documented ASAP. In the meantime, this page might be useful if you haven't already found it: https://docs.prefect.io/core/tutorials/dask-cluster.html
In this case, the workers just need access to your code's dependencies, not the actual code itself. So you'll need to install Prefect at a minimum, but likely more depending on your code's dependencies. As for how the code gets executed, Prefect takes care of submitting your serialized code to the Dask scheduler. As long as your code's dependencies are there, it should be able to be deserialized and you're off to the races.
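One quick way to sanity-check that is to connect a Dask client and confirm every worker can import prefect before submitting a flow. Rough sketch (the scheduler address is just an example):
from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")

def prefect_version():
    import prefect
    return prefect.__version__

# runs the function on every worker; raises if prefect isn't installed there
print(client.run(prefect_version))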

Joe Schmid

09/20/2019, 8:16 PM
@Jerry Thomas We run Prefect flows on our own dask-kubernetes cluster. Like @Zachary Hughes said, you need to make sure any Python packages that your flow depends on (e.g. prefect) are installed on your Dask workers. Once you've done that, the key part to run the flow on your Dask cluster is:
from prefect.engine.executors import DaskExecutor

# point the executor at the Dask scheduler service inside the cluster
executor = DaskExecutor(address="tcp://dask-scheduler:8786")
flow.run(executor=executor)
where the dask-scheduler hostname will resolve in DNS on your Dask workers. (With dask-kubernetes that should be the default.)
(We run that code in a JupyterLab notebook running in a pod via dask-kubernetes so that it can reach the Dask scheduler via the dask-scheduler hostname.)
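If you'd rather create the cluster from that pod instead of pointing at an already-running scheduler, dask-kubernetes can do that too. Rough, untested sketch (worker-spec.yml is a placeholder pod spec whose image has prefect and your flow's other dependencies installed):
from dask_kubernetes import KubeCluster
from prefect.engine.executors import DaskExecutor

cluster = KubeCluster.from_yaml("worker-spec.yml")
cluster.scale(3)  # start 3 worker pods

# point the executor at the scheduler dask-kubernetes just started
executor = DaskExecutor(address=cluster.scheduler_address)
flow.run(executor=executor)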

Jerry Thomas

09/23/2019, 5:17 AM
Thanks @Zachary Hughes and @Joe Schmid. So this means that as long as I have all my dependencies in a dask-worker pod, it's the same as running the flow on a local Dask cluster. Am I correct in understanding that if I wanted to run a streaming pipeline of flows, I should start a dask-kubernetes cluster and add a separate pod with my custom application that initiates the flows? These flows would then be pushed to the dask-workers via the dask-scheduler in the cluster.
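Roughly what I have in mind for that driver pod, as a sketch (the trigger loop, flow body, and scheduler address are placeholders):
import time

from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

@task
def process_batch():
    pass  # the actual work runs on the dask-workers

with Flow("pipeline") as flow:
    process_batch()

executor = DaskExecutor(address="tcp://dask-scheduler:8786")

# placeholder trigger; could be a queue consumer, a schedule, etc.
while True:
    flow.run(executor=executor)
    time.sleep(60)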