Hi! I’m working on a flow which requires large concurrency (tens of thousands of tasks). Previously I used the `LocalDaskExecutor`, as my concurrency requirements were low, but for this flow I want to use the real `DaskExecutor`.
As far as I’m aware, Prefect can create a temporary Dask cluster for the running flow (using `KubeCluster`), but it can alternatively use an existing deployed Dask cluster (or even Dask Gateway, if I’m not mistaken). Note that I’m running in Kubernetes, so adaptive scaling could be interesting.
Are there any guidelines / suggestions / experiences for using a temporary vs. static Dask cluster? Additionally, if I use `Docker` as storage, how can I supply the correct image tag containing my modules/dependencies when registering the flow?
Anna Geller
12/08/2021, 9:35 AM
@Joël Luijmes Yes, there are! This page discusses the differences. And regarding KubeCluster, check out this blog post.
Anna Geller
12/08/2021, 9:36 AM
but if you have any specific questions or issues, let us know here
Joël Luijmes
12/08/2021, 9:38 AM
Phew, quick reply. Thanks! I had missed that documentation page, but discovered I can access the image using `prefect.context.image` 🤓
Anna Geller
12/08/2021, 9:38 AM
Regarding the image: you can provide it to your run configuration and specify the image from the context in your Cluster class.
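For example, a minimal sketch of that pattern, assuming Prefect 1.x with `KubernetesRun` and `dask_kubernetes` (the registry/image names and worker counts are hypothetical):

```python
import prefect
from prefect import Flow
from prefect.executors import DaskExecutor
from prefect.run_configs import KubernetesRun

def make_cluster(n_workers):
    """Create a temporary Dask cluster that reuses the flow run's image."""
    from dask_kubernetes import KubeCluster, make_pod_spec
    # prefect.context.image is populated at runtime from the run config,
    # so the Dask workers get the same dependencies as the flow run itself.
    pod_spec = make_pod_spec(image=prefect.context.image)
    return KubeCluster(pod_spec, n_workers=n_workers)

flow = Flow("my-flow")  # hypothetical flow name
flow.run_config = KubernetesRun(image="my-registry.example.com/my-flow:2021-12-08")
flow.executor = DaskExecutor(
    cluster_class=make_cluster,
    cluster_kwargs={"n_workers": 4},
    # Optional: let the cluster scale adaptively instead of staying fixed.
    adapt_kwargs={"minimum": 1, "maximum": 50},
)
```

The key design point is that the executor receives a callable (`cluster_class`) rather than an already-created cluster, so the cluster is only spun up at flow runtime, when `prefect.context.image` is available.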
Joël Luijmes
Okay, one more thing: should the image already contain the flow? Or is there some magic where Prefect serializes the flow and copies it to the Dask workers before starting tasks?
Anna Geller
12/08/2021, 9:59 AM
Storage defines where Prefect can find the flow, so this image doesn’t have to contain the flow; it’s only for package dependencies. But if you use Docker storage, then the image does need to contain the flow, either as a pickled flow (the default when using Docker storage) or stored as a script (when you set `stored_as_script=True`).
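To illustrate, a minimal Docker storage setup could look like this (the registry URL, image names, and dependency list are hypothetical; `stored_as_script` switches between the pickled and script-based modes in Prefect 1.x):

```python
from prefect import Flow
from prefect.storage import Docker

flow = Flow("my-flow")  # hypothetical flow name
flow.storage = Docker(
    registry_url="my-registry.example.com",   # where the built image is pushed
    image_name="my-flow",
    image_tag="2021-12-08",
    python_dependencies=["dask-kubernetes"],  # extra packages baked into the image
    # stored_as_script=True,                  # store the flow as a script instead of a pickle
)
```

At registration time Prefect builds and pushes this image with the flow baked in, which is why the image tag you register with is the one that must contain your modules.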