# prefect-community
v
Hi. I was wondering if anyone has experience launching ephemeral Dask clusters with hybrid worker specs, i.e. launching a set of workers with GPU-tagged resources alongside regular compute nodes. Right now we are developing a pipeline which combines some CPU-intensive tasks with GPU tasks. The Dask task affinity tags seem like a key component of this implementation, but I don't think that the DaskKubernetesEnvironment supports workers with different specs. Any suggestions on this front?
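For anyone unfamiliar with the affinity tags mentioned above, here is a minimal sketch in plain Python (no Dask required; worker names and specs are hypothetical) of the idea behind resource-tagged scheduling: a task may only be placed on workers whose declared resources cover the task's requirements.

```python
# Plain-Python illustration of resource-tag affinity; not Dask itself.
# Worker names and resource amounts below are made up for the example.

def eligible_workers(workers, required):
    """Return names of workers whose resources satisfy every requirement."""
    return [
        name
        for name, resources in workers.items()
        if all(resources.get(key, 0) >= amount
               for key, amount in required.items())
    ]

workers = {
    "cpu-worker-1": {"CPU": 4},
    "cpu-worker-2": {"CPU": 4},
    "gpu-worker-1": {"CPU": 2, "GPU": 1},
}

# A GPU-tagged task can only land on the GPU worker...
print(eligible_workers(workers, {"GPU": 1}))  # ['gpu-worker-1']
# ...while plain CPU tasks may run on any of the three workers.
print(eligible_workers(workers, {"CPU": 1}))
```

In real Dask, the equivalent is starting workers with resource tags and passing `resources={'GPU': 1}` when submitting work; the hard part, as discussed below, is getting a cluster manager to launch the mixed worker pool in the first place.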
z
Hi Vincent! I believe there is an issue open for this if you could take a look and see if it aligns with what you need https://github.com/PrefectHQ/prefect/issues/1586 — if so, adding your case to that discussion is the best way to promote it on our roadmap. Of course, some of our users may have additional workarounds and this is a great place to hear from them as well 🙂
j
Note that `dask-kubernetes` doesn't currently support multiple worker types - all workers are assumed to be uniform across the cluster (this is true for all dask cluster managers like `dask-kubernetes`, though not required by the scheduler itself). There's an open issue for this here: https://github.com/dask/distributed/issues/2118. There's not much Prefect can do to work around this; a fix would be needed upstream before we could support this.
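Since the scheduler itself doesn't require uniform workers, one manual workaround (bypassing the cluster managers entirely) is to launch tagged workers by hand. A rough sketch, with placeholder host names:

```shell
# Start the scheduler on one node (host names here are placeholders)
dask-scheduler

# On regular compute nodes: workers tagged with a CPU resource
dask-worker scheduler-host:8786 --resources "CPU=1"

# On a GPU node: a worker tagged with a GPU resource
dask-worker scheduler-host:8786 --resources "GPU=1"
```

Tasks then opt into a worker class via the `resources=` argument, e.g. `client.submit(fn, resources={'GPU': 1})`. What the cluster managers don't automate is creating, scaling, and replacing such a mixed worker pool.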
z
(Also I notice that the github user there is also named Vincent so I presume that was you and that you already knew about that issue — sorry!)
👍 1
j
@Vincent We do this both for GPUs and high memory workers, i.e. launch a Dask cluster with 10 regular workers, 2 GPU, and 1 with higher RAM. We're on AWS and are launching ephemeral Dask clusters on ECS (Fargate or EC2 launch type) using Dask Cloud Provider. DCP doesn't support this right now, but I have the code we're using in a fork of DCP available here: https://github.com/joeschmid/dask-cloudprovider/tree/multiple-worker-types (I need to clean this up and submit a PR, just way behind on that...) We've had a good experience so far, but I will say this is a bit "out there" and experimental, e.g. if a GPU or high-memory worker dies, DCP won't recreate those special workers, if you try to scale up you will only get "regular" workers, etc.
v
Thanks for the reference! Having Fargate and EC2 nodes sounds just like what we are looking for.
j
@Vincent No problem and feel free to ask more questions as you go. You probably already know, but Fargate doesn't currently support GPUs and limits RAM to 30GB so we use EC2 launch type with auto-scaling groups for those. If you look through the commits in my fork of DCP, you'll notice we allow specifying a "capacity provider strategy" per worker type. These map to capacity providers that you configure in ECS, e.g. GPU_Capacity_Provider, etc. It's pretty well documented, but there are a lot of moving parts to configure so don't hesitate to let me know if I can help with anything as you get into it.
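For anyone following along, the "capacity provider strategy per worker type" idea above can be sketched as a plain mapping like the one below. This is a hypothetical illustration: the `capacityProviderStrategy` entries mirror the shape ECS expects when running a task, but the provider names (other than `GPU_Capacity_Provider`, mentioned above) and CPU/memory figures are invented, and the real providers are whatever you configure in your ECS cluster.

```python
# Hypothetical sketch: one ECS capacity provider strategy per worker type.
# Provider names and CPU/memory sizes are illustrative, not prescriptive.
worker_types = {
    "regular": {
        # Plain workers fit within Fargate's limits
        "capacity_provider_strategy": [
            {"capacityProvider": "FARGATE", "weight": 1},
        ],
        "cpu": 4096,      # 4 vCPU
        "memory": 16384,  # 16 GB
    },
    "gpu": {
        # Fargate has no GPU support, so GPU workers use an EC2-backed
        # capacity provider tied to an auto-scaling group of GPU instances
        "capacity_provider_strategy": [
            {"capacityProvider": "GPU_Capacity_Provider", "weight": 1},
        ],
        "cpu": 4096,
        "memory": 30720,
    },
    "high-memory": {
        # RAM above Fargate's ~30 GB ceiling also needs the EC2 launch type
        "capacity_provider_strategy": [
            {"capacityProvider": "HighMem_Capacity_Provider", "weight": 1},
        ],
        "cpu": 8192,
        "memory": 65536,
    },
}

for name, spec in worker_types.items():
    provider = spec["capacity_provider_strategy"][0]["capacityProvider"]
    print(f"{name} workers -> {provider}")
```

The moving parts Joe mentions are then: defining each capacity provider in ECS, attaching the EC2-backed ones to auto-scaling groups, and having the cluster launcher pick the right strategy per worker type.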
🙏 1