# prefect-community
Julien Allard
Hello all! I have a question about scaling to Kubernetes + Dask on GCP. I am trying to run a flow that uses pandas for data transformation and needs more memory than can be allocated on a single node of our current cluster configuration. Here are some of the ideas I had to fix the problem:
1. Parallelize the flow using Dask DataFrame, but that seems like a lot of effort (a rough sketch of this option follows below).
2. Vertically scale the nodes, but then we would pay for more compute than we need most of the time.
3. Create a preemptible node pool where the flow can be executed.

I feel like the third option could be a good solution, but I am not sure how to allocate a preemptible node pool at the start of a flow and then execute the flow on the new pool, since the job or Dask cluster has already been deployed on the other nodes. So I guess my question is: what is the best way to achieve what I'm trying to do without having to vertically scale the cluster? Thank you!
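A minimal sketch of option 1, assuming the transformation is a pandas-style aggregation; the bucket path and column names here are made up, and reading `gs://` paths assumes `gcsfs` is installed:

```python
import dask.dataframe as dd

# read_csv builds a partitioned, out-of-core dataframe instead of
# loading everything into one node's memory (hypothetical bucket/files)
df = dd.read_csv("gs://my-bucket/events-*.csv")

# the same pandas-style API, executed per partition across Dask workers
result = df.groupby("user_id")["amount"].sum()

# nothing runs until compute(); the result comes back as a pandas object
print(result.compute())
```

For many transformations the change really is this small, though operations that need the whole dataset at once (e.g. some sorts or wide merges) take more care.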
Kyle Moon-Wright
Hello @Julien Allard, With your data load it feels like your best option is to increase your cluster capacity manually or with an autoscaler, though I can't speak to the Dask DataFrame option. Would love to hear from others on their suggestions and experiences.
Julien Allard
@Kyle Moon-Wright Thanks for giving your input! When you say increasing the cluster capacity, do you mean adding more nodes or increasing the node compute power?
Update: I decided to go with an autoscaling node pool of preemptible machines with the required compute power. When no flows are running, the node pool autoscales to zero, making it a very cost-effective solution. For more details, see this article: https://medium.com/google-cloud/scale-your-kubernetes-cluster-to-almost-zero-with-gke-autoscaler-9c78051cbf40
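A minimal sketch of how a flow could target such a pool, assuming Prefect's `KubernetesRun` run config and a GKE node pool tainted so other workloads stay off it; the pool name `highmem-preemptible`, the taint `dedicated=flows:NoSchedule`, and the gcloud flags in the comment are illustrative assumptions, not from the thread:

```python
# Hypothetical sketch: pin a flow's Kubernetes job to an autoscaled
# preemptible pool. Assumes the pool was created with something like:
#   gcloud container node-pools create highmem-preemptible \
#     --preemptible --enable-autoscaling --min-nodes=0 --max-nodes=3 \
#     --node-taints=dedicated=flows:NoSchedule
from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def transform():
    ...  # the memory-heavy pandas work

with Flow("memory-heavy-flow") as flow:
    transform()

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                # schedule only onto the preemptible pool...
                "nodeSelector": {
                    "cloud.google.com/gke-nodepool": "highmem-preemptible"
                },
                # ...and tolerate the taint that keeps other pods off it,
                # so the autoscaler can drop the pool to zero when idle
                "tolerations": [{
                    "key": "dedicated",
                    "operator": "Equal",
                    "value": "flows",
                    "effect": "NoSchedule",
                }],
            }
        }
    },
}

flow.run_config = KubernetesRun(job_template=job_template)
```

When a run is scheduled, the pending pod's node selector prompts the GKE autoscaler to bring up a preemptible node; once the job finishes and the node sits idle, the pool scales back down to zero.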
upvote 1
Kyle Moon-Wright
That's awesome! Thanks for sharing.
s
@Julien Allard we've been working with Dask DataFrames, and at least so far it hasn't been much different from working with a pandas DataFrame, except that you get lazy evaluation and parallelization.
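A small illustration of that laziness, with made-up data: operations only build a task graph, and nothing executes until `compute()` is called.

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(1_000_000), "y": 1})
df = dd.from_pandas(pdf, npartitions=8)

# these calls return instantly; they only extend the task graph
lazy_mean = (df["x"] * df["y"]).mean()

# execution happens here, one partition at a time, in parallel
print(lazy_mean.compute())
```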
👍 1
j
That's good to hear! We will probably switch to Dask DataFrames eventually.
👍 1