# prefect-community
Julien Allard
Hello all! I have a question about scaling to Kubernetes + Dask on GCP. I am trying to run a flow that uses pandas for data transformation and needs more memory than can be allocated on a single node of our current cluster configuration. Here are some of the ideas I had to fix the problem:
1. Parallelize the flow using Dask DataFrame, but that seems like a lot of effort (a rough sketch of this option follows below).
2. Vertically scale the nodes, but then we would pay for more compute than we need most of the time.
3. Create a preemptible node pool where the flow can be executed.

I feel like the third option could be a good solution, but I am not sure how to allocate a preemptible node pool at the start of a flow and then execute the flow on the new pool, since the job or Dask cluster has already been deployed on the other nodes. So I guess my question is: what is the best way to achieve what I'm trying to do without having to vertically scale the cluster? Thank you!
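A minimal sketch of option 1, assuming the transformation is a pandas-style aggregation; the bucket path and column names here are made up, and reading `gs://` paths assumes `gcsfs` is installed:

```python
import dask.dataframe as dd

# read_csv builds a partitioned, out-of-core dataframe instead of
# loading everything into one node's memory (hypothetical bucket/files)
df = dd.read_csv("gs://my-bucket/events-*.csv")

# the same pandas-style API, executed per partition across Dask workers
result = df.groupby("user_id")["amount"].sum()

# nothing runs until compute(); the result comes back as a pandas object
print(result.compute())
```

For many transformations the change really is this small, though operations that need the whole dataset at once (e.g. some sorts or wide merges) take more care.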
Kyle Moon-Wright
Hello @Julien Allard, With your data load it feels like your best option is to increase your cluster capacity manually or with an autoscaler, though I can't speak to the Dask DataFrame option. Would love to hear from others on their suggestions and experiences.
Julien Allard
@Kyle Moon-Wright Thanks for giving your input! When you say increasing the cluster capacity, do you mean adding more nodes or increasing the node compute power?
Update: I decided to go with an autoscaling node pool of preemptible machines with the required compute power. When no flows are running, the node pool autoscales to zero, making it a very cost-effective solution. For more details, see this article: https://medium.com/google-cloud/scale-your-kubernetes-cluster-to-almost-zero-with-gke-autoscaler-9c78051cbf40
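A minimal sketch of how a flow could target such a pool, assuming Prefect's `KubernetesRun` run config and a GKE node pool tainted so other workloads stay off it; the pool name `highmem-preemptible`, the taint `dedicated=flows:NoSchedule`, and the gcloud flags in the comment are illustrative assumptions, not from the thread:

```python
# Hypothetical sketch: pin a flow's Kubernetes job to an autoscaled
# preemptible pool. Assumes the pool was created with something like:
#   gcloud container node-pools create highmem-preemptible \
#     --preemptible --enable-autoscaling --min-nodes=0 --max-nodes=3 \
#     --node-taints=dedicated=flows:NoSchedule
from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def transform():
    ...  # the memory-heavy pandas work

with Flow("memory-heavy-flow") as flow:
    transform()

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                # schedule only onto the preemptible pool...
                "nodeSelector": {
                    "cloud.google.com/gke-nodepool": "highmem-preemptible"
                },
                # ...and tolerate the taint that keeps other pods off it,
                # so the autoscaler can drop the pool to zero when idle
                "tolerations": [{
                    "key": "dedicated",
                    "operator": "Equal",
                    "value": "flows",
                    "effect": "NoSchedule",
                }],
            }
        }
    },
}

flow.run_config = KubernetesRun(job_template=job_template)
```

When a run is scheduled, the pending pod's node selector prompts the GKE autoscaler to bring up a preemptible node; once the job finishes and the node sits idle, the pool scales back down to zero.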
upvote 1
Kyle Moon-Wright
That's awesome! Thanks for sharing.
s
@Julien Allard we've been working with Dask DataFrames, and at least so far it hasn't been much different from working with a pandas DataFrame, except that you get lazy evaluation and parallelization.
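A small illustration of that laziness, with made-up data: operations only build a task graph, and nothing executes until `compute()` is called.

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(1_000_000), "y": 1})
df = dd.from_pandas(pdf, npartitions=8)

# these calls return instantly; they only extend the task graph
lazy_mean = (df["x"] * df["y"]).mean()

# execution happens here, one partition at a time, in parallel
print(lazy_mean.compute())
```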
👍 1
j
That's good to hear! We will probably switch to Dask DataFrames eventually.
👍 1