Hello all! I have a question about scaling with Kubernetes + Dask on GCP. I am trying to run a flow that uses pandas for data transformation and requires more memory than can be allocated on a single node of our current cluster configuration.
Here are some of the ideas I had to address the problem:
1. I could parallelize my flow using Dask DataFrame, but that seems like a lot of effort (rough sketch of what I mean below, after this list).
2. We could vertically scale the nodes, but then we would be paying for more compute than we need most of the time.
3. Create a preemptible node pool where the flow can then be executed.
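For context on option 1, this is roughly the kind of rewrite I'd be looking at; the file path and column names are just placeholders:

```python
import dask.dataframe as dd

# Read the data as a Dask DataFrame instead of a single pandas DataFrame,
# so it is split into partitions that each fit in a worker's memory.
df = dd.read_parquet("gs://my-bucket/events/*.parquet")  # placeholder path

# The same pandas-style transformations, but built lazily and run in parallel.
daily_totals = (
    df[df["amount"] > 0]
    .groupby("customer_id")["amount"]
    .sum()
)

# compute() executes the graph across the Dask workers and returns
# a regular pandas object with the (much smaller) result.
result = daily_totals.compute()
```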
I feel like the 3rd option could be a good solution, but I am not sure how it would work to allocate a preemptible node pool at the start of a flow and then execute the flow on that new node pool, since we have already deployed the job / Dask cluster on the other nodes.
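To make the question more concrete, this is roughly what I was imagining for option 3, assuming the classic dask-kubernetes `KubeCluster` API; the node pool name, taint, and resource sizes are placeholders, and the `extra_pod_config` keys are my guess at how to pin workers to the new pool:

```python
from dask_kubernetes import KubeCluster, make_pod_spec

# Worker pod template pinned to the preemptible node pool.
# "highmem-preemptible" is a placeholder pool name; GKE labels every node
# with cloud.google.com/gke-nodepool=<pool name>.
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    memory_limit="24G",
    memory_request="24G",
    cpu_limit=4,
    cpu_request=4,
    extra_pod_config={
        "nodeSelector": {"cloud.google.com/gke-nodepool": "highmem-preemptible"},
        # Only needed if the pool is created with a matching taint so that
        # other workloads stay off it.
        "tolerations": [
            {
                "key": "preemptible",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        ],
    },
)

cluster = KubeCluster(pod_spec)
cluster.adapt(minimum=0, maximum=10)  # scale workers up only while the flow runs
```

But that still assumes the node pool already exists, which is really the part I'm unsure about: whether the flow itself can create the pool and then land its workers there, or whether the pool has to be provisioned ahead of time.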
So I guess my question is: what is the best way to achieve what I'm trying to do without having to vertically scale the cluster?
Thank you!