
Egil Bugge

05/03/2022, 7:18 PM
Hey all! I've been playing around with setting up a Kubernetes agent in Google Kubernetes Engine which can spin up an ephemeral Dask cluster on demand. This all seems to work rather smoothly (thanks to the amazing work done by the Prefect team and others), but I'm having some issues getting the autoscaler to remove the nodes after the flow has run. I get the following error messages on my Kubernetes cluster after my flow has run:
"Pod is blocking scale down because it’s not backed by a controller"
"Pod is blocking scale down because it doesn’t have enough Pod Disruption Budget (PDB)"
I'm pretty inexperienced with Kubernetes, so I was wondering if anyone has any pointers on how I might configure the KubeCluster so that it works with autoscaling? We're thinking of using the cluster to hyperparameter-tune a model. We do not use Kubernetes for anything else and have no need for the resources between training runs, so getting the node pool to autoscale down to zero (the agent will stay in a different node pool) would save us some money. My run code is below:
import prefect
from prefect import Flow, task
from prefect.storage import Docker
from prefect.run_configs import KubernetesRun
from prefect.executors import DaskExecutor

from dask_kubernetes import KubeCluster, make_pod_spec

PROJECT_NAME = ...
FLOW_NAME = ...
IMAGE_TAG = ...

storage = Docker(
    registry_url=f"eu.gcr.io/{PROJECT_NAME}",
    image_name=FLOW_NAME,
    image_tag=IMAGE_TAG,
    dockerfile="Dockerfile",
)

run_config = KubernetesRun(
    image=f"eu.gcr.io/{PROJECT_NAME}/{FLOW_NAME}:{IMAGE_TAG}",
    labels=["dask"],
)

executor = DaskExecutor(
    cluster_class=lambda: KubeCluster(
        make_pod_spec(
            image=prefect.context.image,
            extra_pod_config={
                "nodeSelector": {"cloud.google.com/gke-nodepool": "worker-pool"}
            },
        )
    ),
    adapt_kwargs={"minimum": 1, "maximum": 3},
)

with Flow(
    name=FLOW_NAME,
    storage=storage,
    executor=executor,
    run_config=run_config,
) as flow:
    # tasks here
    ...

Anna Geller

05/03/2022, 9:19 PM
I'm having some issue getting the autoscaler to remove the nodes
Generally speaking, this is a Kubernetes issue, not a Prefect issue, and I'm no DevOps expert, but let me try to help anyway.
1. You could try GKE Autopilot if you don't want to deal with this type of issue; it would take the burden of autoscaling off you.
2. If you want to make Cluster Autoscaler work on GKE, you have to create Pod Disruption Budgets (PDBs) with the proper information. How to create them is described in "How to set PDBs to enable CA to move kube-system pods?"; more on that can be found here.
From the Prefect perspective, your flow is fine.
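For illustration, here is a minimal sketch of the two Kubernetes-side pieces that the error messages point at. It is not a tested fix for this cluster. It assumes that the installed dask-kubernetes version's make_pod_spec accepts an annotations argument, that the blocking pods can be selected by a label such as k8s-app (the label key and the example names below are hypothetical), and that the cluster serves the policy/v1 API. The safe-to-evict annotation is the Cluster Autoscaler's documented override for pods that are not backed by a controller.

```python
import json

# Annotation read by the Kubernetes Cluster Autoscaler.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def worker_pod_options(node_pool: str) -> dict:
    """Keyword arguments meant to be splatted into make_pod_spec(...).

    Assumption: make_pod_spec accepts `annotations` next to the
    `extra_pod_config` argument already used in the flow above.
    """
    return {
        "annotations": {SAFE_TO_EVICT: "true"},
        "extra_pod_config": {
            "nodeSelector": {"cloud.google.com/gke-nodepool": node_pool},
        },
    }


def pod_disruption_budget(
    name: str,
    label_value: str,
    label_key: str = "k8s-app",  # hypothetical; check the labels on the blocking pod
    namespace: str = "kube-system",
    max_unavailable: int = 1,
) -> dict:
    """Build a PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {label_key: label_value}},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so this output can be saved and applied.
    print(json.dumps(pod_disruption_budget("kube-dns-pdb", "kube-dns"), indent=2))
```

Under those assumptions, the worker options would be used as make_pod_spec(image=prefect.context.image, **worker_pod_options("worker-pool")), and the printed manifest could be applied with kubectl apply -f once the selector matches the pod named in the autoscaler message.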

Egil Bugge

05/04/2022, 7:01 AM
Thank you so much for looking into it @Anna Geller! You might be right that using Autopilot would be the easiest, at least for testing. I'm a bit unclear on where I would specify the resource demands for the Dask workers (we want those sweet GPUs, obviously), but I can try digging a bit more into the dask-kubernetes library! I understand this is a bit "off-topic"; I was just hoping someone here had solved something similar 🙂
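On the resource question, a rough sketch of what the worker container and pod would need to declare, written as plain Kubernetes fragments. The nvidia.com/gpu resource name and the cloud.google.com/gke-accelerator node label are standard on GKE; the helper names, the accelerator type, and the sizes are hypothetical. How the fragments get attached (for example through make_pod_spec's CPU and memory arguments, or a hand-written pod template) depends on the dask-kubernetes version and is left as an assumption.

```python
def gpu_worker_resources(gpus: int, cpu: str, memory: str) -> dict:
    """Container-level `resources` block for one Dask worker.

    GPUs are declared under `limits`; Kubernetes uses the limit as the
    request for extended resources such as nvidia.com/gpu.
    """
    return {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory, "nvidia.com/gpu": gpus},
    }


def gpu_node_selector(node_pool: str, accelerator: str) -> dict:
    """Pod-level nodeSelector that pins workers to a GPU node pool."""
    return {
        "cloud.google.com/gke-nodepool": node_pool,
        "cloud.google.com/gke-accelerator": accelerator,
    }


if __name__ == "__main__":
    # Hypothetical sizes and accelerator type, for illustration only.
    print(gpu_worker_resources(gpus=1, cpu="4", memory="16Gi"))
    print(gpu_node_selector("worker-pool", "nvidia-tesla-t4"))
```

The node selector dict has the same shape as the nodeSelector value already passed through extra_pod_config in the flow code, so it could replace that value directly.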