ciaran

    ciaran

    1 year ago
    Hey folks! Does anyone have an example repo they could share that's handling the deployment of a Prefect Agent with AKS? Currently starting off with spinning up a cluster with Terraform, but my k8s skills are sub-par so some useful starters would be handy. For reference i've CDK'd up a Prefect Agent & respective cluster in ECS on AWS before, but that wasn't involving k8s.
    Tyler Wanner

    Tyler Wanner

    1 year ago
    Hiya! Did you check out the cli command? I’ll grab the doc. I’ve also used terraform to deploy agents myself but prefect agent kubernetes install will generate a yaml manifest for you
    ciaran

    ciaran

    1 year ago
    Hey @Tyler Wanner yeah I got to the CLI part and then wanted to see if anyone had done it in an IaC manner - is there a certain Terraform provider that can run that manifest?
    Tyler Wanner

    Tyler Wanner

    1 year ago
    there’s not a fully supported terraform provider for handling raw yaml at the moment but I can share with you an example agent terraform config if you’d like
    ciaran

    ciaran

    1 year ago
    If you could that'd be amazing
    I'm out of my depth in Azure and k8s 🤣 AWS is my happy place
    Tyler Wanner

    Tyler Wanner

    1 year ago
    well with Prefect and AKS you shouldn't need to worry too too much about k8s to get going!
    this isn't a "supported" install pattern so no guarantees but here's an all-in example that will create a prefect agent and the rbac as if you used prefect agent kubernetes install:
    resource "kubernetes_namespace" "ci" {
      metadata {
        name = "prefect"
      }
    }
    
    resource "kubernetes_role" "prefect_agent" {
      metadata {
        name      = "prefect-agent"
        namespace = kubernetes_namespace.ci.metadata[0].name
      }
    
      rule {
        api_groups = ["batch", "extensions"]
        resources  = ["jobs"]
        verbs      = ["*"]
      }
      rule {
        api_groups = [""]
        resources  = ["events", "pods"]
        verbs      = ["*"]
      }
    }
    
    resource "kubernetes_role_binding" "prefect_agent" {
      metadata {
        name      = "prefect-agent"
        namespace = kubernetes_namespace.ci.metadata[0].name
      }
    
      role_ref {
        api_group = "<http://rbac.authorization.k8s.io|rbac.authorization.k8s.io>"
        kind      = "Role"
        name      = kubernetes_role.prefect_agent.metadata[0].name
      }
    
      subject {
        kind      = "ServiceAccount"
        name      = kubernetes_service_account.agent.metadata[0].name
        namespace = kubernetes_namespace.ci.metadata[0].name
      }
    }
    
    resource "kubernetes_service_account" "agent" {
      metadata {
        name      = "agent"
        namespace = kubernetes_namespace.ci.metadata[0].name
      }
    }
    
    
    
    resource "kubernetes_deployment" "deployment" {
      metadata {
        name      = var.app
        namespace = kubernetes_namespace.ci.metadata[0].name
      }
    
      spec {
        replicas = "1"
    
        selector {
          match_labels = {
            app = var.app
          }
        }
    
        template {
          metadata {
            labels = {
              app = var.app
            }
          }
    
          spec {
            service_account_name            = kubernetes_service_account.agent.metadata[0].name
            automount_service_account_token = true
    
            container {
              args    = ["prefect agent kubernetes start"]
              command = ["/bin/bash", "-c"]
    
              env {
                name  = "PREFECT__CLOUD__AGENT__AUTH_TOKEN"
                value = var.auth_token
              }
    
              env {
                name  = "PREFECT__CLOUD__AGENT__AGENT_ADDRESS"
                value = "http://:8080"
              }
    
              env {
                name  = "NAMESPACE"
                value = kubernetes_namespace.ci.metadata[0].name
              }
    
              env {
                name  = "PREFECT__CLOUD__AGENT__LABELS"
                value = "['foo']"
              }
    
              dynamic "env" {
                for_each = var.env_vars
                content {
                  name  = env.key
                  value = env.value
                }
              }
    
              image             = "prefecthq/prefect:${var.prefect_version}"
              name              = var.app
              image_pull_policy = "Always"
    
              liveness_probe {
                http_get {
                  path = "/api/health"
                  port = 8080
                }
    
                failure_threshold     = 2
                initial_delay_seconds = 40
                period_seconds        = 40
              }
    
              resources {
                limits {
                  cpu    = "500m"
                  memory = "128Mi"
                }
              }
            }
          }
        }
      }
    }
    
    variable "auth_token" {}
    variable "app" { default = "prefect-agent" }
    variable "prefect_version" { default = "latest" }
    variable "env_vars" { 
        type = map 
        default = null
    }
    mind the agent configuration, especially the addition of the "foo" label (which will probably not pick up any of your flows, unless that label is present on the flow/flow run)
    you'll need to supply an auth_token, for which you'll want to use a Prefect Cloud service account API key
    Also I do believe we've removed the resources block from the generic install template. It's best to set them at a proper level, but you may just want to remove them to get started
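    if you do keep them, here's a purely illustrative shape for a fuller resources block on the 1.x provider (the values are placeholders, not recommendations):
    resources {
      requests {
        cpu    = "100m"
        memory = "128Mi"
      }
      limits {
        cpu    = "500m"
        memory = "512Mi"
      }
    }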
    ciaran

    ciaran

    1 year ago
    Cool thanks for this! I'll take a look!
    Tyler Wanner

    Tyler Wanner

    1 year ago
    let me know how it goes!
    btw I left out the k8s provider configuration... for that, reference the provider docs directly, as your interaction pattern will determine how you set that up https://registry.terraform.io/providers/hashicorp/kubernetes/1.11.0/docs
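    purely as an illustration (not an official snippet), a minimal provider block that just picks up your local kubeconfig with the 1.x provider could look like:
    provider "kubernetes" {
      load_config_file = true
      config_path      = "~/.kube/config"
      config_context   = "my-aks-context" # placeholder context name
    }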
    ciaran

    ciaran

    1 year ago
    So I guess the alternative to your TF example is deploying the AKS cluster with TF then just using kubectl to apply that manifest?
    @Tyler Wanner off-topic, but I'm pretty sure I watched a demo you did on Youtube earlier this week 🤣
    Tyler Wanner

    Tyler Wanner

    1 year ago
    yep @ciaran the easiest way to deploy the agent is
    prefect agent kubernetes install --rbac --namespace NAMESPACE -t TOKEN | kubectl apply -n NAMESPACE -f -
    personally, i'm a big fan of declarative infrastructure code so I use a mixture of both to manage my k8s prefect agents
    ciaran

    ciaran

    1 year ago
    Awesome thanks, appreciate the help! Yeah coming from CDK I'm definitely leaning towards doing this in Terraform
    Just wanted to get my head around what that's doing under the hood too
    Tyler Wanner

    Tyler Wanner

    1 year ago
    well then you're asking all the right questions 👍
    ciaran

    ciaran

    1 year ago
    So, slightly dim question, your TF example, how does it get placed into the AKS cluster? I don't see a reference to a cluster
    Tyler Wanner

    Tyler Wanner

    1 year ago
    that's part of the kubernetes provider configuration
    you can either inherit your local kube context or set up a link to a particular cluster
    ciaran

    ciaran

    1 year ago
    Interesting. Ah I see so actually the deployment of my AKS cluster is separate to my Prefect Agent setup? I'll need two terraform 'projects' kind of
    Tyler Wanner

    Tyler Wanner

    1 year ago
    in my experience, yes, but I'm not sure that's 100% true
    If the provider configuration is dependent upon that cluster's state (it is the way I do it) or existence, then you'll be much better off separating them
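    one illustrative way to wire the agent project's provider to an existing AKS cluster (the cluster and resource group names here are placeholders) is to read its credentials through a data source:
    data "azurerm_kubernetes_cluster" "this" {
      name                = "my-aks-cluster"    # placeholder
      resource_group_name = "my-resource-group" # placeholder
    }

    provider "kubernetes" {
      load_config_file       = false
      host                   = data.azurerm_kubernetes_cluster.this.kube_config[0].host
      client_certificate     = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].client_certificate)
      client_key             = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].client_key)
      cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.this.kube_config[0].cluster_ca_certificate)
    }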
    ciaran

    ciaran

    1 year ago
    Whelp beat me to it.
    Eurgh. Forgot how chicken-and-egg Terraform was. CloudFormation spoils me
    So @Tyler Wanner I'm trying to run:
    prefect agent kubernetes install -t "<token>" --rbac -n "pangeo-forge-azure-bakery" -l "ciaran-dev" | kubectl apply -f --namespace=pangeo-forge-azure-bakery -
    Based on https://docs.prefect.io/orchestration/agents/kubernetes.html#running-in-cluster But I'm getting:
    error: Unexpected args: [-]
    If I remove the - I instead get:
    error: the path "--namespace=pangeo-forge-azure-bakery" does not exist
    Tyler Wanner

    Tyler Wanner

    1 year ago
    the - needs to be an argument passed to -f
    can u try moving -f to after --namespace?
    if that works then we’ll need to fix the docs
    ciaran

    ciaran

    1 year ago
    Warning: rbac.authorization.k8s.io/v1beta1 RoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 RoleBinding
    Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
    Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
    Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
    Bit further haha
    Tyler Wanner

    Tyler Wanner

    1 year ago
    did you create the namespace?
    ciaran

    ciaran

    1 year ago
    Oh, does it need to be created beforehand?
    I assumed this did it also
    Tyler Wanner

    Tyler Wanner

    1 year ago
    if you’re plopping into a non-default namespace you’ll have to create it yea
    ciaran

    ciaran

    1 year ago
    TIL
    Tyler Wanner

    Tyler Wanner

    1 year ago
    this is because the prefect CLI doesn’t know anything about your kubernetes environment it’s just generating a yaml manifest
    so we can’t assume to create one
    ciaran

    ciaran

    1 year ago
    That's fair enough. Annoyingly it looks like azurerm_kubernetes_cluster doesn't offer that option.
    So kubernetes provider/kubectl it is
    Tyler Wanner

    Tyler Wanner

    1 year ago
    kubectl create namespace NAMESPACE will do it
    the cluster resource will not provide you with an interface for configuring namespaces, surely
    ciaran

    ciaran

    1 year ago
    🤷 Probably obvious but k8s is new to me haha
    Tyler Wanner

    Tyler Wanner

    1 year ago
    not much about k8s is “obvious” but once u get over the learning curve, it’s great magic
    fortunately you won’t need to know much more than this for your prefect agent to be able to make good use of it but surely you will learn through debugging things that arise inevitably
    ciaran

    ciaran

    1 year ago
    :party-parrot: Wahahay
    Tyler Wanner

    Tyler Wanner

    1 year ago
    happy k8sing!
    ciaran

    ciaran

    1 year ago
    Many thanks! Really appreciate it
    Shall I raise an issue about the docs?
    Tyler Wanner

    Tyler Wanner

    1 year ago
    you may if you’d like, and I’ll take care of it
    or feel free to submit a PR if you’d like to contribute
    ciaran

    ciaran

    1 year ago
    👍 Oh when in Rome.
    I'll raise a PR
    Tyler Wanner

    Tyler Wanner

    1 year ago
    awesome thanks for that @ciaran 🙏
    ciaran

    ciaran

    1 year ago
    Hey @Tyler Wanner sorry for reviving this, I've been trying run a very simple flow and I'm hitting this:
    (403)
    Reason: Forbidden
    HTTP response headers: HTTPHeaderDict({'Audit-Id': 'af45fea8-a5be-4d4a-a50c-fc8875a83144', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 23 Apr 2021 12:24:44 GMT', 'Content-Length': '329'})
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
    The yaml I'm applying to the cluster looks like:
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: prefect-agent
      name: prefect-agent
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prefect-agent
      template:
        metadata:
          labels:
            app: prefect-agent
        spec:
          containers:
          - args:
            - prefect agent kubernetes start
            command:
            - /bin/bash
            - -c
            env:
            - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
              value: ${PREFECT__CLOUD__AGENT__AUTH_TOKEN}
            - name: PREFECT__CLOUD__API
              value: https://api.prefect.io
            - name: NAMESPACE
              value: ${BAKERY_NAMESPACE}
            - name: IMAGE_PULL_SECRETS
              value: ''
            - name: PREFECT__CLOUD__AGENT__LABELS
              value: '${PREFECT__CLOUD__AGENT__LABELS}'
            - name: JOB_MEM_REQUEST
              value: ''
            - name: JOB_MEM_LIMIT
              value: ''
            - name: JOB_CPU_REQUEST
              value: ''
            - name: JOB_CPU_LIMIT
              value: ''
            - name: IMAGE_PULL_POLICY
              value: ''
            - name: SERVICE_ACCOUNT_NAME
              value: ''
            - name: PREFECT__BACKEND
              value: cloud
            - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
              value: http://:8080
            image: prefecthq/prefect:0.14.16-python3.8
            imagePullPolicy: Always
            livenessProbe:
              failureThreshold: 2
              httpGet:
                path: /api/health
                port: 8080
              initialDelaySeconds: 40
              periodSeconds: 40
            name: agent
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: prefect-agent-rbac
      namespace: default
    rules:
    - apiGroups:
      - batch
      - extensions
      resources:
      - jobs
      verbs:
      - '*'
    - apiGroups:
      - ''
      resources:
      - events
      - pods
      verbs:
      - '*'
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: RoleBinding
    metadata:
      name: prefect-agent-rbac
      namespace: default
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: prefect-agent-rbac
    subjects:
    - kind: ServiceAccount
      name: default
    And my KubernetesRun config looks like:
    run_config=KubernetesRun(
            image="prefecthq/prefect:0.14.16-python3.8",
            labels=json.loads(os.environ["PREFECT__CLOUD__AGENT__LABELS"]),
        ),
    Any ideas? Appreciate it!
    Tyler Wanner

    Tyler Wanner

    1 year ago
    ah yep that's a missing RBAC permission
    "Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"pangeo-forge-azure-bakery\""
    ^^ this is saying that something in the default namespace, with the serviceaccount default, cannot create jobs in the namespace pangeo-forge-azure-bakery
    is this log in the agent?
    ciaran

    ciaran

    1 year ago
    I think I've not templated the namespace name correctly.
    Tyler Wanner

    Tyler Wanner

    1 year ago
    that's certainly the right vein--let me check out the full file real quick
    it looks like your agent deployment is still in the default namespace--did you mean to run it in your bakery namespace with the jobs it creates?
    ciaran

    ciaran

    1 year ago
    Yeah I'll be honest I think what I've done is used prefect to generate that yaml file, then not fully templated out the bakery namespace.
    Tyler Wanner

    Tyler Wanner

    1 year ago
    as long as you apply with -n NAMESPACE you should be ok
    otherwise, just add namespace: pangeo-forge-azure-bakery on line 8
    ciaran

    ciaran

    1 year ago
    Inside the spec section?
    Tyler Wanner

    Tyler Wanner

    1 year ago
    no sorry, between line 7 and 8, same as in the other resources--in metadata
    ciaran

    ciaran

    1 year ago
    gotcha
    Tyler Wanner

    Tyler Wanner

    1 year ago
    your original file before that PR you linked should work if you just specify -n NAMESPACE in the kubectl apply command btw
    but now you're explicitly stating that in the configuration
    ciaran

    ciaran

    1 year ago
    Ahh okay. Yeah I was hoping to get the overall template and then folks can dynamically set it. Mainly because there could be multiple deployments etc.
    Tyler Wanner

    Tyler Wanner

    1 year ago
    OK then don't make these changes, and just change the apply command
    ciaran

    ciaran

    1 year ago
    So, progress.
    Failed to load and execute Flow's environment: AttributeError("'NoneType' object has no attribute 'rstrip'")
    Tyler Wanner

    Tyler Wanner

    1 year ago
    hmm i'm not sure on that one and i'll have to step away a while but glad to get past that one
    ciaran

    ciaran

    1 year ago
    Okay no problem thanks for the help so far!
    Tyler Wanner

    Tyler Wanner

    1 year ago
    can you check out the version of prefect that built the flow and the version of prefect in the flow storage image?
    it sounds like a version mismatch
    ciaran

    ciaran

    1 year ago
    Both the agent and flow are set to use prefecthq/prefect:0.14.16-python3.8 and my local install is also 0.14.16 on Python 3.8.6
    Tyler Wanner

    Tyler Wanner

    1 year ago
    paging @Kevin Kho
    ciaran

    ciaran

    1 year ago
    The only thing that may be different is I installed prefect[azure, kubernetes] locally...
    Kevin Kho

    Kevin Kho

    1 year ago
    Hi @ciaran! How big is your flow code?
    Do you think you can share it?
    Kevin Kho

    Kevin Kho

    1 year ago
    How did you register this flow? Running the Python script?
    ciaran

    ciaran

    1 year ago
    Yep, just running it with python
    Kevin Kho

    Kevin Kho

    1 year ago
    My best advice is to try GitHub storage first to determine if it’s a storage problem. I am also wondering if your labels come out as a list of strings. Could you check that?
    ciaran

    ciaran

    1 year ago
    It's available in the agent under the same env var name
    (I had found that thread 😅 )
    I'll double check my labels.
    Kevin Kho

    Kevin Kho

    1 year ago
    Github storage will help us check if it’s an Azure storage issue
    ciaran

    ciaran

    1 year ago
    >>> import json
    >>> import os
    >>> json.loads(os.environ["PREFECT__CLOUD__AGENT__LABELS"])
    ['ciarandev']
    Tried it in the python interpreter, looks like a list.
    Kevin Kho

    Kevin Kho

    1 year ago
    that looks good
    ciaran

    ciaran

    1 year ago
    I might have to try the Github storage monday. Finishing up for the day here 😅 Thanks for the pointers though, I'll push with Github storage first thing monday/this weekend if I get a chance and let you know!
    Kevin Kho

    Kevin Kho

    1 year ago
    Sure thing!
    ciaran

    ciaran

    1 year ago
    This was much more in my comfort zone on AWS and ECS 🤣 Feel like Azure & k8s is polar opposite lol
    So just about to try this out, I'm assuming that access_token_secret in https://docs.prefect.io/api/latest/storage.html#github is only necessary for private repositories?
    Gave it a go without the access_token_secret. I get:
    Which makes sense as this value would be serialised on a non-repo storage target. So looks like it can at least download and attempt to run the flow via Github
    So I'm guessing this is pointing us to Azure Storage being an issue
    Kevin Kho

    Kevin Kho

    1 year ago
    Hey @ciaran, that does seem to be the case. I would make sure that the values are working. I wanna mention that I literally handed someone your aws recipes code for them to understand the ECS task definition.
    ciaran

    ciaran

    1 year ago
    Haha oh goodness. I hope they found it handy! By making sure the values are working, what do you mean?
    Kevin Kho

    Kevin Kho

    1 year ago
    That the storage connection string works outside of Prefect
    ciaran

    ciaran

    1 year ago
    Ah right. Got you
    Kevin Kho

    Kevin Kho

    1 year ago
    I’ll be able to try your flow later today to help diagnose.
    ciaran

    ciaran

    1 year ago
    Okay, for reference I ran:
    >>> con_string = "<the con string>"
    >>> from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
    >>> blob_service_client = BlobServiceClient.from_connection_string(con_string)
    >>> for container in blob_service_client.list_containers():
    ...     print(container)
    ... 
    {'name': 'ciarandev-bakery-flow-storage-container', 'last_modified': datetime.datetime(2021, 4, 26, 8, 14, 17, tzinfo=datetime.timezone.utc), 'etag': '"0x8D9088B4947D0D2"', 'lease': {'status': 'unlocked', 'state': 'available', 'duration': None}, 'public_access': None, 'has_immutability_policy': False, 'deleted': None, 'version': None, 'has_legal_hold': False, 'metadata': None, 'encryption_scope': <azure.storage.blob._models.ContainerEncryptionScope object at 0x106ce0f10>}
    So looks like the connection string is correct.
    Kevin Kho

    Kevin Kho

    1 year ago
    I guess it might be a Docker container and K8s? Is your container for this example minimal?
    ciaran

    ciaran

    1 year ago
    Oh well actually we knew the connection string worked as it successfully stores the flow, it just falls over running it.
    The container is literally just the prefecthq/prefect:0.14.16-python3.8 image
    Kevin Kho

    Kevin Kho

    1 year ago
    Perfect. Will test later.
    ciaran

    ciaran

    1 year ago
    Cool thanks, appreciate it!
    And so that it's beside the links, this is the error I get:
    Failed to load and execute Flow's environment: AttributeError("'NoneType' object has no attribute 'rstrip'")
    Kevin Kho

    Kevin Kho

    1 year ago
    I replicated this on AzureStorage + LocalRun. Trying to figure out how to fix now.
    The issue here is that the AzureStorage class does not store the connection_string needed to retrieve your flow. This might be something we need to fix on our end (I raised the issue to the team). In the meantime, you need to pass an environment variable to get around this.
    with Flow(
        "azure_flow",
        run_config=LocalRun(env={"AZURE_STORAGE_CONNECTION_STRING": connection_string}),
        storage=storage.Azure(
            container="test",
            connection_string=connection_string,
        )
    ) as flow:
        hello_result = say_hello()
    ciaran

    ciaran

    1 year ago
    Hey @Kevin Kho thanks for looking into this! So do I even need to provide connection_string to storage.Azure? If the environment variable is the one it uses?
    Awesome, so adding that as an environment variable to the run config for K8s does seem to work. Be interested still in whether it's actually needed elsewhere
    Kevin Kho

    Kevin Kho

    1 year ago
    You need it in Azure storage because that’s for uploading
    ciaran

    ciaran

    1 year ago
    Ah Ok. I assumed that also covered the downloading
    Kevin Kho

    Kevin Kho

    1 year ago
    Well it does for other storage classes. We’ll be opening an issue for this.
    ciaran

    ciaran

    1 year ago
    Ahh gotcha.
    Whilst I have this Big Ol' Thread - do you have any folks that know of a nice way to deal with Prefect & AKS logs? The process of finding the logs for just 1 flow run is quite janky at the mo (and it's likely not a Prefect issue)
    Kevin Kho

    Kevin Kho

    1 year ago
    You work with Sean right? Yeah, unfortunately this is a Dask issue where Dask doesn’t natively move the logs around. Attaching the CloudHandler to the logger before it’s sent to Dask workers doesn’t quite work because the logger gets reinstantiated on the worker side. You need some kind of service to write the logs to another location and collect it from there.
    ciaran

    ciaran

    1 year ago
    So actually I think this is a slightly different, AKS-specific issue. But yeah I work with Sean, we're putting y'all through your paces 😅
    Kevin Kho

    Kevin Kho

    1 year ago
    Ah I haven’t used AKS. What do you think is the issue? I can take a look some time.
    ciaran

    ciaran

    1 year ago
    For reference, this is how I'm managing to view the logs for the flow I ran: https://github.com/pangeo-forge/pangeo-forge-azure-bakery/tree/add-k8s-cluster#logging
    Essentially unlike something like ECS where a flow is a task and I can click through to the exact Cloudwatch logs, with the Azure logging, I have to do some janky queries to get the container ID of the pod that ran the flow then query for the logs for that ID
    But given I've very little AKS experience, I'm probably just doing it a really backwards way lol
    Kevin Kho

    Kevin Kho

    1 year ago
    Oh ok I see what you mean. Yeah, outside of Prefect, but I have a bit of Azure experience in general. Will take a quick look and see what I find sometime.
    ciaran

    ciaran

    1 year ago
    Cool thanks, no rush as it's only really a pain if something goes wrong and I need to dig into the AKS side of the logging
    It might just be a 'thing' 🤷
    So, trying out setting up DaskExecutor on my Kubernetes flows
    def get_cluster():
        pod_spec = make_pod_spec(
            image="prefecthq/prefect:0.14.16-python3.8",
            labels={"flow": flow_name},
            memory_limit='4G',
            memory_request='4G'
        )
        return KubeCluster(pod_spec)
    
    ...
    
    
    executor=DaskExecutor(
        cluster_class=get_cluster(),
    )
    Gives me:
    Traceback (most recent call last):
      File "flow_test/manual_flow.py", line 8, in <module>
        from dask_kubernetes import KubeCluster, make_pod_spec
      File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/__init__.py", line 3, in <module>
        from .core import KubeCluster
      File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/core.py", line 19, in <module>
        from .objects import (
      File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/objects.py", line 34, in <module>
        SERIALIZATION_API_CLIENT = DummyApiClient()
      File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/objects.py", line 28, in __init__
        self.configuration = Configuration.get_default_copy()
    AttributeError: type object 'Configuration' has no attribute 'get_default_copy'
    Kevin Kho

    Kevin Kho

    1 year ago
    I think cluster_class should be callable. Can you try removing the ()?
    But I think the ideal situation is you put in KubeCluster there and use the cluster_kwargs to pass the pod spec.
    ciaran

    ciaran

    1 year ago
    Hmmm okay, what shape would the pod spec look like in cluster_kwargs?
    executor=DaskExecutor(
        cluster_class="dask_kubernetes.KubeCluster",
        cluster_kwargs={
            "image": "prefecthq/prefect:0.14.16-python3.8",
            "labels": {"flow": flow_name},
            "memory_limit": "4G",
            "memory_request": "4G"
        }
    )
    Gives the same error
    Kevin Kho

    Kevin Kho

    1 year ago
    This seems right. Can you try passing the class directly instead of a string? (probably will be the same though)
    Am I reading it right that the error is from the import though? I would check the versioning where it was registered and where the agent is running.
    ciaran

    ciaran

    1 year ago
    I should point out this is during registration
    Directly vs string didn't do much
    Huh
    Now it just fails even with it commented out.
    Oh it's failing on the from dask_kubernetes import KubeCluster, make_pod_spec line
    Kevin Kho

    Kevin Kho

    1 year ago
    Oh that’s what I meant from the last message. Yeah it’s failing on the import so it seems like something is wrong with Kubecluster independent of Prefect
    ciaran

    ciaran

    1 year ago
    Gotcha.
    dask-kubernetes 2021.3.0
    Ah. Looks like prefect[kubernetes] is installing an older version:
    kubernetes                           11.0.0b2
    So I guess either I bump kubernetes, or Prefect does 😅
    Kevin Kho

    Kevin Kho

    1 year ago
    I see
    ciaran

    ciaran

    1 year ago
    Shall I raise an issue @Kevin Kho?
    Kevin Kho

    Kevin Kho

    1 year ago
    Is bringing kubernetes down to that an option for your use case?
    ciaran

    ciaran

    1 year ago
    So the error I'm seeing is because Prefect is installing an older version of the kubernetes package.
    dask-kubernetes has that error if kubernetes is less than v12, but dask-kubernetes was installed via Prefect
    So Prefect is currently installing 1 lib that relies on another it installs, but the versions aren't compatible
    So maybe a lower version of dask-kubernetes might work, but it feels like we shouldn't drop versions down
    Kevin Kho

    Kevin Kho

    1 year ago
    Oh I see. You can open an issue for sure but i’m sure you’re aware that may take a while to complete due to compatibility testing
    I get what you mean now about bumping it up.
    ciaran

    ciaran

    1 year ago
    Sure, I mean I'll install a newer k8s version locally to see if that fixes it
    If it does, then I can at least recommend extending the versions of kubernetes that prefect installs.
    @Kevin Kho I had to manually bump kubernetes to 12.0.1 via pip because Poetry wouldn't let me do it (as Prefect's range doesn't include it). However, when I uncommented those dask_kubernetes imports, the flow registration happened without error. So I think there's definitely evidence of Prefect pulling in a kubernetes version that is incompatible with the dask_kubernetes version it pulls in.
    I've raised https://github.com/PrefectHQ/prefect/issues/4451 which hopefully contains enough information
    Kevin Kho

    Kevin Kho

    1 year ago
    Thanks for raising @ciaran! and for helping Ben!
    ciaran

    ciaran

    1 year ago
    👋 me again 😅 So, I manually installed kubernetes==12.0.1 and made an image for my agents/pods that is just the prefect==0.14.17 image, with the newer k8s version installed
    Managed to use KubeCluster in my Flow registration
    Running it, I get:
    HTTP response headers: <CIMultiDictProxy('Audit-Id': 'e60cf00e-93e6-4fb0-802e-75a298fa0867', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 16:01:26 GMT', 'Content-Length': '386')>
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"dask-root-857b2f2e-0k55nf\" is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot get resource \"pods/log\" in API group \"\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"name":"dask-root-857b2f2e-0k55nf","kind":"pods"},"code":403}
    Guessing I've hit more k8s errors...
    @Tyler Wanner probably your territory this one!
    Tyler Wanner

    Tyler Wanner

    1 year ago
    heya--we shipped a PR to master about the kubernetes version btw
    that's definitely a kubernetes RBAC permissioning problem
    your dask root pod is using pangeo-forge-azure-bakery/default as a serviceaccount, can you tell me what your agent is using?
    not sure what RBAC you have permissioned in your cluster at this point, but you could always just give that exact permission to that exact serviceaccount (I can walk you through that if you're not following) and see what happens
    ciaran

    ciaran

    1 year ago
    Here's the RBAC I applied when I created the agent:
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: prefect-agent-rbac
      namespace: ${BAKERY_NAMESPACE}
    rules:
    - apiGroups:
      - batch
      - extensions
      resources:
      - jobs
      verbs:
      - '*'
    - apiGroups:
      - ''
      resources:
      - events
      - pods
      verbs:
      - '*'
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: RoleBinding
    metadata:
      name: prefect-agent-rbac
      namespace: ${BAKERY_NAMESPACE}
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: prefect-agent-rbac
    subjects:
    - kind: ServiceAccount
      name: default
    Tyler Wanner

    Tyler Wanner

    1 year ago
    hmm it seems you have the pods/* permission there
    ciaran

    ciaran

    1 year ago
    If it helps, this is what I currently have set up in the flow (missing a few bits like imports/tasks etc)
    Nothing too wild I don't think
    Tyler Wanner

    Tyler Wanner

    1 year ago
    yeah namespaced RBAC gets pretty hairy at times, but I don't think it's anything in your flow--the error message and the RBAC permissions just don't seem to add up
    ciaran

    ciaran

    1 year ago
    😬
    Tyler Wanner

    Tyler Wanner

    1 year ago
    can you kubectl get your rolebinding and role just to make sure they're as-set
    I must be missing something
    ciaran

    ciaran

    1 year ago
    I can certainly do that. What's the syntax for that kind of command 😅
    Tyler Wanner

    Tyler Wanner

    1 year ago
    kubectl get rolebindings -n $BAKERY_NAMESPACE -o yaml
    ciaran

    ciaran

    1 year ago
    Ah nice, Thanks
    apiVersion: v1
    items:
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        annotations:
          kubectl.kubernetes.io/last-applied-configuration: |
            {"apiVersion":"rbac.authorization.k8s.io/v1beta1","kind":"RoleBinding","metadata":{"annotations":{},"name":"prefect-agent-rbac","namespace":"pangeo-forge-azure-bakery"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prefect-agent-rbac"},"subjects":[{"kind":"ServiceAccount","name":"default"}]}
        creationTimestamp: "2021-04-30T15:33:05Z"
        managedFields:
        - apiVersion: rbac.authorization.k8s.io/v1beta1
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                .: {}
                f:kubectl.kubernetes.io/last-applied-configuration: {}
            f:roleRef:
              f:apiGroup: {}
              f:kind: {}
              f:name: {}
            f:subjects: {}
          manager: kubectl-client-side-apply
          operation: Update
          time: "2021-04-30T15:33:05Z"
        name: prefect-agent-rbac
        namespace: pangeo-forge-azure-bakery
        resourceVersion: "3887"
        selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/pangeo-forge-azure-bakery/rolebindings/prefect-agent-rbac
        uid: 5e1207cb-10ab-4974-b2b6-91901c2d9f44
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: Role
        name: prefect-agent-rbac
      subjects:
      - kind: ServiceAccount
        name: default
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""
    Tyler Wanner

    Tyler Wanner

    1 year ago
    ok now can you try
    kubectl get roles -n $BAKERY_NAMESPACE prefect-agent-rbac -o yaml
    ciaran

    ciaran

    1 year ago
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{},"name":"prefect-agent-rbac","namespace":"pangeo-forge-azure-bakery"},"rules":[{"apiGroups":["batch","extensions"],"resources":["jobs"],"verbs":["*"]},{"apiGroups":[""],"resources":["events","pods"],"verbs":["*"]}]}
      creationTimestamp: "2021-04-30T15:33:05Z"
      managedFields:
      - apiVersion: rbac.authorization.k8s.io/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:rules: {}
        manager: kubectl-client-side-apply
        operation: Update
        time: "2021-04-30T15:33:05Z"
      name: prefect-agent-rbac
      namespace: pangeo-forge-azure-bakery
      resourceVersion: "3885"
      selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/pangeo-forge-azure-bakery/roles/prefect-agent-rbac
      uid: 5adabb97-9970-4634-9f82-dcb4bc91d801
    rules:
    - apiGroups:
      - batch
      - extensions
      resources:
      - jobs
      verbs:
      - '*'
    - apiGroups:
      - ""
      resources:
      - events
      - pods
      verbs:
      - '*'
    Tyler Wanner

    Tyler Wanner

    1 year ago
    ah wow i see what we're missing
    pods/log is not actually covered by pods
    ciaran

    ciaran

    1 year ago
    Ooh.
    Tyler Wanner

    Tyler Wanner

    1 year ago
    resources:
      - events
      - pods
      - pods/log
    ^^ try that in your role
    ciaran

    ciaran

    1 year ago
    Whahay, we're getting there! Another error mind:
    HTTP response headers: <CIMultiDictProxy('Audit-Id': '80b5f484-e14a-4097-94cb-f3339f4d1356', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 16:57:23 GMT', 'Content-Length': '332')>
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot create resource \"services\" in API group \"\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"kind":"services"},"code":403}
    Guessing I should try services in that list too?
    Tyler Wanner

    Tyler Wanner

    1 year ago
    you got it
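    if you're managing this with the Terraform role from earlier in the thread, the equivalent change would look roughly like this (a sketch only, with pods/log and services added alongside the original resources and verbs left wide open as before):
    resource "kubernetes_role" "prefect_agent" {
      metadata {
        name      = "prefect-agent"
        namespace = kubernetes_namespace.ci.metadata[0].name
      }

      rule {
        api_groups = ["batch", "extensions"]
        resources  = ["jobs"]
        verbs      = ["*"]
      }

      # The Dask pods also need to read pod logs and create services,
      # which the plain "pods" resource does not cover.
      rule {
        api_groups = [""]
        resources  = ["events", "pods", "pods/log", "services"]
        verbs      = ["*"]
      }
    }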
    ciaran

    ciaran

    1 year ago
    Pffft. K8s whizz me
    Okay, well the flow is running.
    I think I'm now onto Dask config fun
    Tyler Wanner

    Tyler Wanner

    1 year ago
    that is where I tag out 🙂 other members of the team will be able to help more there
    ciaran

    ciaran

    1 year ago
    Thanks for the help again @Tyler Wanner! Are you DC based? I owe you a drink of some kind when I can eventually fly out to Development Seed's office!
    Don't suppose you've got a name I can ping with regards to what seems to be a Dask Cluster just stuck in creation?
    Oh, scrap that, I don't think it's dask
    message: '0/2 nodes are available: 2 Insufficient memory.'
    Looks like I probably need to set adaptive scaling of some sort.
    Tyler Wanner

    Tyler Wanner

    1 year ago
    very DC, very based, I look forward to it!
    yeah that's claiming that your dask cluster's resource requests are higher than any individual node has available for scheduling--you may need to increase your instance type
    ciaran

    ciaran

    1 year ago
    Hmmm the VMs in this block
    default_node_pool {
        name            = "default"
        node_count      = 2
        vm_size         = "Standard_D2_v2"
        os_disk_size_gb = 30
      }
    Have 7GB and I'm asking for 4. I wonder if I've just used up the nodes I have? 🤷
    enable_auto_scaling, I should probably set that to true 😅
    Tyler Wanner

    Tyler Wanner

    1 year ago
    then yep should be able to just turn on auto scaling or increase your node count
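    for example, a sketch of that earlier default_node_pool with the cluster autoscaler switched on (the min/max counts here are just illustrative):
    default_node_pool {
      name                = "default"
      vm_size             = "Standard_D2_v2"
      os_disk_size_gb     = 30
      enable_auto_scaling = true
      min_count           = 2 # illustrative lower bound
      max_count           = 5 # illustrative upper bound
    }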
    ciaran

    ciaran

    1 year ago
    Autoscaling sounds good to me 😄 I'll let you know how it goes
    Hmmm. Turned on autoscaling and I still get
    reason: FailedScheduling
    message: '0/1 nodes are available: 1 Insufficient memory.'
    So I can have a maximum of 1000 nodes (all 7GB VMs...), surely if it needs 4GB, that spins up a new node?
    Oh, another error
    HTTP response headers: <CIMultiDictProxy('Audit-Id': '84d06158-c85b-4c3b-b093-6d0dc9884c5f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 17:47:48 GMT', 'Content-Length': '398')>
    HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"poddisruptionbudgets.policy is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot create resource \"poddisruptionbudgets\" in API group \"policy\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"group":"policy","kind":"poddisruptionbudgets"},"code":403}
    To the config!
    - apiGroups:
      - policy
      resources:
      - poddisruptionbudgets
      verbs:
      - '*'
    ?
    Been sat like this for 12 minutes 😧
    Tyler Wanner

    Tyler Wanner

    1 year ago
    any update?
    ciaran

    ciaran

    1 year ago
    After about 30 mins I gave up. There was no errors and no events in AKS that would point to something being wrong
    It looked like it was just sitting there...
    Tyler Wanner

    Tyler Wanner

    1 year ago
    did you happen to check the dask root pod logs?
    ciaran

    ciaran

    1 year ago
    It didn't have any 😬
    I'll double check this later (signing off for the weekend)
    Tyler Wanner

    Tyler Wanner

    1 year ago
    enjoy your weekend!
    ciaran

    ciaran

    1 year ago
    You too! Again, thanks for the help!
    Okay, sorry it took forever to get back to this.
    This is the events list that AKS has for my bakery's namespace (newest at the top). These warnings happened when I invoked the flow; it looks like that's solved as the dask-root pod is now green
    However it has 0 logs
    And doesn't appear to be scheduling anything
    Here's all the pods currently running
    The current state of the flow logs
    Current flow state
    Kevin Kho

    Kevin Kho

    1 year ago
    So it’s submitted for execution but not running?
    ciaran

    ciaran

    1 year ago
    Well, it says it's running
    And the Dask Scheduler spins up
    But then that's it
    Kevin Kho

    Kevin Kho

    1 year ago
    This is still the same flow code so you’re expecting the say_hello?
    ciaran

    ciaran

    1 year ago
    Yep, pretty simple flow
    Still going 🤣 Still 0 logs in the Dask Scheduler pod
    Kevin Kho

    Kevin Kho

    1 year ago
    Hey @ciaran, can you repost a new thread in the community channel and then I can get more eyes on it?
    ciaran

    ciaran

    1 year ago
    Sure!