c

ciaran

04/21/2021, 11:55 AM
Hey folks! Does anyone have an example repo they could share that handles deploying a Prefect Agent to AKS? I'm starting off by spinning up a cluster with Terraform, but my k8s skills are sub-par, so some useful starters would be handy. For reference, I've CDK'd up a Prefect Agent and its cluster in ECS on AWS before, but that didn't involve k8s.
t

Tyler Wanner

04/21/2021, 12:07 PM
Hiya! Did you check out the CLI command? I'll grab the doc. I've also used Terraform to deploy agents myself, but prefect agent kubernetes install will generate a YAML manifest for you.
c

ciaran

04/21/2021, 12:09 PM
Hey @Tyler Wanner yeah I got to the CLI part and then wanted to see if anyone had done it in an IaC manner. Is there a certain Terraform provider that can run that manifest?
t

Tyler Wanner

04/21/2021, 12:10 PM
there’s not a fully supported terraform provider for handling raw yaml at the moment but I can share with you an example agent terraform config if you’d like
c

ciaran

04/21/2021, 12:13 PM
If you could that'd be amazing
I'm out of my depth in Azure and k8s 🤣 AWS is my happy place
t

Tyler Wanner

04/21/2021, 12:18 PM
well with Prefect and AKS you shouldn't need to worry too too much about k8s to get going!
this isn't a "supported" install pattern so no guarantees, but here's an all-in example that will create a prefect agent and the RBAC as if you used prefect agent kubernetes install
Copy code
resource "kubernetes_namespace" "ci" {
  metadata {
    name = "prefect"
  }
}

resource "kubernetes_role" "prefect_agent" {
  metadata {
    name      = "prefect-agent"
    namespace = kubernetes_namespace.ci.metadata[0].name
  }

  rule {
    api_groups = ["batch", "extensions"]
    resources  = ["jobs"]
    verbs      = ["*"]
  }
  rule {
    api_groups = [""]
    resources  = ["events", "pods"]
    verbs      = ["*"]
  }
}

resource "kubernetes_role_binding" "prefect_agent" {
  metadata {
    name      = "prefect-agent"
    namespace = kubernetes_namespace.ci.metadata[0].name
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.prefect_agent.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.agent.metadata[0].name
    namespace = kubernetes_namespace.ci.metadata[0].name
  }
}

resource "kubernetes_service_account" "agent" {
  metadata {
    name      = "agent"
    namespace = kubernetes_namespace.ci.metadata[0].name
  }
}



resource "kubernetes_deployment" "deployment" {
  metadata {
    name      = var.app
    namespace = kubernetes_namespace.ci.metadata[0].name
  }

  spec {
    replicas = "1"

    selector {
      match_labels = {
        app = var.app
      }
    }

    template {
      metadata {
        labels = {
          app = var.app
        }
      }

      spec {
        service_account_name            = kubernetes_service_account.agent.metadata[0].name
        automount_service_account_token = true

        container {
          args    = ["prefect agent kubernetes start"]
          command = ["/bin/bash", "-c"]

          env {
            name  = "PREFECT__CLOUD__AGENT__AUTH_TOKEN"
            value = var.auth_token
          }

          env {
            name  = "PREFECT__CLOUD__AGENT__AGENT_ADDRESS"
            value = "http://:8080"
          }

          env {
            name  = "NAMESPACE"
            value = kubernetes_namespace.ci.metadata[0].name
          }

          env {
            name  = "PREFECT__CLOUD__AGENT__LABELS"
            value = "['foo']"
          }

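          # Extra environment variables from var.env_vars; for a map, the
          # iterator exposes each entry as env.key / env.value.
          dynamic "env" {
            for_each = var.env_vars
            content {
              name  = env.key
              value = env.value
            }
          }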

          image             = "prefecthq/prefect:${var.prefect_version}"
          name              = var.app
          image_pull_policy = "Always"

          liveness_probe {
            http_get {
              path = "/api/health"
              port = 8080
            }

            failure_threshold     = 2
            initial_delay_seconds = 40
            period_seconds        = 40
          }

          resources {
            limits {
              cpu    = "500m"
              memory = "128Mi"
            }
          }
        }
      }
    }
  }
}

variable "auth_token" {}
variable "app" { default = "prefect-agent" }
variable "prefect_version" { default = "latest" }
variable "env_vars" {
  type    = map(string)
  default = {} # for_each over null fails, so default to an empty map
}
👀 1
mind the agent configuration, especially the addition of the "foo" label (which will probably not pick up any of your flows, unless that label is present on the flow/ flow run)
you'll need to supply an auth_token, for which you'll want to use a Prefect Cloud serviceaccount api key
Also I do believe we've removed the resources block from the generic install template. It's best to set them at a proper level, but you may just want to remove them to get started
c

ciaran

04/21/2021, 12:28 PM
Cool thanks for this! I'll take a look!
👍 1
t

Tyler Wanner

04/21/2021, 12:40 PM
let me know how it goes!
btw I left out the k8s provider configuration... for that, reference the provider docs directly, as your interaction pattern will determine how you set that up https://registry.terraform.io/providers/hashicorp/kubernetes/1.11.0/docs
c

ciaran

04/21/2021, 12:57 PM
So I guess the alternative to your TF example is deploying the AKS cluster with TF then just using kubectl to apply that manifest?
@Tyler Wanner off-topic, but I'm pretty sure I watched a demo you did on Youtube earlier this week 🤣
🙌 1
t

Tyler Wanner

04/21/2021, 1:04 PM
yep @ciaran the easiest way to deploy the agent is
prefect agent kubernetes install --rbac --namespace NAMESPACE -t TOKEN | kubectl apply -n NAMESPACE -f -
🤯 1
personally, i'm a big fan of declarative infrastructure code so I use a mixture of both to manage my k8s prefect agents
c

ciaran

04/21/2021, 1:05 PM
Awesome thanks, appreciate the help! Yeah coming from CDK I'm definitely leaning towards doing this in Terraform
Just wanted to get my head around what that's doing under the hood too
t

Tyler Wanner

04/21/2021, 1:06 PM
well then you're asking all the right questions 👍
c

ciaran

04/21/2021, 1:15 PM
So, slightly dim question, your TF example, how does it get placed into the AKS cluster? I don't see a reference to a cluster
t

Tyler Wanner

04/21/2021, 1:16 PM
that's part of the kubernetes provider configuration
you can either inherit your local kube context or set up a link to a particular cluster
c

ciaran

04/21/2021, 1:21 PM
Interesting. Ah I see so actually the deployment of my AKS cluster is separate to my Prefect Agent setup? I'll need two terraform 'projects' kind of
t

Tyler Wanner

04/21/2021, 1:21 PM
in my experience, yes, but I'm not sure that's 100% true
If the provider configuration is dependent upon that cluster's state (it is the way I do it) or existence, then you'll be much better off separating them
c

ciaran

04/21/2021, 1:23 PM
Whelp beat me to it.
Eurgh. Forgot how chicken and egg terraform was. Cloudformation spoils me
So @Tyler Wanner I'm trying to run:
Copy code
prefect agent kubernetes install -t "<token>" --rbac -n "pangeo-forge-azure-bakery" -l "ciaran-dev" | kubectl apply -f --namespace=pangeo-forge-azure-bakery -
Based on https://docs.prefect.io/orchestration/agents/kubernetes.html#running-in-cluster But I'm getting:
Copy code
error: Unexpected args: [-]
If I remove the - I instead get:
Copy code
error: the path "--namespace=pangeo-forge-azure-bakery" does not exist
t

Tyler Wanner

04/21/2021, 2:28 PM
the - needs to be an argument passed to -f
can you try moving -f to after --namespace?
if that works then we’ll need to fix the docs
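To spell out the ordering issue: kubectl treats the token right after -f as the filename, so -f has to be followed directly by - (stdin). A minimal Python sketch of the corrected pipeline, assuming the prefect and kubectl CLIs are on PATH; the helper names are made up for illustration, and the flags are the ones used in the command above:

```python
import subprocess
from typing import List


def install_command(token: str, namespace: str, label: str) -> List[str]:
    """argv for generating the agent manifest (flags as used earlier in this thread)."""
    return [
        "prefect", "agent", "kubernetes", "install",
        "-t", token, "--rbac", "-n", namespace, "-l", label,
    ]


def apply_command(namespace: str) -> List[str]:
    """argv for kubectl apply; '-' must come directly after '-f' to mean stdin."""
    return ["kubectl", "apply", f"--namespace={namespace}", "-f", "-"]


def deploy_agent(token: str, namespace: str, label: str) -> None:
    """Pipe the generated manifest into kubectl (the namespace must already exist)."""
    manifest = subprocess.run(
        install_command(token, namespace, label),
        check=True, capture_output=True, text=True,
    ).stdout
    subprocess.run(apply_command(namespace), input=manifest, check=True, text=True)
```

Keeping the two argv lists as separate functions makes the flag order easy to check without touching a cluster.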
c

ciaran

04/21/2021, 2:29 PM
Copy code
Warning: rbac.authorization.k8s.io/v1beta1 RoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 RoleBinding
Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
Bit further haha
t

Tyler Wanner

04/21/2021, 2:31 PM
did you create the namespace?
c

ciaran

04/21/2021, 2:31 PM
Oh, does it need to be created beforehand?
I assumed this did it also
t

Tyler Wanner

04/21/2021, 2:31 PM
if you’re plopping into a non-default namespace you’ll have to create it yea
🙌 1
c

ciaran

04/21/2021, 2:32 PM
TIL
t

Tyler Wanner

04/21/2021, 2:32 PM
this is because the prefect CLI doesn’t know anything about your kubernetes environment it’s just generating a yaml manifest
so we can’t assume to create one
c

ciaran

04/21/2021, 2:33 PM
That's fair enough. Annoyingly it looks like azurerm_kubernetes_cluster doesn't offer that option.
So kubernetes provider/kubectl it is
t

Tyler Wanner

04/21/2021, 2:33 PM
Copy code
kubectl create namespace NAMESPACE
will do it
the cluster resource will not provide you with an interface for configuring namespaces, surely
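If you'd rather keep the namespace as a manifest than a one-off command, a minimal sketch is to generate a Namespace object and apply it before the agent manifest. This assumes you pipe the output into kubectl apply -f - (which accepts JSON as well as YAML); namespace_manifest is a hypothetical helper, not part of Prefect:

```python
import json


def namespace_manifest(name: str) -> dict:
    """Build a minimal v1 Namespace object for the agent to live in."""
    return {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": name}}


if __name__ == "__main__":
    # Pipe this into `kubectl apply -f -` before applying the agent manifest.
    print(json.dumps(namespace_manifest("pangeo-forge-azure-bakery"), indent=2))
```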
c

ciaran

04/21/2021, 2:37 PM
🤷 Probably obvious but k8s is new to me haha
t

Tyler Wanner

04/21/2021, 2:38 PM
not much about k8s is “obvious” but once u get over the learning curve, it’s great magic
🤣 1
fortunately you won’t need to know much more than this for your prefect agent to be able to make good use of it but surely you will learn through debugging things that arise inevitably
c

ciaran

04/21/2021, 2:40 PM
🦜 Wahahay
🚀 1
marvin 1
t

Tyler Wanner

04/21/2021, 2:41 PM
happy k8sing!
c

ciaran

04/21/2021, 2:41 PM
Many thanks! Really appreciate it
Shall I raise an issue about the docs?
t

Tyler Wanner

04/21/2021, 2:42 PM
you may if you’d like, and I’ll take care of it
or feel free to submit a PR if you’d like to contribute
c

ciaran

04/21/2021, 2:47 PM
👍 Oh when in Rome.
🙌 1
I'll raise a PR
t

Tyler Wanner

04/21/2021, 9:04 PM
awesome thanks for that @ciaran 🙏
c

ciaran

04/23/2021, 12:27 PM
Hey @Tyler Wanner sorry for reviving this, I've been trying to run a very simple flow and I'm hitting this:
Copy code
(403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'af45fea8-a5be-4d4a-a50c-fc8875a83144', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 23 Apr 2021 12:24:44 GMT', 'Content-Length': '329'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
The yaml I'm applying to the cluster looks like:
Copy code
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prefect-agent
  name: prefect-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-agent
  template:
    metadata:
      labels:
        app: prefect-agent
    spec:
      containers:
      - args:
        - prefect agent kubernetes start
        command:
        - /bin/bash
        - -c
        env:
        - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
          value: ${PREFECT__CLOUD__AGENT__AUTH_TOKEN}
        - name: PREFECT__CLOUD__API
          value: https://api.prefect.io
        - name: NAMESPACE
          value: ${BAKERY_NAMESPACE}
        - name: IMAGE_PULL_SECRETS
          value: ''
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '${PREFECT__CLOUD__AGENT__LABELS}'
        - name: JOB_MEM_REQUEST
          value: ''
        - name: JOB_MEM_LIMIT
          value: ''
        - name: JOB_CPU_REQUEST
          value: ''
        - name: JOB_CPU_LIMIT
          value: ''
        - name: IMAGE_PULL_POLICY
          value: ''
        - name: SERVICE_ACCOUNT_NAME
          value: ''
        - name: PREFECT__BACKEND
          value: cloud
        - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
          value: http://:8080
        image: prefecthq/prefect:0.14.16-python3.8
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /api/health
            port: 8080
          initialDelaySeconds: 40
          periodSeconds: 40
        name: agent
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-agent-rbac
  namespace: default
rules:
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ''
  resources:
  - events
  - pods
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
And my KubernetesRun config looks like:
Copy code
run_config=KubernetesRun(
        image="prefecthq/prefect:0.14.16-python3.8",
        labels=json.loads(os.environ["PREFECT__CLOUD__AGENT__LABELS"]),
    ),
Any ideas? Appreciate it!
t

Tyler Wanner

04/23/2021, 1:25 PM
ah yep that's a missing RBAC permission
Copy code
"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"pangeo-forge-azure-bakery\""
^^ this is saying that something in the default namespace, with the serviceaccount default, cannot create jobs in the namespace pangeo-forge-azure-bakery
is this log in the agent?
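For reading these 403 bodies quickly, here is a rough Python sketch that pulls out who was denied and where. The regex is written against the message format shown above and is an assumption about how stable that wording is; explain_forbidden is a hypothetical helper:

```python
import json
import re

# Matches the "User ... cannot ... resource ... in the namespace ..." part of the message.
FORBIDDEN_PATTERN = re.compile(
    r'User "system:serviceaccount:(?P<sa_namespace>[^:"]+):(?P<sa_name>[^"]+)" '
    r'cannot (?P<verb>\S+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<api_group>[^"]*)" '
    r'in the namespace "(?P<target_namespace>[^"]+)"'
)


def explain_forbidden(body: str) -> dict:
    """Parse a Kubernetes Status JSON body and extract the denied request."""
    status = json.loads(body)
    match = FORBIDDEN_PATTERN.search(status.get("message", ""))
    if match is None:
        raise ValueError("message does not look like a service-account RBAC denial")
    return match.groupdict()
```

If sa_namespace and target_namespace differ, as they do here (default vs pangeo-forge-azure-bakery), the caller is running in a different namespace from the one the Role and RoleBinding were meant to cover.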
c

ciaran

04/23/2021, 1:27 PM
I think I've not templated the namespace name correctly.
t

Tyler Wanner

04/23/2021, 1:29 PM
that's certainly the right vein--let me check out the full file real quick
it looks like your agent deployment is still in the default namespace--did you mean to run it in your bakery namespace with the jobs it creates?
c

ciaran

04/23/2021, 1:31 PM
Yeah I'll be honest I think what I've done is used prefect to generate that yaml file, then not fully templated out the bakery namespace.
t

Tyler Wanner

04/23/2021, 1:32 PM
as long as you apply with -n NAMESPACE you should be ok
otherwise, just add namespace: pangeo-forge-azure-bakery on line 8
c

ciaran

04/23/2021, 1:34 PM
Inside the spec section?
t

Tyler Wanner

04/23/2021, 1:34 PM
no sorry, between line 7 and 8, same as in the other resources--in metadata
c

ciaran

04/23/2021, 1:34 PM
gotcha
t

Tyler Wanner

04/23/2021, 1:36 PM
your original file before that PR you linked should work if you just specify -n NAMESPACE in the kubectl apply command btw
but now you're explicitly stating that in the configuration
c

ciaran

04/23/2021, 1:36 PM
Ahh okay. Yeah I was hoping to get the overall template and then folks can dynamically set it. Mainly because there could be multiple deployments etc.
t

Tyler Wanner

04/23/2021, 1:37 PM
OK then don't make these changes, and just change the apply command
c

ciaran

04/23/2021, 1:38 PM
So, progress.
Copy code
Failed to load and execute Flow's environment: AttributeError("'NoneType' object has no attribute 'rstrip'")
t

Tyler Wanner

04/23/2021, 1:41 PM
hmm i'm not sure on that one and i'll have to step away a while but glad to get past that one
c

ciaran

04/23/2021, 1:42 PM
Okay no problem thanks for the help so far!
t

Tyler Wanner

04/23/2021, 4:15 PM
can you check out the version of prefect that built the flow and the version of prefect in the flow storage image?
it sounds like a version mismatch
c

ciaran

04/23/2021, 4:17 PM
Both the agent and flow are set to use prefecthq/prefect:0.14.16-python3.8 and my local install is also 0.14.16 on Python 3.8.6
🤔 1
t

Tyler Wanner

04/23/2021, 4:17 PM
paging @Kevin Kho
☎️ 1
c

ciaran

04/23/2021, 4:18 PM
The only thing that may be different is I installed prefect[azure, kubernetes] locally...
k

Kevin Kho

04/23/2021, 4:19 PM
Hi @ciaran! How big is your flow code?
Do you think you can share it?
k

Kevin Kho

04/23/2021, 4:21 PM
How did you register this flow? Running the Python script?
c

ciaran

04/23/2021, 4:29 PM
Yep, just running it with python
k

Kevin Kho

04/23/2021, 4:33 PM
My best advice is to try GitHub storage first to determine if it's a storage problem. I am also wondering if your labels come out as a list of strings. Could you check that?
c

ciaran

04/23/2021, 4:34 PM
It's available in the agent under the same env var name
(I had found that thread 😅 )
I'll double check my labels.
k

Kevin Kho

04/23/2021, 4:35 PM
Github storage will help us check if it’s an Azure storage issue
c

ciaran

04/23/2021, 4:36 PM
Copy code
>>> import json
>>> import os
>>> json.loads(os.environ["PREFECT__CLOUD__AGENT__LABELS"])
['ciarandev']
Tried it in the python interpreter, looks like a list.
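The same check as a small reusable function, in case it helps to run it at registration time. This is only a sketch; parse_labels is a hypothetical helper, and it assumes the variable holds a JSON list as in the session above:

```python
import json
from typing import List


def parse_labels(raw: str) -> List[str]:
    """Parse a JSON-encoded label list and check it is a list of strings."""
    labels = json.loads(raw)
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError(f"expected a JSON list of strings, got {labels!r}")
    return labels
```

Note that json.loads needs double quotes, so a single-quoted value such as ['foo'] is rejected by this check.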
k

Kevin Kho

04/23/2021, 4:36 PM
that looks good
c

ciaran

04/23/2021, 4:36 PM
I might have to try the Github storage monday. Finishing up for the day here 😅 Thanks for the pointers though, I'll push with Github storage first thing monday/this weekend if I get a chance and let you know!
k

Kevin Kho

04/23/2021, 4:36 PM
Sure thing!
c

ciaran

04/23/2021, 4:38 PM
This was much more in my comfort zone on AWS and ECS 🤣 Feel like Azure & k8s is polar opposite lol
So just about to try this out, I'm assuming that access_token_secret in https://docs.prefect.io/api/latest/storage.html#github is only necessary for private repositories?
Gave it a go without the access_token_secret. I get:
Which makes sense as this value would be serialised on a non-repo storage target. So looks like it can at least download and attempt to run the flow via Github
So I'm guessing this is pointing us to Azure Storage being an issue
k

Kevin Kho

04/26/2021, 3:07 PM
Hey @ciaran, that does seem to be the case. I would make sure that the values are working. I wanna mention that I literally handed someone your aws recipes code for them to understand the ECS task definition.
c

ciaran

04/26/2021, 3:12 PM
Haha oh goodness. I hope they found it handy! By making sure the values are working, what do you mean?
k

Kevin Kho

04/26/2021, 3:13 PM
That the storage connection string works outside of Prefect
c

ciaran

04/26/2021, 3:16 PM
Ah right. Got you
k

Kevin Kho

04/26/2021, 3:16 PM
I’ll be able to try your flow later today to help diagnose.
c

ciaran

04/26/2021, 3:21 PM
Okay, for reference I ran:
Copy code
>>> con_string = "<the con string>"
>>> from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
>>> blob_service_client = BlobServiceClient.from_connection_string(con_string)
>>> for container in blob_service_client.list_containers():
...     print(container)
... 
{'name': 'ciarandev-bakery-flow-storage-container', 'last_modified': datetime.datetime(2021, 4, 26, 8, 14, 17, tzinfo=datetime.timezone.utc), 'etag': '"0x8D9088B4947D0D2"', 'lease': {'status': 'unlocked', 'state': 'available', 'duration': None}, 'public_access': None, 'has_immutability_policy': False, 'deleted': None, 'version': None, 'has_legal_hold': False, 'metadata': None, 'encryption_scope': <azure.storage.blob._models.ContainerEncryptionScope object at 0x106ce0f10>}
So looks like the connection string is correct.
k

Kevin Kho

04/26/2021, 3:23 PM
I guess it might be a Docker container and K8s? Is your container for this example minimal?
c

ciaran

04/26/2021, 3:23 PM
Oh well actually we knew the connection string worked as it successfully stores the flow, it just falls over running it.
The container is literally just the prefecthq/prefect:0.14.16-python3.8 image
k

Kevin Kho

04/26/2021, 3:24 PM
Perfect. Will test later.
c

ciaran

04/26/2021, 3:25 PM
Cool thanks, appreciate it!
And so that it's beside the links, this is the error I get:
Copy code
Failed to load and execute Flow's environment: AttributeError("'NoneType' object has no attribute 'rstrip'")
👍 1
k

Kevin Kho

04/26/2021, 10:04 PM
I replicated this on AzureStorage + LocalRun. Trying to figure out how to fix now.
The issue here is that the AzureStorage class does not store the connection_string needed to retrieve your flow. This might be something we need to fix on our end (I raised the issue to the team). In the meantime, you need to pass an environment variable to get around this.
Copy code
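from prefect import Flow, storage
from prefect.run_configs import LocalRun

# connection_string and the say_hello task are assumed to be defined earlier.
with Flow(
    "azure_flow",
    # Forward the connection string so the flow run can download the flow.
    run_config=LocalRun(env={"AZURE_STORAGE_CONNECTION_STRING": connection_string}),
    storage=storage.Azure(
        container="test",
        connection_string=connection_string,
    )
) as flow:
    hello_result = say_hello()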
c

ciaran

04/27/2021, 8:23 AM
Hey @Kevin Kho thanks for looking into this! So do I even need to provide connection_string to storage.Azure? If the environment variable is the one it uses?
Awesome, so adding that as an environment variable to the run config for K8s does seem to work. Be interested still in whether it's actually needed elsewhere
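For the Kubernetes case, a minimal sketch of how the env dict could be built at registration time rather than hard-coded. azure_run_env is a hypothetical helper, and it assumes the connection string is exported in the shell that registers the flow:

```python
import os
from typing import Dict, Mapping

ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"


def azure_run_env(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the connection string to forward to the flow-run container."""
    value = environ.get(ENV_VAR)
    if not value:
        raise RuntimeError(f"{ENV_VAR} is not set, so the flow run could not load its storage")
    return {ENV_VAR: value}
```

The result would then go into the run config, for example KubernetesRun(image=..., env=azure_run_env(), labels=...), mirroring the LocalRun(env=...) workaround above.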
k

Kevin Kho

04/27/2021, 1:07 PM
You need it in Azure storage because that’s for uploading
c

ciaran

04/27/2021, 1:08 PM
Ah Ok. I assumed that also covered the downloading
k

Kevin Kho

04/27/2021, 1:09 PM
Well it does for other storage classes. We’ll be opening an issue for this.
c

ciaran

04/27/2021, 1:11 PM
Ahh gotcha.
Whilst I have this Big 'Ol Thread - Do you have any folks that know of a nice way to deal with Prefect & AKS logs? The process of finding the logs for just 1 flow run is quite janky at the mo (and it's likely not a Prefect issue)
k

Kevin Kho

04/27/2021, 1:15 PM
You work with Sean right? Yeah, unfortunately this is a Dask issue where Dask doesn’t natively move the logs around. Attaching the CloudHandler to the logger before it’s sent to Dask workers doesn’t quite work because the logger gets reinstantiated on the worker side. You need some kind of service to write the logs to another location and collect it from there.
c

ciaran

04/27/2021, 1:29 PM
So actually I think this is a slightly different, AKS-specific issue. But yeah, I work with Sean, we're putting y'all through your paces 😅
k

Kevin Kho

04/27/2021, 1:30 PM
Ah I haven’t used AKS. What do you think is the issue? I can take a look some time.
c

ciaran

04/27/2021, 1:30 PM
For reference, this is how I'm managing to view the logs for the flow I ran: https://github.com/pangeo-forge/pangeo-forge-azure-bakery/tree/add-k8s-cluster#logging
Essentially unlike something like ECS where a flow is a task and I can click through to the exact Cloudwatch logs, with the Azure logging, I have to do some janky queries to get the container ID of the pod that ran the flow then query for the logs for that ID
But given I've very little AKS experience, I'm probably just doing it a really backwards way lol
k

Kevin Kho

04/27/2021, 1:35 PM
Oh ok I see what you mean. Yeah, outside of Prefect, but I have a bit of Azure experience in general. Will take a quick look and see what I find sometime.
c

ciaran

04/27/2021, 1:36 PM
Cool thanks, no rush as it's only really a pain if something goes wrong and I need to dig into the AKS side of the logging
👍 1
It might just be a 'thing' 🤷
So, trying out setting up DaskExecutor on my Kubernetes flows
Copy code
def get_cluster():
    pod_spec = make_pod_spec(
        image="prefecthq/prefect:0.14.16-python3.8",
        labels={"flow": flow_name},
        memory_limit='4G',
        memory_request='4G'
    )
    return KubeCluster(pod_spec)

...


executor=DaskExecutor(
    cluster_class=get_cluster(),
)
Gives me:
Copy code
Traceback (most recent call last):
  File "flow_test/manual_flow.py", line 8, in <module>
    from dask_kubernetes import KubeCluster, make_pod_spec
  File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/__init__.py", line 3, in <module>
    from .core import KubeCluster
  File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/core.py", line 19, in <module>
    from .objects import (
  File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/objects.py", line 34, in <module>
    SERIALIZATION_API_CLIENT = DummyApiClient()
  File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/objects.py", line 28, in __init__
    self.configuration = Configuration.get_default_copy()
AttributeError: type object 'Configuration' has no attribute 'get_default_copy'
k

Kevin Kho

04/27/2021, 4:16 PM
I think cluster_class should be callable. Can you try removing the ()?
But I think the ideal situation is you put in KubeCluster there and use the cluster_kwargs to pass the pod spec.
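To illustrate why the parentheses matter, here is a toy sketch. DeferredClusterExecutor is a hypothetical stand-in and not Prefect's DaskExecutor; it only shows the pattern of storing a factory plus kwargs and building the cluster later, when the run starts:

```python
from typing import Any, Callable, Dict, List, Optional


class DeferredClusterExecutor:
    """Hypothetical stand-in, NOT Prefect's DaskExecutor: stores a factory and
    only builds the cluster when start() is called."""

    def __init__(
        self,
        cluster_class: Callable[..., Any],
        cluster_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        if not callable(cluster_class):
            raise TypeError("cluster_class must be callable, not an already-built cluster")
        self.cluster_class = cluster_class
        self.cluster_kwargs = dict(cluster_kwargs or {})

    def start(self) -> Any:
        # The cluster is created here, at run time, not at registration time.
        return self.cluster_class(**self.cluster_kwargs)


created: List[Dict[str, Any]] = []


def get_cluster(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical factory; a real one would return a KubeCluster."""
    created.append(kwargs)
    return {"cluster": kwargs}


# Pass the function itself (no parentheses); nothing is built yet.
executor = DeferredClusterExecutor(get_cluster, {"n_workers": 2})
```

Writing get_cluster() instead would build the cluster on the machine registering the flow and hand the executor an instance rather than something it can call later.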
c

ciaran

04/27/2021, 4:20 PM
Hmmm okay, what shape would the pod spec look like in cluster_kwargs?
Copy code
executor=DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "image": "prefecthq/prefect:0.14.16-python3.8",
        "labels": {"flow": flow_name},
        "memory_limit": "4G",
        "memory_request": "4G"
    }
)
Gives the same error
k

Kevin Kho

04/27/2021, 4:28 PM
This seems right. Can you try passing the class directly instead of a string? (probably will be the same though)
Am I reading it right that the error is from the import though? I would check the versioning where it was registered and where the agent is running.
c

ciaran

04/27/2021, 4:32 PM
I should point out this is during registration
Directly vs string didn't do much
Huh
Now it just fails even with it commented out.
Oh it's failing on the from dask_kubernetes import KubeCluster, make_pod_spec line
k

Kevin Kho

04/27/2021, 4:35 PM
Oh that’s what I meant from the last message. Yeah it’s failing on the import so it seems like something is wrong with Kubecluster independent of Prefect
c

ciaran

04/27/2021, 4:35 PM
Gotcha.
Copy code
dask-kubernetes 2021.3.0
Ah. Looks like prefect[kubernetes] is installing an older version
Copy code
kubernetes                           11.0.0b2
So I guess either I bump Kubernetes, or Prefect does 😅
k

Kevin Kho

04/27/2021, 4:44 PM
I see
c

ciaran

04/27/2021, 4:50 PM
Shall I raise an issue @Kevin Kho?
k

Kevin Kho

04/27/2021, 4:51 PM
Is bringing kubernetes down to that an option for your use case?
c

ciaran

04/27/2021, 4:54 PM
So the error I'm seeing is because Prefect is installing an older version of the kubernetes package. dask-kubernetes has that error if k8s is less than v12, but dask-kubernetes was installed via Prefect
So Prefect is currently installing 1 lib that relies on another it installs, but the versions aren't compatible
So maybe a lower version of dask-kubernetes might work, but it feels like we shouldn't drop versions down
k

Kevin Kho

04/27/2021, 4:56 PM
Oh I see. You can open an issue for sure but i’m sure you’re aware that may take a while to complete due to compatibility testing
I get what you mean now about bumping it up.
c

ciaran

04/27/2021, 4:59 PM
Sure, I mean I'll install a newer k8s version locally to see if that fixes it
If it does, then I can at least recommend extending the versions of kubernetes that prefect installs.
👍 1
@Kevin Kho I had to manually bump kubernetes to 12.0.1 via pip because Poetry wouldn't let me do it (Prefect's range doesn't include it). However, when I uncommented those dask_kubernetes imports, the flow registration happened without error. So I think there's definitely evidence of Prefect pulling in a kubernetes version that is incompatible with the dask_kubernetes version it pulls in.
I've raised https://github.com/PrefectHQ/prefect/issues/4451 which hopefully contains enough information
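Until the pinned range changes, a quick guard at the top of the flow script can fail fast with a clearer message than the import traceback. A minimal sketch, using the v12 minimum reported in this thread; the helper names are hypothetical:

```python
import re


def major_version(version: str) -> int:
    """Extract the leading major number from a version string such as '11.0.0b2'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1))


def kubernetes_client_ok(version: str, minimum_major: int = 12) -> bool:
    """True if the kubernetes client meets the minimum reported in this thread."""
    return major_version(version) >= minimum_major
```

On Python 3.8+ the installed version can be read with importlib.metadata.version("kubernetes") and passed straight to kubernetes_client_ok.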
k

Kevin Kho

04/28/2021, 1:30 PM
Thanks for raising @ciaran! and for helping Ben!
c

ciaran

04/30/2021, 4:04 PM
👋 me again 😅 So, I manually installed kubernetes==12.0.1 and made an image for my agents/pods that is just the prefect==0.14.17 image, with the newer k8s version installed
Managed to use KubeCluster in my Flow registration
Running it, I get:
Copy code
HTTP response headers: <CIMultiDictProxy('Audit-Id': 'e60cf00e-93e6-4fb0-802e-75a298fa0867', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 16:01:26 GMT', 'Content-Length': '386')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"dask-root-857b2f2e-0k55nf\" is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot get resource \"pods/log\" in API group \"\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"name":"dask-root-857b2f2e-0k55nf","kind":"pods"},"code":403}
Guessing I've hit more k8s errors...
@Tyler Wanner probably your territory this one!
t

Tyler Wanner

04/30/2021, 4:10 PM
heya--we shipped a PR to master about the kubernetes version btw
that's definitely a kubernetes RBAC permissioning problem
your dask root pod is using pangeo-forge-azure-bakery/default as a serviceaccount, can you tell me what your agent is using?
not sure what RBAC you have permissioned in your cluster at this point, but you could always just give that exact permission to that exact serviceaccount (I can walk you through that if you're not following) and see what happens
c

ciaran

04/30/2021, 4:15 PM
Here's the RBAC I applied when I created the agent:
Copy code
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-agent-rbac
  namespace: ${BAKERY_NAMESPACE}
rules:
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ''
  resources:
  - events
  - pods
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: ${BAKERY_NAMESPACE}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
t

Tyler Wanner

04/30/2021, 4:22 PM
hmm it seems you have the pods/* permission there
c

ciaran

04/30/2021, 4:26 PM
If it helps, this is what I currently have set up in the flow (missing a few bits like imports/tasks etc)
Nothing too wild I don't think
t

Tyler Wanner

04/30/2021, 4:30 PM
yeah namespaced RBAC gets pretty hairy at times, but I don't think it's anything in your flow--the error message and the RBAC permissions just don't seem to add up
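One thing possibly worth checking here: in Kubernetes RBAC, subresources such as pods/log are matched separately from their parent resource, so a rule that lists only pods would not cover the get on pods/log in the error above. A minimal sketch of the Role with that subresource added, generated as JSON for kubectl apply -f -; agent_role is a hypothetical helper and the rest of the Role mirrors the one pasted earlier:

```python
import json


def agent_role(namespace: str, name: str = "prefect-agent-rbac") -> dict:
    """Role like the one applied above, with the pods/log subresource added."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {"apiGroups": ["batch", "extensions"], "resources": ["jobs"], "verbs": ["*"]},
            {
                "apiGroups": [""],
                # "pods" alone does not match the "pods/log" subresource.
                "resources": ["events", "pods", "pods/log"],
                "verbs": ["*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(agent_role("pangeo-forge-azure-bakery"), indent=2))
```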
c

ciaran

04/30/2021, 4:30 PM
😬
t

Tyler Wanner

04/30/2021, 4:30 PM
can you kubectl get your rolebinding and role just to make sure they're as-set
I must be missing something
c

ciaran

04/30/2021, 4:31 PM
I can certainly do that. What's the syntax for that kind of command 😅
t

Tyler Wanner

04/30/2021, 4:34 PM
kubectl get rolebindings -n $BAKERY_NAMESPACE -o yaml
c

ciaran

04/30/2021, 4:35 PM
Ah nice, Thanks
Copy code
apiVersion: v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"rbac.authorization.k8s.io/v1beta1","kind":"RoleBinding","metadata":{"annotations":{},"name":"prefect-agent-rbac","namespace":"pangeo-forge-azure-bakery"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prefect-agent-rbac"},"subjects":[{"kind":"ServiceAccount","name":"default"}]}
    creationTimestamp: "2021-04-30T15:33:05Z"
    managedFields:
    - apiVersion: rbac.authorization.k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:roleRef:
          f:apiGroup: {}
          f:kind: {}
          f:name: {}
        f:subjects: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: "2021-04-30T15:33:05Z"
    name: prefect-agent-rbac
    namespace: pangeo-forge-azure-bakery
    resourceVersion: "3887"
    selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/pangeo-forge-azure-bakery/rolebindings/prefect-agent-rbac
    uid: 5e1207cb-10ab-4974-b2b6-91901c2d9f44
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: prefect-agent-rbac
  subjects:
  - kind: ServiceAccount
    name: default
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
t

Tyler Wanner

04/30/2021, 4:39 PM
ok now can you try
kubectl get roles -n $BAKERY_NAMESPACE prefect-agent-rbac -o yaml
c

ciaran

04/30/2021, 4:41 PM
Copy code
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{},"name":"prefect-agent-rbac","namespace":"pangeo-forge-azure-bakery"},"rules":[{"apiGroups":["batch","extensions"],"resources":["jobs"],"verbs":["*"]},{"apiGroups":[""],"resources":["events","pods"],"verbs":["*"]}]}
  creationTimestamp: "2021-04-30T15:33:05Z"
  managedFields:
  - apiVersion: rbac.authorization.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:rules: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-04-30T15:33:05Z"
  name: prefect-agent-rbac
  namespace: pangeo-forge-azure-bakery
  resourceVersion: "3885"
  selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/pangeo-forge-azure-bakery/roles/prefect-agent-rbac
  uid: 5adabb97-9970-4634-9f82-dcb4bc91d801
rules:
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - events
  - pods
  verbs:
  - '*'
t

Tyler Wanner

04/30/2021, 4:53 PM
ah wow i see what we're missing
pods/log is not actually covered by pods
c

ciaran

04/30/2021, 4:53 PM
Ooh.
t

Tyler Wanner

04/30/2021, 4:54 PM
Copy code
resources:
  - events
  - pods
  - pods/log
^^ try that in your role
c

ciaran

04/30/2021, 4:55 PM
running
🚀 1
🤣 1
Whahay, we're getting there! Another error mind:
Copy code
HTTP response headers: <CIMultiDictProxy('Audit-Id': '80b5f484-e14a-4097-94cb-f3339f4d1356', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 16:57:23 GMT', 'Content-Length': '332')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot create resource \"services\" in API group \"\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"kind":"services"},"code":403}
Guessing I should try services in that list too?
t

Tyler Wanner

04/30/2021, 5:00 PM
you got it
c

ciaran

04/30/2021, 5:00 PM
Pffft. K8s whizz me
upvote 1
Okay, well the flow is running.
I think I'm now onto Dask config fun
🙏 1
t

Tyler Wanner

04/30/2021, 5:05 PM
that is where I tag out 🙂 other members of the team will be able to help more there
c

ciaran

04/30/2021, 5:06 PM
Thanks for the help again @Tyler Wanner! Are you DC based? I owe you a drink of some kind when I can eventually fly out to Development Seed's office!
🚀 1
Don't suppose you've got a name I can ping with regards to what seems to be a Dask Cluster just stuck in creation?
Oh, scrap that, I don't think it's dask
Copy code
message: '0/2 nodes are available: 2 Insufficient memory.'
Looks like I probably need to set adaptive scaling of some sorts.
👍 1
t

Tyler Wanner

04/30/2021, 5:09 PM
very DC, very based, I look forward to it!
🙌 1
yeah that's claiming that your dask cluster's resource requests are higher than any individual node has available for scheduling--you may need to increase your instance type
c

ciaran

04/30/2021, 5:10 PM
Hmmm the VMs in this block
Copy code
default_node_pool {
    name            = "default"
    node_count      = 2
    vm_size         = "Standard_D2_v2"
    os_disk_size_gb = 30
  }
Have 7GB and I'm asking for 4. I wonder if I've just used up the nodes I have? 🤷
enable_auto_scaling, I should probably set that to true 😅
t

Tyler Wanner

04/30/2021, 5:17 PM
then yep should be able to just turn on auto scaling or increase your node count
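A rough illustration of why a 4 GB request can fail on 7 GB nodes: the scheduler compares a pod's memory request against each node's allocatable memory minus what other pods on that node have already requested. Allocatable is lower than the VM's total because AKS reserves some memory for the kubelet and system components. The fragment below is only a sketch; the pod name, container name, and image are placeholders rather than anything taken from this setup.

```yaml
# Hypothetical worker pod; the name and image are placeholders for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: dask-worker-example
spec:
  containers:
  - name: dask-worker
    image: daskdev/dask:latest
    resources:
      requests:
        # Scheduling needs a single node with at least this much
        # unrequested allocatable memory; otherwise the pod stays Pending
        # with "Insufficient memory".
        memory: 4Gi
      limits:
        memory: 4Gi
```

If a node already hosts pods whose requests add up to a few gigabytes, a second 4Gi request no longer fits, even though the VM nominally has 7 GB.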
c

ciaran

04/30/2021, 5:20 PM
Autoscaling sounds good to me 😄 I'll let you know how it goes
Hmmm. Turned on autoscaling and I still get
Copy code
reason: FailedScheduling
message: '0/1 nodes are available: 1 Insufficient memory.'
So I can have a maximum of 1000 nodes (all 7GB VMs...), surely if it needs 4GB, that spins up a new node?
Oh, another error
Copy code
HTTP response headers: <CIMultiDictProxy('Audit-Id': '84d06158-c85b-4c3b-b093-6d0dc9884c5f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 17:47:48 GMT', 'Content-Length': '398')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"poddisruptionbudgets.policy is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot create resource \"poddisruptionbudgets\" in API group \"policy\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"group":"policy","kind":"poddisruptionbudgets"},"code":403}
To the config!
Copy code
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - '*'
?
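Pulling together the permissions that came up in this thread, a sketch of what the full Role might look like. This only collects the rules discussed above (`pods/log`, `services`, `poddisruptionbudgets`); the thread does not confirm that this is the complete set the flow needs, so treat it as a starting point. Subresources such as `pods/log` have to be listed explicitly because granting `pods` does not include them.

```yaml
# Consolidated Role assembled from the rules discussed in this thread.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-agent-rbac
  namespace: ${BAKERY_NAMESPACE}   # placeholder, substitute before applying
rules:
# Original rule: lets the agent manage flow-run jobs.
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
# Core API group: pods/log and services were added after the 403 errors.
- apiGroups:
  - ''
  resources:
  - events
  - pods
  - pods/log
  - services
  verbs:
  - '*'
# Added after the poddisruptionbudgets 403 error.
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - '*'
```

The `'*'` verbs mirror the original manifest; a narrower verb list would be tighter but is not something worked out in this conversation.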
Been sat like this for 12 minutes 😧
t

Tyler Wanner

04/30/2021, 8:04 PM
any update?
c

ciaran

04/30/2021, 8:05 PM
After about 30 mins I gave up. There were no errors and no events in AKS that would point to something being wrong
It looked like it was just sitting there...
t

Tyler Wanner

04/30/2021, 8:05 PM
did you happen to check the dask root pod logs?
c

ciaran

04/30/2021, 8:06 PM
It didn't have any 😬
I'll double check this later (signing off for the weekend)
t

Tyler Wanner

04/30/2021, 8:07 PM
enjoy your weekend! ✌️
c

ciaran

04/30/2021, 8:08 PM
You too! Again, thanks for the help!
Okay, sorry it took forever to get back to this.
This is the events list that AKS has for my bakery namespace (newest at the top). These warnings happened when I invoked the flow; it looks like that's solved, as the dask-root pod is now green
However it has 0 logs
And doesn't appear to be scheduling anything
Here's all the pods currently running
The current state of the flow logs
Current flow state
k

Kevin Kho

05/04/2021, 4:41 PM
So it’s submitted for execution but not running?
c

ciaran

05/04/2021, 4:42 PM
Well, it says it's running
And the Dask Scheduler spins up
But then that's it
k

Kevin Kho

05/04/2021, 4:45 PM
This is still the same flow code so you’re expecting the say_hello?
c

ciaran

05/05/2021, 9:22 AM
Yep, pretty simple flow
Still going 🤣 Still 0 logs in the Dask Scheduler pod
k

Kevin Kho

05/05/2021, 1:33 PM
Hey @ciaran, can you repost a new thread in the community channel and then I can get more eyes on it?
c

ciaran

05/05/2021, 1:33 PM
Sure!