# ask-community
c
Hey folks! Does anyone have an example repo they could share that's handling the deployment of a Prefect Agent with AKS? Currently starting off with spinning up a cluster with Terraform, but my k8s skills are sub-par so some useful starters would be handy. For reference, I've CDK'd up a Prefect Agent & respective cluster in ECS on AWS before, but that didn't involve k8s.
t
Hiya! Did you check out the CLI command? I'll grab the doc. I've also used Terraform to deploy agents myself, but `prefect agent kubernetes install` will generate a YAML manifest for you
c
Hey @Tyler Wanner yeah I got to the CLI part and then wanted to see if anyone had done it in an IaC manner - Is there a certain Terraform provider that can run that manifest?
t
there’s not a fully supported terraform provider for handling raw yaml at the moment but I can share with you an example agent terraform config if you’d like
c
If you could that'd be amazing
I'm out of my depth in Azure and k8s 🤣 AWS is my happy place
t
well with Prefect and AKS you shouldn't need to worry too too much about k8s to get going!
this isn't a "supported" install pattern so no guarantees but here's an all-in example that will create a prefect agent and the rbac as if you used
prefect agent kubernetes install
```hcl
resource "kubernetes_namespace" "ci" {
  metadata {
    name = "prefect"
  }
}

resource "kubernetes_role" "prefect_agent" {
  metadata {
    name      = "prefect-agent"
    namespace = kubernetes_namespace.ci.metadata[0].name
  }

  rule {
    api_groups = ["batch", "extensions"]
    resources  = ["jobs"]
    verbs      = ["*"]
  }
  rule {
    api_groups = [""]
    resources  = ["events", "pods"]
    verbs      = ["*"]
  }
}

resource "kubernetes_role_binding" "prefect_agent" {
  metadata {
    name      = "prefect-agent"
    namespace = kubernetes_namespace.ci.metadata[0].name
  }

  role_ref {
    api_group = "<http://rbac.authorization.k8s.io|rbac.authorization.k8s.io>"
    kind      = "Role"
    name      = kubernetes_role.prefect_agent.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.agent.metadata[0].name
    namespace = kubernetes_namespace.ci.metadata[0].name
  }
}

resource "kubernetes_service_account" "agent" {
  metadata {
    name      = "agent"
    namespace = kubernetes_namespace.ci.metadata[0].name
  }
}



resource "kubernetes_deployment" "deployment" {
  metadata {
    name      = var.app
    namespace = kubernetes_namespace.ci.metadata[0].name
  }

  spec {
    replicas = "1"

    selector {
      match_labels = {
        app = var.app
      }
    }

    template {
      metadata {
        labels = {
          app = var.app
        }
      }

      spec {
        service_account_name            = kubernetes_service_account.agent.metadata[0].name
        automount_service_account_token = true

        container {
          args    = ["prefect agent kubernetes start"]
          command = ["/bin/bash", "-c"]

          env {
            name  = "PREFECT__CLOUD__AGENT__AUTH_TOKEN"
            value = var.auth_token
          }

          env {
            name  = "PREFECT__CLOUD__AGENT__AGENT_ADDRESS"
            value = "http://:8080"
          }

          env {
            name  = "NAMESPACE"
            value = kubernetes_namespace.ci.metadata[0].name
          }

          env {
            name  = "PREFECT__CLOUD__AGENT__LABELS"
            value = "['foo']"
          }

          # extra env vars supplied as a map of name => value
          dynamic "env" {
            for_each = var.env_vars == null ? {} : var.env_vars
            content {
              name  = env.key
              value = env.value
            }
          }

          image             = "prefecthq/prefect:${var.prefect_version}"
          name              = var.app
          image_pull_policy = "Always"

          liveness_probe {
            http_get {
              path = "/api/health"
              port = 8080
            }

            failure_threshold     = 2
            initial_delay_seconds = 40
            period_seconds        = 40
          }

          resources {
            limits {
              cpu    = "500m"
              memory = "128Mi"
            }
          }
        }
      }
    }
  }
}

variable "auth_token" {}
variable "app" { default = "prefect-agent" }
variable "prefect_version" { default = "latest" }
variable "env_vars" { 
    type = map 
    default = null
}
```
👀 1
mind the agent configuration, especially the addition of the "foo" label (which will probably not pick up any of your flows, unless that label is present on the flow/flow run)
you'll need to supply an auth_token, for which you'll want to use a Prefect Cloud service account API key
Also I do believe we've removed the resources block from the generic install template. It's best to set them at a proper level, but you may just want to remove them to get started
c
Cool thanks for this! I'll take a look!
👍 1
t
let me know how it goes!
btw I left out the k8s provider configuration... for that, reference the provider docs directly, as your interaction pattern will determine how you set that up https://registry.terraform.io/providers/hashicorp/kubernetes/1.11.0/docs
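For example, a minimal provider block that just reuses your local kubeconfig (e.g. after `az aks get-credentials`) might look like this; a sketch only, and the context name is a placeholder:
```hcl
provider "kubernetes" {
  config_path    = "~/.kube/config"   # inherit local kube context
  config_context = "my-aks-cluster"   # placeholder: your AKS context name
}
```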
c
So I guess the alternative to your TF example is deploying the AKS cluster with TF then just using `kubectl` to apply that manifest?
@Tyler Wanner off-topic, but I'm pretty sure I watched a demo you did on Youtube earlier this week 🤣
🙌 1
t
yep @ciaran the easiest way to deploy the agent is `prefect agent kubernetes install --rbac --namespace NAMESPACE -t TOKEN | kubectl apply -n NAMESPACE -f -`
🤯 1
personally, i'm a big fan of declarative infrastructure code so I use a mixture of both to manage my k8s prefect agents
c
Awesome thanks, appreciate the help! Yeah coming from CDK I'm definitely leaning towards preferring doing this in Terraform
Just wanted to get my head around what that's doing under the hood too
t
well then you're asking all the right questions 👍
c
So, slightly dim question, your TF example, how does it get placed into the AKS cluster? I don't see a reference to a cluster
t
that's part of the kubernetes provider configuration
you can either inherit your local kube context or set up a link to a particular cluster
c
Interesting. Ah I see so actually the deployment of my AKS cluster is separate to my Prefect Agent setup? I'll need two terraform 'projects' kind of
t
in my experience, yes, but I'm not sure that's 100% true
If the provider configuration is dependent upon that cluster's state (it is the way I do it) or existence, then you'll be much better off separating them
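As an illustration of that dependency, wiring the provider straight to an AKS cluster's outputs looks roughly like this (a sketch assuming the 1.x kubernetes provider and a hypothetical `azurerm_kubernetes_cluster.main` resource); because the provider can't be configured until the cluster exists, keeping the cluster and the agent in separate Terraform projects avoids the chicken-and-egg problem:
```hcl
provider "kubernetes" {
  load_config_file       = false
  host                   = azurerm_kubernetes_cluster.main.kube_config[0].host
  client_certificate     = base64decode(azurerm_kubernetes_cluster.main.kube_config[0].client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.main.kube_config[0].client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.main.kube_config[0].cluster_ca_certificate)
}
```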
c
Whelp beat me to it.
Eurgh. Forgot how chicken and egg terraform was. Cloudformation spoils me
So @Tyler Wanner I'm trying to run:
```shell
prefect agent kubernetes install -t "<token>" --rbac -n "pangeo-forge-azure-bakery" -l "ciaran-dev" | kubectl apply -f --namespace=pangeo-forge-azure-bakery -
```
Based on https://docs.prefect.io/orchestration/agents/kubernetes.html#running-in-cluster but I'm getting:
```
error: Unexpected args: [-]
```
If I remove the `-` I instead get:
```
error: the path "--namespace=pangeo-forge-azure-bakery" does not exist
```
t
the - needs to be an argument passed to -f
can u try moving -f to after --namespace?
if that works then we’ll need to fix the docs
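(i.e. the same command with `-f -` moved to the end, so the manifest on stdin is what gets applied; a sketch using the values from the command above:)
```shell
prefect agent kubernetes install -t "<token>" --rbac -n "pangeo-forge-azure-bakery" -l "ciaran-dev" \
  | kubectl apply --namespace=pangeo-forge-azure-bakery -f -
```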
c
```
Warning: rbac.authorization.k8s.io/v1beta1 RoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 RoleBinding
Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
Error from server (NotFound): error when creating "STDIN": namespaces "pangeo-forge-azure-bakery" not found
```
Bit further haha
t
did you create the namespace?
c
Oh, does it need to be created beforehand?
I assumed this did it also
t
if you’re plopping into a non-default namespace you’ll have to create it yea
🙌 1
c
TIL
t
this is because the prefect CLI doesn’t know anything about your kubernetes environment it’s just generating a yaml manifest
so we can’t assume to create one
c
That's fair enough. Annoyingly it looks like `azurerm_kubernetes_cluster` doesn't offer that option. So `kubernetes` provider/kubectl it is
t
```shell
kubectl create namespace NAMESPACE
```
will do it
the cluster resource will not provide you with an interface for configuring namespaces, surely
c
🤷 Probably obvious but k8s is new to me haha
t
not much about k8s is “obvious” but once u get over the learning curve, it’s great magic
🤣 1
fortunately you won’t need to know much more than this for your prefect agent to be able to make good use of it but surely you will learn through debugging things that arise inevitably
c
🦜 Wahahay
🚀 1
marvin 1
t
happy k8sing!
c
Many thanks! Really appreciate it
Shall I raise an issue about the docs?
t
you may if you’d like, and I’ll take care of it
or feel free to submit a PR if you’d like to contribute
c
👍 Oh when in Rome.
🙌 1
I'll raise a PR
t
awesome thanks for that @ciaran 🙏
c
Hey @Tyler Wanner sorry for reviving this, I've been trying to run a very simple flow and I'm hitting this:
```
(403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'af45fea8-a5be-4d4a-a50c-fc8875a83144', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 23 Apr 2021 12:24:44 GMT', 'Content-Length': '329'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
```
The YAML I'm applying to the cluster looks like:
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prefect-agent
  name: prefect-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-agent
  template:
    metadata:
      labels:
        app: prefect-agent
    spec:
      containers:
      - args:
        - prefect agent kubernetes start
        command:
        - /bin/bash
        - -c
        env:
        - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
          value: ${PREFECT__CLOUD__AGENT__AUTH_TOKEN}
        - name: PREFECT__CLOUD__API
          value: https://api.prefect.io
        - name: NAMESPACE
          value: ${BAKERY_NAMESPACE}
        - name: IMAGE_PULL_SECRETS
          value: ''
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '${PREFECT__CLOUD__AGENT__LABELS}'
        - name: JOB_MEM_REQUEST
          value: ''
        - name: JOB_MEM_LIMIT
          value: ''
        - name: JOB_CPU_REQUEST
          value: ''
        - name: JOB_CPU_LIMIT
          value: ''
        - name: IMAGE_PULL_POLICY
          value: ''
        - name: SERVICE_ACCOUNT_NAME
          value: ''
        - name: PREFECT__BACKEND
          value: cloud
        - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
          value: http://:8080
        image: prefecthq/prefect:0.14.16-python3.8
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /api/health
            port: 8080
          initialDelaySeconds: 40
          periodSeconds: 40
        name: agent
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-agent-rbac
  namespace: default
rules:
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ''
  resources:
  - events
  - pods
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
```
And my `KubernetesRun` config looks like:
```python
run_config=KubernetesRun(
        image="prefecthq/prefect:0.14.16-python3.8",
        labels=json.loads(os.environ["PREFECT__CLOUD__AGENT__LABELS"]),
    ),
```
Any ideas? Appreciate it!
t
ah yep that's a missing RBAC permission
```
"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"pangeo-forge-azure-bakery\""
```
^^ this is saying that something in the default namespace, with the serviceaccount default, cannot create jobs in the namespace pangeo-forge-azure-bakery
is this log in the agent?
c
I think I've not templated the namespace name correctly.
t
that's certainly the right vein--let me check out the full file real quick
it looks like your agent deployment is still in the default namespace--did you mean to run it in your bakery namespace with the jobs it creates?
c
Yeah I'll be honest I think what I've done is used prefect to generate that yaml file, then not fully templated out the bakery namespace.
t
as long as you apply with -n NAMESPACE you should be ok
otherwise, just add `namespace: pangeo-forge-azure-bakery` on line 8
c
Inside the `spec` section?
t
no sorry, between line 7 and 8, same as in the other resources--in metadata
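(So the top of the Deployment in the manifest above would look roughly like this, with the namespace added under metadata:)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prefect-agent
  name: prefect-agent
  namespace: pangeo-forge-azure-bakery
```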
c
gotcha
t
your original file before that PR you linked should work if you just specify -n NAMESPACE in the kubectl apply command btw
but now you're explicitly stating that in the configuration
c
Ahh okay. Yeah I was hoping to get the overall template and then folks can dynamically set it. Mainly because there could be multiple deployments etc.
t
OK then don't make these changes, and just change the apply command
c
So, progress.
```
Failed to load and execute Flow's environment: AttributeError("'NoneType' object has no attribute 'rstrip'")
```
t
hmm i'm not sure on that one and i'll have to step away a while but glad to get past that one
c
Okay no problem thanks for the help so far!
t
can you check out the version of prefect that built the flow and the version of prefect in the flow storage image?
it sounds like a version mismatch
c
Both the agent and flow are set to use `prefecthq/prefect:0.14.16-python3.8` and my local install is also `0.14.16` on Python `3.8.6`
🤔 1
t
paging @Kevin Kho
☎️ 1
c
The only thing that may be different is I installed `prefect[azure, kubernetes]` locally...
k
Hi @ciaran! How big is your flow code?
Do you think you can share it?
k
How did you register this flow? Running the Python script?
c
Yep, just running it with python
k
My best advice is to try Github storage first to determine if it's a storage problem. I am also wondering if your labels come out as a list of strings. Could you check that?
c
It's available in the agent under the same env var name
(I had found that thread 😅 )
I'll double check my labels.
k
Github storage will help us check if it’s an Azure storage issue
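A minimal GitHub storage setup might look something like this (a sketch; the repo and path are placeholders, and `access_token_secret` should only be needed for private repos):
```python
from prefect.storage import GitHub

# placeholders: point these at the repo/path that holds the flow file
flow.storage = GitHub(
    repo="my-org/my-flows",
    path="flows/manual_flow.py",
    # access_token_secret="GITHUB_ACCESS_TOKEN",  # only for private repos
)
```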
c
```python
>>> import json
>>> import os
>>> json.loads(os.environ["PREFECT__CLOUD__AGENT__LABELS"])
['ciarandev']
```
Tried it in the python interpreter, looks like a list.
k
that looks good
c
I might have to try the Github storage monday. Finishing up for the day here 😅 Thanks for the pointers though, I'll push with Github storage first thing monday/this weekend if I get a chance and let you know!
k
Sure thing!
c
This was much more in my comfort zone on AWS and ECS 🤣 Feel like Azure & k8s is polar opposite lol
So just about to try this out, I'm assuming that `access_token_secret` in https://docs.prefect.io/api/latest/storage.html#github is only necessary for private repositories?
Gave it a go without the `access_token_secret`. I get:
Which makes sense as this value would be serialised on a non-repo storage target. So looks like it can at least download and attempt to run the flow via Github
So I'm guessing this is pointing us to Azure Storage being an issue
k
Hey @ciaran, that does seem to be the case. I would make sure that the values are working. I wanna mention that I literally handed someone your aws recipes code for them to understand the ECS task definition.
c
Haha oh goodness. I hope they found it handy! By making sure the values are working, what do you mean?
k
That the storage connection string works outside of Prefect
c
Ah right. Got you
k
I’ll be able to try your flow later today do help diagnose.
c
Okay, for reference I ran:
```python
>>> con_string = "<the con string>"
>>> from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
>>> blob_service_client = BlobServiceClient.from_connection_string(con_string)
>>> for container in blob_service_client.list_containers():
...     print(container)
... 
{'name': 'ciarandev-bakery-flow-storage-container', 'last_modified': datetime.datetime(2021, 4, 26, 8, 14, 17, tzinfo=datetime.timezone.utc), 'etag': '"0x8D9088B4947D0D2"', 'lease': {'status': 'unlocked', 'state': 'available', 'duration': None}, 'public_access': None, 'has_immutability_policy': False, 'deleted': None, 'version': None, 'has_legal_hold': False, 'metadata': None, 'encryption_scope': <azure.storage.blob._models.ContainerEncryptionScope object at 0x106ce0f10>}
```
So looks like the connection string is correct.
k
I guess it might be a Docker container and K8s? Is your container for this example minimal?
c
Oh well actually we knew the connection string worked as it successfully stores the flow, it just falls over running it.
The container is literally just the `prefecthq/prefect:0.14.16-python3.8` image
k
Perfect. Will test later.
c
Cool thanks, appreciate it!
And so that it's beside the links, this is the error I get:
Copy code
Failed to load and execute Flow's environment: AttributeError("'NoneType' object has no attribute 'rstrip'")
👍 1
k
I replicated this on AzureStorage + LocalRun. Trying to figure out how to fix now.
The issue here is that the `AzureStorage` class does not store the connection_string needed to retrieve your flow. This might be something we need to fix on our end (I raised the issue to the team). In the meantime, you need to pass an environment variable to get around this.
```python
with Flow(
    "azure_flow",
    run_config=LocalRun(env={"AZURE_STORAGE_CONNECTION_STRING": connection_string}),
    storage=storage.Azure(
        container="test",
        connection_string=connection_string,
    )
) as flow:
    hello_result = say_hello()
```
c
Hey @Kevin Kho thanks for looking into this! So do I even need to provide `connection_string` to `storage.Azure`? If the environment variable is the one it uses?
Awesome, so adding that as an environment variable to the run config for K8s does seem to work. Be interested still in whether it's actually needed elsewhere
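For reference, the K8s equivalent of that workaround is roughly this (a sketch mirroring the `LocalRun` example above; `connection_string` is whatever you already pass to Azure storage):
```python
from prefect.run_configs import KubernetesRun

run_config = KubernetesRun(
    image="prefecthq/prefect:0.14.16-python3.8",
    # work around Azure storage not persisting the connection string for flow download
    env={"AZURE_STORAGE_CONNECTION_STRING": connection_string},
)
```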
k
You need it in Azure storage because that’s for uploading
c
Ah Ok. I assumed that also covered the downloading
k
Well it does for other storage classes. We’ll be opening an issue for this.
c
Ahh gotcha.
Whilst I have this Big 'Ol Thread - Do you have any folks that know of a nice way to deal with Prefect & AKS logs? The process of finding the logs for just 1 flow run is quite janky at the mo (and it's likely not a Prefect issue)
k
You work with Sean right? Yeah, unfortunately this is a Dask issue where Dask doesn’t natively move the logs around. Attaching the CloudHandler to the logger before it’s sent to Dask workers doesn’t quite work because the logger gets reinstantiated on the worker side. You need some kind of service to write the logs to another location and collect it from there.
c
So actually I think this is a slightly different, AKS-specific issue. But yeah I work with Sean, we're putting y'all through your paces 😅
k
Ah I haven’t used AKS. What do you think is the issue? I can take a look some time.
c
For reference, this is how I'm managing to view the logs for the flow I ran: https://github.com/pangeo-forge/pangeo-forge-azure-bakery/tree/add-k8s-cluster#logging
Essentially unlike something like ECS where a flow is a task and I can click through to the exact Cloudwatch logs, with the Azure logging, I have to do some janky queries to get the container ID of the pod that ran the flow then query for the logs for that ID
But given I've very little AKS experience, I'm probably just doing it a really backwards way lol
k
Oh ok I see what you mean. Yeah, outside of Prefect, but I have a bit of Azure experience in general. Will take a quick look and see what I find sometime.
c
Cool thanks, no rush as it's only really a pain if something goes wrong and I need to dig into the AKS side of the logging
👍 1
It might just be a 'thing' 🤷
So, trying out setting up DaskExecutor on my Kubernetes flows
```python
def get_cluster():
    pod_spec = make_pod_spec(
        image="prefecthq/prefect:0.14.16-python3.8",
        labels={"flow": flow_name},
        memory_limit='4G',
        memory_request='4G'
    )
    return KubeCluster(pod_spec)

...


executor=DaskExecutor(
    cluster_class=get_cluster(),
)
```
Gives me:
```
Traceback (most recent call last):
  File "flow_test/manual_flow.py", line 8, in <module>
    from dask_kubernetes import KubeCluster, make_pod_spec
  File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/__init__.py", line 3, in <module>
    from .core import KubeCluster
  File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/core.py", line 19, in <module>
    from .objects import (
  File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/objects.py", line 34, in <module>
    SERIALIZATION_API_CLIENT = DummyApiClient()
  File "/Users/ciaran/Library/Caches/pypoetry/virtualenvs/pangeo-forge-azure-bakery-IMqFot_V-py3.8/lib/python3.8/site-packages/dask_kubernetes/objects.py", line 28, in __init__
    self.configuration = Configuration.get_default_copy()
AttributeError: type object 'Configuration' has no attribute 'get_default_copy'
```
k
I think `cluster_class` should be callable. Can you try removing the `()`?
But I think the ideal situation is you put in `KubeCluster` there and use the `cluster_kwargs` to pass the pod spec.
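(A sketch of that shape, assuming the classic dask_kubernetes API where `KubeCluster` accepts a `pod_template`; not verified against this exact version:)
```python
from dask_kubernetes import make_pod_spec

executor = DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "pod_template": make_pod_spec(
            image="prefecthq/prefect:0.14.16-python3.8",
            memory_limit="4G",
            memory_request="4G",
        ),
    },
)
```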
c
Hmmm okay, what shape would the pod spec look like in `cluster_kwargs`?
```python
executor=DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "image": "prefecthq/prefect:0.14.16-python3.8",
        "labels": {"flow": flow_name},
        "memory_limit": "4G",
        "memory_request": "4G"
    }
)
```
Gives the same error
k
This seems right. Can you try passing the class directly instead of a string? (probably will be the same though)
Am I reading it right that the error is from the import though? I would check the versioning where it was registered and where the agent is running.
c
I should point out this is during registration
Directly vs string didn't do much
Huh
Now it just fails even with it commented out.
Oh it's failing on the `from dask_kubernetes import KubeCluster, make_pod_spec` line
k
Oh that’s what I meant from the last message. Yeah it’s failing on the import so it seems like something is wrong with Kubecluster independent of Prefect
c
Gotcha.
```
dask-kubernetes 2021.3.0
```
Ah. Looks like `prefect[kubernetes]` is installing an older version
```
kubernetes                           11.0.0b2
```
So I guess either I bump Kubernetes, or Prefect does 😅
k
I see
c
Shall I raise an issue @Kevin Kho?
k
Is bringing kubernetes down to that an option for your use case?
c
So the error I'm seeing is because Prefect is installing an older version of the kubernetes library. `dask-kubernetes` has that error if kubernetes is less than `v12`, but `dask-kubernetes` was installed via Prefect
So Prefect is currently installing 1 lib that relies on another it installs, but the versions aren't compatible
So maybe a lower version of `dask-kubernetes` might work, but it feels like we shouldn't drop versions down
k
Oh I see. You can open an issue for sure but i’m sure you’re aware that may take a while to complete due to compatibility testing
I get what you mean now about bumping it up.
c
Sure, I mean I'll install a newer k8s version locally to see if that fixes it
If it does, then I can at least recommend extending the versions of `kubernetes` that `prefect` installs.
👍 1
@Kevin Kho I had to manually bump `kubernetes` to `12.0.1` via pip because Poetry wouldn't let me do it (as Prefect's range doesn't include it). However, when I uncommented those dask_kubernetes imports, the flow registration happened without error. So I think there's definitely evidence of Prefect pulling in a kubernetes version that is incompatible with the `dask_kubernetes` version it pulls in.
I've raised https://github.com/PrefectHQ/prefect/issues/4451 which hopefully contains enough information
k
Thanks for raising @ciaran! and for helping Ben!
c
👋 me again 😅 So, I manually installed `kubernetes==12.0.1` and made an image for my agents/pods that is just the `prefect==0.14.17` image, with the newer k8s version installed
Managed to use `KubeCluster` in my Flow registration
Running it, I get:
```
HTTP response headers: <CIMultiDictProxy('Audit-Id': 'e60cf00e-93e6-4fb0-802e-75a298fa0867', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 16:01:26 GMT', 'Content-Length': '386')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"dask-root-857b2f2e-0k55nf\" is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot get resource \"pods/log\" in API group \"\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"name":"dask-root-857b2f2e-0k55nf","kind":"pods"},"code":403}
```
Guessing I've hit more k8s errors...
@Tyler Wanner probably your territory this one!
t
heya--we shipped a PR to master about the kubernetes version btw
1
that's definitely a kubernetes RBAC permissioning problem
your dask root pod is using pangeo-forge-azure-bakery/default as a serviceaccount, can you tell me what your agent is using?
not sure what RBAC you have permissioned in your cluster at this point, but you could always just give that exact permission to that exact serviceaccount (I can walk you through that if you're not following) and see what happens
c
Here's the RBAC I applied when I created the agent:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-agent-rbac
  namespace: ${BAKERY_NAMESPACE}
rules:
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ''
  resources:
  - events
  - pods
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: ${BAKERY_NAMESPACE}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
```
t
hmm it seems you have the pods/* permission there
c
If it helps, this is what I currently have set up in the flow (missing a few bits like imports/tasks etc)
Nothing too wild I don't think
t
yeah namespaced RBAC gets pretty hairy at times, but I don't think it's anything in your flow--the error message and the RBAC permissions just don't seem to add up
c
😬
t
can you `kubectl get` your rolebinding and role just to make sure they're as-set
I must be missing something
c
I can certainly do that. What's the syntax for that kind of command 😅
t
kubectl get rolebindings -n $BAKERY_NAMESPACE -o yaml
c
Ah nice, Thanks
```yaml
apiVersion: v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"rbac.authorization.k8s.io/v1beta1","kind":"RoleBinding","metadata":{"annotations":{},"name":"prefect-agent-rbac","namespace":"pangeo-forge-azure-bakery"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prefect-agent-rbac"},"subjects":[{"kind":"ServiceAccount","name":"default"}]}
    creationTimestamp: "2021-04-30T15:33:05Z"
    managedFields:
    - apiVersion: rbac.authorization.k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:roleRef:
          f:apiGroup: {}
          f:kind: {}
          f:name: {}
        f:subjects: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: "2021-04-30T15:33:05Z"
    name: prefect-agent-rbac
    namespace: pangeo-forge-azure-bakery
    resourceVersion: "3887"
    selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/pangeo-forge-azure-bakery/rolebindings/prefect-agent-rbac
    uid: 5e1207cb-10ab-4974-b2b6-91901c2d9f44
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: prefect-agent-rbac
  subjects:
  - kind: ServiceAccount
    name: default
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
t
ok now can you try `kubectl get roles -n $BAKERY_NAMESPACE prefect-agent-rbac -o yaml`
c
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{},"name":"prefect-agent-rbac","namespace":"pangeo-forge-azure-bakery"},"rules":[{"apiGroups":["batch","extensions"],"resources":["jobs"],"verbs":["*"]},{"apiGroups":[""],"resources":["events","pods"],"verbs":["*"]}]}
  creationTimestamp: "2021-04-30T15:33:05Z"
  managedFields:
  - apiVersion: rbac.authorization.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:rules: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-04-30T15:33:05Z"
  name: prefect-agent-rbac
  namespace: pangeo-forge-azure-bakery
  resourceVersion: "3885"
  selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/pangeo-forge-azure-bakery/roles/prefect-agent-rbac
  uid: 5adabb97-9970-4634-9f82-dcb4bc91d801
rules:
- apiGroups:
  - batch
  - extensions
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - events
  - pods
  verbs:
  - '*'
```
t
ah wow I see what we're missing: `pods/log` is not actually covered by `pods`
c
Ooh.
t
```yaml
resources:
  - events
  - pods
  - pods/log
```
^^ try that in your role
c
running
🚀 1
🤣 1
Whahay, we're getting there! Another error mind:
```
HTTP response headers: <CIMultiDictProxy('Audit-Id': '80b5f484-e14a-4097-94cb-f3339f4d1356', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 16:57:23 GMT', 'Content-Length': '332')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot create resource \"services\" in API group \"\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"kind":"services"},"code":403}
```
Guessing I should try services in that list too?
t
you got it
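(So the core-API rule grows to something like this; the Dask scheduler also needs to create its service:)
```yaml
- apiGroups:
  - ''
  resources:
  - events
  - pods
  - pods/log
  - services
  verbs:
  - '*'
```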
c
Pffft. K8s whizz me
upvote 1
Okay, well the flow is running.
I think I'm now onto Dask config fun
🙏 1
t
that is where I tag out 🙂 other members of the team will be able to help more there
c
Thanks for the help again @Tyler Wanner! Are you DC based? I owe you a drink of some kind when I can eventually fly out to the Development Seed office!
🚀 1
Don't suppose you've got a name I can ping with regards to what seems to be a Dask Cluster just stuck in creation?
Oh, scrap that, I don't think it's dask
```
message: '0/2 nodes are available: 2 Insufficient memory.'
```
Looks like I probably need to set adaptive scaling of some sort.
👍 1
t
very DC, very based, I look forward to it!
🙌 1
yeah that's claiming that your dask cluster's resource requests are higher than any individual node has available for scheduling--you may need to increase your instance type
c
Hmmm the VMs in this block
```hcl
default_node_pool {
    name            = "default"
    node_count      = 2
    vm_size         = "Standard_D2_v2"
    os_disk_size_gb = 30
  }
```
Have 7GB and I'm asking for 4. I wonder if I've just used up the nodes I have? 🤷
`enable_auto_scaling`, I should probably set that to true 😅
t
then yep should be able to just turn on auto scaling or increase your node count
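For the default node pool above, that's roughly (a sketch; the min/max counts are placeholders):
```hcl
default_node_pool {
  name                = "default"
  vm_size             = "Standard_D2_v2"
  os_disk_size_gb     = 30
  enable_auto_scaling = true
  min_count           = 2   # placeholder lower bound
  max_count           = 5   # placeholder upper bound
}
```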
c
Autoscaling sounds good to me 😄 I'll let you know how it goes
Hmmm. Turned on autoscaling and I still get
```
reason: FailedScheduling
message: '0/1 nodes are available: 1 Insufficient memory.'
```
So I can have a maximum of 1000 nodes (all 7GB VMs...), surely if it needs 4GB, that spins up a new node?
Oh, another error
```
HTTP response headers: <CIMultiDictProxy('Audit-Id': '84d06158-c85b-4c3b-b093-6d0dc9884c5f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 30 Apr 2021 17:47:48 GMT', 'Content-Length': '398')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"poddisruptionbudgets.policy is forbidden: User \"system:serviceaccount:pangeo-forge-azure-bakery:default\" cannot create resource \"poddisruptionbudgets\" in API group \"policy\" in the namespace \"pangeo-forge-azure-bakery\"","reason":"Forbidden","details":{"group":"policy","kind":"poddisruptionbudgets"},"code":403}
```
To the config!
```yaml
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - '*'
```
?
Been sat like this for 12 minutes 😧
t
any update?
c
After about 30 mins I gave up. There were no errors and no events in AKS that would point to something being wrong
It looked like it was just sitting there...
t
did you happen to check the dask root pod logs?
c
It didn't have any 😬
I'll double check this later (signing off for the weekend)
t
enjoy your weekend! ✌️
c
You too! Again, thanks for the help!
Okay, sorry it took forever to get back to this.
This is the events list that AKS has for my bakery's namespace (newest at the top). These warnings happened when I invoked the flow; it looks like that's solved as the dask-root pod is now green
However it has 0 logs
And doesn't appear to be scheduling anything
Here's all the pods currently running
The current state of the flow logs
Current flow state
k
So it’s submitted for execution but not running?
c
Well, it says it's running
And the Dask Scheduler spins up
But then that's it
k
This is still the same flow code so you're expecting the `say_hello`?
c
Yep, pretty simple flow
Still going 🤣 Still 0 logs in the Dask Scheduler pod
k
Hey @ciaran, can you repost a new thread in the community channel and then I can get more eyes on it?
c
Sure!