# ask-community
d
Hi folks, I have a kubernetes infrastructure setup with the namespace set to `prefect2-worker-dev` [image below]. When I try to execute a flow run in that infrastructure, it gives the following error.
Submission failed. kubernetes.client.exceptions.ApiException: (403) Reason: Forbidden HTTP response headers: HTTPHeaderDict({'Audit-Id': 'e6<<SOME STUFF HERE>>37:14 GMT', 'Content-Length': '330'}) HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:prefect2-worker-dev:prefect-worker\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"default\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
I am unsure as to what actually determines which namespace the job is created in. [prefect v2.10.4 on python 3.10]
👀 1
d
How do I get the cluster UID? 😅
r
I think you have to set it in helm
d
Unsure, but the link you have posted seems like a different issue. In my use case, the worker is trying to create a job in the `default` NS and not the NS I have explicitly provided, which results in an exception because the SA has no access to the `default` NS. My question is: why is the SA trying to run the job in the `default` NS and not the one I explicitly provided, and how do I change it?
r
Yeah, I am looking back at when I had similar error messages (4/5 months ago) - at the time my fix was to create a ClusterRoleBinding for the service account,
before they introduced what the link describes.
d
I'm checking prefect-helm for the worker and it does have a cluster role binding set by default: https://github.com/PrefectHQ/prefect-helm/blob/main/charts/prefect-worker/templates/rolebinding.yaml
r
I am sure @jawnsy will assist when he is around
👀 1
d
OK, I've run some tests. I deployed 2 sets of workers in 2 NSs. NS#1: helm_workers [uses helm-deployed workers]. NS#2: test_workers [uses the prefect kubernetes manifest command worker]. When I only have helm workers, it always tries to run jobs in the `default` NS no matter what I put in the KubernetesJob during deployment. With the same deployment, when I turn down the helm workers and switch to the manifest worker, the SA automatically picks the NS specified in the KubernetesJob during deployment. So either there's something wrong in the helm chart OR I am not deploying the helm chart correctly. I use
helm install --namespace=prefect2-worker-dev --values=./worker.dev.yaml prefect-worker prefect/prefect-worker
to deploy.
j
That's very strange; the jobs that get created are controlled by the Kubernetes worker (via KubernetesJob, as you say), so as long as the namespace is set in the work pool when creating it (and you're not overriding it in the job you create - see screenshot), they should go to the correct namespace. What version of prefect & prefect-kubernetes do you have?
d
Prefect is 2.10.4
I don't think I use prefect-kubernetes
Yes, I just checked - my pyproject config doesn't have prefect-kubernetes specified
from prefect.infrastructure.kubernetes import KubernetesJob
I am using this as infrastructure.
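(For context, a minimal sketch of pinning the namespace on this block - the block name, image tag, and save call here are illustrative, not from the thread:)
```python
from prefect.infrastructure.kubernetes import KubernetesJob

# Hypothetical example: set the namespace explicitly and save the block
# so deployments can reference it by name.
k8s_job = KubernetesJob(
    image="prefecthq/prefect:2.10.4-python3.10",  # assumed image tag
    namespace="prefect2-worker-dev",              # namespace from this thread
)
k8s_job.save("worker-dev-infra", overwrite=True)
```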
a
Hey @Deceivious, if you're using the `KubernetesJob` infrastructure block, then you'll need to use an agent instead of the Kubernetes worker. You can use the Prefect agent helm chart to deploy an agent instead of the worker helm chart.
d
This is a bit confusing to me. I'm new to Kubernetes and helm. But isn't a Prefect worker a native concept of Prefect that has nothing to do with Kubernetes? If I can deploy a prefect agent with helm, why is it that deploying a worker with helm causes issues?
@alex
a
Workers and agents are Prefect-specific concepts. Agents work with infrastructure blocks like the `KubernetesJob` block and poll for flow runs from `prefect-agent` typed work pools. Workers are a newer concept (they're still in beta), but they are like an agent and an infrastructure block combined, and they poll for flow runs from typed work pools. All this means that you either need to use an agent + infra block or a worker.
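(To make the agent-side pairing concrete, a rough sketch under Prefect 2.10.x; the flow and deployment names are made up:)
```python
from prefect import flow
from prefect.deployments import Deployment
from prefect.infrastructure.kubernetes import KubernetesJob

@flow
def my_flow():
    print("hello")

# Agent path: the KubernetesJob block travels with the deployment, and an
# agent polling a prefect-agent typed work pool submits the run.
Deployment.build_from_flow(
    flow=my_flow,
    name="agent-based",  # hypothetical deployment name
    infrastructure=KubernetesJob(namespace="prefect2-worker-dev"),
    apply=True,
)
```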
d
Thanks @alex 🫡
s
@alex I'm facing a similar issue on my machine, where I'm not using any helm chart. Can you suggest a way forward to overcome this error?
Submission failed. kubernetes.client.exceptions.ApiException: (403) Reason: Forbidden HTTP response headers: HTTPHeaderDict({'Audit-Id': '85efeb11-33b6-4a58-91ad-d0a60bffce1a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': '513745c7-aa7c-4245-8349-ec7b488ba2ba', 'X-Kubernetes-Pf-Prioritylevel-Uid': '17e7873d-25c0-45c1-955e-4c4692a6bb21', 'Date': 'Fri, 21 Apr 2023 18:25:18 GMT', 'Content-Length': '311'}) HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:prefect:prefect\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"default\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
d
@sjammula If you aren't using the helm chart but are using the output of `prefect kubernetes manifest agent`, you need to specify which namespace to run it in [there's a CLI parameter for it]. And when you deploy the flow, you need to ensure that the KubernetesJob has the same namespace.
Hi @alex, sorry to ping again. I am still a bit confused. I kind of get the difference between agent and worker. BUT why is it that deploying the worker with the `prefect kubernetes manifest` command works with the correct queue while the `helm` deployment fails? If both workers are equivalent regardless of deployment method, either both should work or both should fail. I'm unsure about the change in behavior based on deployment method.
a
The manifest that gets generated from `prefect kubernetes manifest` is a manifest for an agent deployment. We haven't added a worker manifest to that CLI command yet. You might be deploying an agent with one method and a worker with the other, depending on which helm chart you're using.
🙌 1
d
Thanks
r
Just so that I am clear, what are the advantages of a worker in a k8s env?
👀 1
j
@alex I would really appreciate some more detail on the differences between an agent and a worker.
Are there some docs?
j
@Joshua Greenhalgh Agents were our first-generation system for collecting/running work. Workers are an updated version. If you're just starting out today, use a worker. There are docs here: https://docs.prefect.io/2.10.12/concepts/work-pools/
upvote 1
a
To add on to what @jawnsy said, workers are scoped to a specific type of infrastructure and offer more customization compared to agents. Workers also have more observability features compared to agents when used with Prefect Cloud.
👍 1
j
Ok so I tried a worker and I got exactly the same issue as the OP - it tried to create the Job in the default ns, but the service account only has permissions for the "prefect" namespace?
j
When you create the worker, it will default to the 'default' namespace; you need to change that to 'prefect' or the namespace you deployed the worker into.
Apologies that I haven’t been following this whole thread but I’ve created an issue for the default namespace problem here: https://github.com/PrefectHQ/prefect/issues/9845 This is something we can improve :)
r
@alex will agents be deprecated/removed any time soon, or are we safe for a while?
a
We have a 6 month deprecation period before removing functionality, so there will be plenty of notice before agents are removed.
👍 1
j
It's not that the agent is in the wrong namespace; it's that the job that's created is in the wrong namespace?
I used the following raw manifests:
---
# Source: prefect-worker/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prefect-worker
  namespace: "prefect"
---
# Source: prefect-worker/templates/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-worker
  namespace: "prefect"
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "pods/status"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: [ "get", "list", "watch", "create", "update", "patch", "delete" ]
---
# Source: prefect-worker/templates/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prefect-worker
  namespace: "prefect"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-worker
subjects:
  - kind: ServiceAccount
    name: prefect-worker
    namespace: "prefect"
---
# Source: prefect-worker/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-worker
  namespace: "prefect"
  labels:
    app: prefect-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-worker
  template:
    metadata:
      labels:
        app: prefect-worker
    spec:
      serviceAccountName: prefect-worker
      securityContext:
        fsGroup: 1001
        runAsNonRoot: true
        runAsUser: 1001
      containers:
        - name: prefect-worker
          image: "prefecthq/prefect:2.10.12-python3.9-kubernetes"
          imagePullPolicy: IfNotPresent
          command:
            - prefect
            - worker
            - start
            - --type
            - kubernetes
            - --pool
            - default-agent-pool
            - --work-queue
            - default
          workingDir: /home/prefect
          env:
            - name: HOME
              value: /home/prefect
            - name: PREFECT_AGENT_PREFETCH_SECONDS
              value: "10"
            - name: PREFECT_AGENT_QUERY_INTERVAL
              value: "5"
            - name: PREFECT_API_ENABLE_HTTP2
              value: "true"
            - name: PREFECT_API_URL
              value: "<http://host.docker.internal:4200/api>"
            - name: PREFECT_KUBERNETES_CLUSTER_UID
              value: ""
            - name: PREFECT_DEBUG_MODE
              value: "false"
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 256Mi
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1001
          volumeMounts:
            - mountPath: /home/prefect
              name: scratch
              subPathExpr: home
            - mountPath: /tmp
              name: scratch
              subPathExpr: tmp
      volumes:
        - name: scratch
          emptyDir: {}
Working on a local minikube at the moment.
I assume this is the line that's causing the job to run in default? https://github.com/PrefectHQ/prefect-kubernetes/blob/8c33171a7dbe1e2cd304162fcd1331d48cb5248d/prefect_kubernetes/worker.py#L229 - I just have no idea how I'm supposed to override this. It should be an arg I can pass to the worker, no?
This is where the job gets created - https://github.com/PrefectHQ/prefect-kubernetes/blob/8c33171a7dbe1e2cd304162fcd1331d48cb5248d/prefect_kubernetes/worker.py#LL622C23-L622C57 - it passes in `configuration.namespace` - I need that to be different from 'default' 😞
Tried setting it here:
infra = KubernetesJob(
    image=f"{BASE_IMAGE}:production",
    image_pull_policy=IMAGE_PULL_POLICY,
    namespace="prefect",
)
Still get:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:prefect:prefect-worker\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"default\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
r
@Joshua Greenhalgh you will need a ClusterRole + ClusterRoleBinding
j
This is all in place - I have all the roles correct, but the roles grant permissions in the prefect namespace, which is where I want the jobs to run. The jobs are started in default, which the worker does not have permission to do.
r
ah ok, you might need an infra_overrides["namespace"] on the deployment
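(If I'm reading that suggestion right, it would look roughly like this - names are hypothetical, and this assumes Prefect 2.10.x:)
```python
from prefect import flow
from prefect.deployments import Deployment

@flow
def my_flow():
    print("hello")

# Worker path: the job template comes from the work pool; per-deployment
# values such as the namespace are supplied via infra_overrides.
Deployment.build_from_flow(
    flow=my_flow,
    name="worker-based",                       # hypothetical deployment name
    work_pool_name="default-agent-pool",       # pool from the manifest above
    infra_overrides={"namespace": "prefect"},  # override the template default
    apply=True,
)
```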
j
Yeah, that did something, but still not quite there... 😞 - thanks - I'm really finding this move to V2 very, very difficult.
r
show me your deployment if you can
j
@redsquare - I'll set up a mini repo and share.
👍 1
r
Looks the same as mine - if you add output = 'deployment_build_output.yaml' to the deployment, do you see the correct namespace in the generated file?
j
I think I have half solved it (more issues though 😞), but I feel this is completely undocumented (outside of looking at the source: https://github.com/PrefectHQ/prefect-kubernetes/blob/8c33171a7dbe1e2cd304162fcd1331d48cb5248d/prefect_kubernetes/worker.py#L684). So the manifests come from the helm chart, basically - if you don't specify "PREFECT_KUBERNETES_CLUSTER_UID", it attempts to construct a UID from something that lives in the kube-system namespace - but the manifests do not construct a role that can read that namespace - so I have just generated a uuid and set that value in the manifest. I am hoping it just needs to be any unique ID? Now the pod runs! However, it can't find my flow inside the container, but I think I can probably work this out...
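(Generating a stand-in UID like the one described is a one-liner; this assumes, as confirmed below, that the value only needs to be unique per cluster:)
```python
import uuid

# Paste the output into the PREFECT_KUBERNETES_CLUSTER_UID env var
# in the worker manifest.
print(uuid.uuid4())
```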
r
cool - yeah, your path is probably wrong
j
Thanks for the help!
r
good luck with it all
j
The background is that we need a way to uniquely identify clusters in order to support cancellation, but there's no great general way to do that in Kubernetes. The idea to use the UID of the kube-system namespace came from Tim Hockin (one of the early Kubernetes maintainers, who did a lot of the networking stuff). We do the lookup during helm install time to avoid needing to grant the service account cluster-wide read permissions on namespaces, which would be necessary for the code-based lookup to work.

Something we didn't anticipate when implementing this feature is that some systems like ArgoCD, or running `helm template`, won't actually run the lookup, but also won't emit an error - it'll just return an empty value instead. So the worker has no choice but to try to look up the UID at runtime, which fails due to missing permissions, which we also don't want to add (by default we don't add a ClusterRole or ClusterRoleBinding for the worker service account). This is definitely a decision we should revisit, though! Sorry about the rough edges here.
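(For anyone curious, the runtime lookup described above amounts to reading the UID of the kube-system namespace; a minimal sketch with the kubernetes Python client, assuming a service account that can read that namespace:)
```python
from kubernetes import client, config

config.load_incluster_config()  # use load_kube_config() outside the cluster
v1 = client.CoreV1Api()

# The kube-system namespace UID is stable for the life of a cluster,
# which is why it can serve as a cluster identifier.
print(v1.read_namespace("kube-system").metadata.uid)
```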
j
NP - it just needs to be some random ID then? So a self-generated uuid4 is fine?
And yeah, I used template to get the manifests - I don't really want to have to hook up helm to my terraform setup.
j
Yeah, I think it just needs to be unique for each cluster you’re running an agent in
j
thanks!
j
I wrote up an issue here; however, I think it's low priority because the lookup works alright for most users. Feel free to add a comment if you disagree - it helps us prioritize: https://github.com/PrefectHQ/prefect/issues/9851
j
All I would add is the k8s API error to the issue - then it's probably findable:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"namespaces \"kube-system\" is forbidden: User \"system:serviceaccount:prefect:prefect-worker\" cannot get resource \"namespaces\" in API group \"\" in the namespace \"kube-system\"","reason":"Forbidden","details":{"name":"kube-system","kind":"namespaces"},"code":403}
👍 1