# prefect-cloud
k
Hi team, we're encountering an error with our Prefect flows deployed in our Kubernetes cluster using Prefect Cloud. Here are the details:

Environment:
- Prefect Cloud
- Kubernetes cluster for flow deployment

YAML Configuration:
spec:
      serviceAccountName: maestro
      containers:
        - image: docker-local.artifactory.internal/prefect
          name: maestro
          resources:
            requests:
              memory: 2Gi #TODO
              cpu: 1000m #TODO
            limits:
              memory: 3Gi #TODO
              cpu: 4000m #TODO
          command: ["/bin/bash", "-c"]
          args:
            - |
              prefect config set PREFECT_API_URL=$PREFECT_API_URL
              prefect config set PREFECT_API_KEY=$PREFECT_API_KEY
              prefect deploy --all
              prefect worker start --pool <pool_name>
Error:
Worker 'KubernetesWorker 9cac12a5-741d-4150-b822-f042f1e07392' submitting flow run '1ea9b81d-09ca-4136-bee6-8bb855062056'
09:19:54 PM
prefect.flow_runs.worker
Creating Kubernetes job...
09:19:54 PM
prefect.flow_runs.worker
Failed to submit flow run '1ea9b81d-09ca-4136-bee6-8bb855062056' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
                ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/ssl_.py", line 453, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/ssl_.py", line 495, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/ssl.py", line 455, in wrap_socket
    return self.sslsocket_class._create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/ssl.py", line 1042, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.12/ssl.py", line 1320, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/prefect/workers/base.py", line 896, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/prefect_kubernetes/worker.py", line 578, in run
    job = await run_sync_in_worker_thread(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/prefect/utilities/asyncutils.py", line 95, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/prefect_kubernetes/worker.py", line 793, in _create_job
    job = batch_client.create_namespaced_job(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes/client/rest.py", line 172, in request
    r = self.pool_manager.request(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/request.py", line 81, in request
    return self.request_encode_body(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/request.py", line 173, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.12/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
                ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/ssl_.py", line 453, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/ssl_.py", line 495, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/ssl.py", line 455, in wrap_socket
    return self.sslsocket_class._create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/ssl.py", line 1042, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.12/ssl.py", line 1320, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
09:20:29 PM
prefect.flow_runs.worker
Completed submission of flow run '1ea9b81d-09ca-4136-bee6-8bb855062056'
09:20:29 PM
prefect.flow_runs.worker
Reported flow run '1ea9b81d-09ca-4136-bee6-8bb855062056' as crashed: Flow run could not be submitted to infrastructure
Has anyone encountered this issue or have suggestions on how to troubleshoot? We've double-checked our API key and URL, but the error persists. Any help would be greatly appreciated!
k
The trace indicates this error occurred in create_namespaced_job, where the worker is trying to communicate with the Kubernetes API in your cluster. Did this happen one time, or on every flow run?
k
Hi Kevin, it happens on every flow run
k
and your worker is deployed in your cluster, trying to create jobs in the correct namespace with the correct permissions?
k
Yes, the worker is deployed in our own cluster. Can you please let me know how I can verify that the worker has the correct permissions in the correct namespace? Below is the command I executed, but I couldn't find anything related to the permissions assigned to the worker: prefect work-pool inspect omnibus-data-ops-pool
I can share the complete kube runner YAML file if required.
k
kubectl describe deployment prefect-worker -n prefect
The output should include the name of a service account that's attached to the worker deployment. Don't post the output here, as it may contain API keys.
k
Sure, Thanks!!
k
once you have the service account name, you should be able to check its permissions. One other thing, did this just start happening recently or has this been the case since you started your worker?
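For checking those permissions, one option is a kubectl auth can-i query against the worker's service account (a sketch; replace the placeholders with your actual namespace and service account name):

# can the worker's service account create jobs and watch pods in its namespace?
kubectl auth can-i create jobs -n <namespace> --as=system:serviceaccount:<namespace>:<service-account-name>
kubectl auth can-i watch pods -n <namespace> --as=system:serviceaccount:<namespace>:<service-account-name>

Each command simply prints yes or no.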
k
It's been happening since I started the worker a few days back.
We were upgrading to Prefect 2 from Prefect 1.
k
ah I see. sounds like there is most likely something wrong with the worker deployment in kubernetes then. did you use the helm chart?
k
We used kubectl commands to deploy, not the Helm chart.
We have a runner YAML file with the configuration and deploy it using Kustomize.
I executed the command below; the kind is set to StatefulSet instead of Deployment. When I look at the UI, the namespace is set to default.
kubectl describe statefulset runner -n maestro
Output:
Name:               runner
Namespace:          maestro
CreationTimestamp:  Mon, 08 Jul 2024 13:10:49 -0600
Selector:           app=maestro,process=runner
Labels:             app=maestro
                    app.kubernetes.io/instance=maestro
                    process=runner
Annotations:        <none>
Replicas:           1 desired | 1 total
Update Strategy:    RollingUpdate
  Partition:        0
Pods Status:        1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           admission.datadoghq.com/enabled=true
                    app=maestro
                    envoy-flavor=mx
                    xxxxx.com/patch-ndots=true
                    process=runner
                    tags.xxxxxxxx.com/service=maestro
                    tags.xxxxxxxx.com/version=52e9b76806e0152534af0a9d99c457d864a86d5f
                    version=stable
  Annotations:      co.elastic.logs/enabled: true
                    proxy.istio.io/config: {"holdApplicationUntilProxyStarts": true}
  Service Account:  maestro
  Containers:
   maestro:
    Image:      docker-local.artifactory.internal/maestro-reprise:52e9b76806e0152534af0a9d99c457d864a86d5f
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
    Args:
      prefect config set PREFECT_API_URL=$PREFECT_API_URL
      prefect config set PREFECT_API_KEY=$PREFECT_API_KEY
      prefect deploy --all
      prefect worker start --pool omnibus-data-ops-pool
      
    Limits:
      cpu:     4
      memory:  3Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment Variables from:
      env-4tg25mhd4b  ConfigMap  Optional: false
      env-542m7b2gm6  Secret     Optional: false
    Environment:      <none>
    Mounts:
      /srv/prefect/tmp from maestro-runner-persistent-storage (rw)
  Volumes:  <none>
Volume Claims:
  Name:          maestro-runner-persistent-storage
  StorageClass:  xfs-rook-block
  Labels:        <none>
  Annotations:   <none>
  Capacity:      50Gi
  Access Modes:  [ReadWriteOnce]
Events:          <none>
@Kevin Grismore Please let me know if you need any additional information. Thanks
k
this looks pretty typical to me
I think my primary suggestion at the moment is to refer to the Helm chart for the role, role binding, and service account that should be attached to the worker: https://github.com/PrefectHQ/prefect-helm/tree/main/charts/prefect-worker
even if you don't use it directly, you should probably try imitating all the components it represents
you can see in the role which permissions on each api group the worker needs to do its job successfully
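Put together, the pieces from that chart look roughly like this (a minimal sketch with illustrative names, not the chart's exact manifests; the key detail is that the RoleBinding's subject names the same ServiceAccount the worker pod actually runs as):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prefect-worker
  namespace: maestro
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prefect-worker
  namespace: maestro
rules:
  # read access to pods, their logs/status, and events
  - apiGroups: [""]
    resources: ["events", "pods", "pods/log", "pods/status"]
    verbs: ["get", "watch", "list"]
  # full management of the batch jobs the worker creates for each flow run
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prefect-worker
  namespace: maestro
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-worker
subjects:
  # must name the ServiceAccount the worker pod uses (serviceAccountName)
  - kind: ServiceAccount
    name: prefect-worker
    namespace: maestro

The worker's Deployment or StatefulSet would then set serviceAccountName to that same ServiceAccount.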
k
Thanks Kevin, I will try to replicate that and will let you know.
@Kevin Grismore, still facing the connection issue. Please find the YAML files I have created and the base job template attached below. Is there any misconfiguration you can point out, or anything I'm missing? Thanks.
role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: runner
  namespace: maestro
rules:
- apiGroups: [""]
  resources: ["events", "pods", "pods/log", "pods/status"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: runner
  namespace: maestro
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: runner
subjects:
- kind: ServiceAccount
  name: maestro
  namespace: maestro
service_account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner
  namespace: maestro
deployment_runner.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: runner
  labels:
    app: maestro
    process: runner
  annotations:
spec:
  replicas: 0
  selector:
    matchLabels:
      app: maestro
      process: runner
  template:
    metadata:
      labels:
        admission.datadoghq.com/enabled: "true"
        app: maestro
        envoy-flavor: mx
        process: runner
        version: stable
        mx.com/patch-ndots: "true"
      annotations:
        co.elastic.logs/enabled: "true"
        proxy.istio.io/config: '{"holdApplicationUntilProxyStarts": true}'
    spec:
      serviceAccountName: maestro
      containers:
        - image: docker/maestro-reprise
          name: maestro
          resources:
            requests:
              memory: 2Gi #TODO
              cpu: 1000m #TODO
            limits:
              memory: 3Gi #TODO
              cpu: 4000m #TODO
          command: ["/bin/bash", "-c"]
          args:
            - |
              prefect config set PREFECT_API_URL=$PREFECT_API_URL
              prefect config set PREFECT_API_KEY=$PREFECT_API_KEY
              prefect deploy --all
              prefect worker start --pool omnibus-data-ops-pool
          readinessProbe:
            #TODO
          env:
          envFrom:
            - configMapRef:
                name: env
            - secretRef:
                name: env
          volumeMounts:
            - name: maestro-runner-persistent-storage
              mountPath: /srv/prefect/tmp
  volumeClaimTemplates:
    - metadata:
        name: maestro-runner-persistent-storage
      spec:
        storageClassName: xfs-rook-block
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 50Gi
base_job_template
WorkPool(
    id='8d959567-796f-437e-b0fa-aef3d8f13e49',
    created=DateTime(2024, 7, 3, 20, 28, 7, 22301, tzinfo=FixedTimezone(0, name="+00:00")),
    updated=DateTime(2024, 7, 12, 18, 36, 15, 78634, tzinfo=FixedTimezone(0, name="+00:00")),
    name='omnibus-data-ops-pool',
    type='kubernetes',
    base_job_template={
        'variables': {
            'type': 'object',
            'properties': {
                'env': {
                    'type': 'object',
                    'title': 'Environment Variables',
                    'description': 'Environment variables to set when starting a flow run.',
                    'additionalProperties': {'type': 'string'}
                },
                'name': {'type': 'string', 'title': 'Name', 'default': 'runner', 'description': 'Name given to infrastructure created by a worker.'},
                'image': {
                    'type': 'string',
                    'title': 'Image',
                    'example': 'docker.io/prefecthq/prefect:2-latest',
                    'description': 'The image reference of a container image to use for created jobs. If not set, the latest Prefect image will be used.'
                },
                'labels': {'type': 'object', 'title': 'Labels', 'description': 'Labels applied to infrastructure created by a worker.', 'additionalProperties': {'type': 'string'}},
                'command': {
                    'type': 'string',
                    'title': 'Command',
                    'description': 'The command to use when starting a flow run. In most cases, this should be left blank and the command will be automatically generated by the worker.'
                },
                'namespace': {'type': 'string', 'title': 'Namespace', 'default': 'maestro', 'description': 'The Kubernetes namespace to create jobs within.'},
                'stream_output': {'type': 'boolean', 'title': 'Stream Output', 'default': True, 'description': 'If set, output will be streamed from the job to local standard output.'},
                'cluster_config': {
                    'allOf': [{'$ref': '#/definitions/KubernetesClusterConfig'}],
                    'title': 'Cluster Config',
                    'description': 'The Kubernetes cluster config to use for job creation.'
                },
                'finished_job_ttl': {
                    'type': 'integer',
                    'title': 'Finished Job TTL',
                    'description': 'The number of seconds to retain jobs after completion. If set, finished jobs will be cleaned up by Kubernetes after the given delay. If not set, jobs will be retained indefinitely.'
                },
                'image_pull_policy': {
                    'enum': ['IfNotPresent', 'Always', 'Never'],
                    'type': 'string',
                    'title': 'Image Pull Policy',
                    'default': 'IfNotPresent',
                    'description': 'The Kubernetes image pull policy to use for job containers.'
                },
                'service_account_name': {'type': 'string', 'title': 'Service Account Name', 'description': 'The Kubernetes service account to use for job creation.'},
                'job_watch_timeout_seconds': {
                    'type': 'integer',
                    'title': 'Job Watch Timeout Seconds',
                    'description': 'Number of seconds to wait for each event emitted by a job before timing out. If not set, the worker will wait for each event indefinitely.'
                },
                'pod_watch_timeout_seconds': {
                    'type': 'integer',
                    'title': 'Pod Watch Timeout Seconds',
                    'default': 60,
                    'description': 'Number of seconds to watch for pod creation before timing out.'
                }
            },
            'definitions': {
                'KubernetesClusterConfig': {
                    'type': 'object',
                    'title': 'KubernetesClusterConfig',
                    'required': ['config', 'context_name'],
                    'properties': {
                        'config': {'type': 'object', 'title': 'Config', 'description': 'The entire contents of a kubectl config file.'},
                        'context_name': {'type': 'string', 'title': 'Context Name', 'description': 'The name of the kubectl context to use.'}
                    },
                    'description': 'Stores configuration for interaction with Kubernetes clusters.\n\nSee `from_file` for creation.',
                    'secret_fields': [],
                    'block_type_slug': 'kubernetes-cluster-config',
                    'block_schema_references': {}
                }
            },
            'description': 'Default variables for the Kubernetes worker.\n\nThe schema for this class is used to populate the `variables` section of the default\nbase job template.'
        },
        'job_configuration': {
            'env': '{{ env }}',
            'name': '{{ name }}',
            'labels': '{{ labels }}',
            'command': '{{ command }}',
            'namespace': '{{ namespace }}',
            'job_manifest': {
                'kind': 'Job',
                'spec': {
                    'template': {
                        'spec': {
                            'containers': [{'env': '{{ env }}', 'args': '{{ command }}', 'name': 'prefect-job', 'image': '{{ image }}', 'imagePullPolicy': '{{ image_pull_policy }}'}],
                            'completions': 1,
                            'parallelism': 1,
                            'restartPolicy': 'Never',
                            'serviceAccountName': '{{ service_account_name }}'
                        }
                    },
                    'backoffLimit': 0,
                    'ttlSecondsAfterFinished': '{{ finished_job_ttl }}'
                },
                'metadata': {'labels': '{{ labels }}', 'namespace': '{{ namespace }}', 'generateName': '{{ name }}-'},
                'apiVersion': 'batch/v1'
            },
            'stream_output': '{{ stream_output }}',
            'cluster_config': '{{ cluster_config }}',
            'job_watch_timeout_seconds': '{{ job_watch_timeout_seconds }}',
            'pod_watch_timeout_seconds': '{{ pod_watch_timeout_seconds }}'
        }
    },
    concurrency_limit=5,
    status=WorkPoolStatus.READY,
    default_queue_id='70e699f3-4483-48be-aa3b-e2c5a4bb2573'
)
prefect.yaml
definitions:
    work_pools:
        work_pool: &work_pool
            name: test-omnibus-data-ops-pool

deployments:

- name: test_flows
  entrypoint: flows/test.py:run_test
  schedules:
    - cron: "15 * * * *" # At 15 minutes past the hour, every hour, every day
  work_pool: *work_pool
k
it seems like the role is not bound to the correct service account from the rolebinding:
subjects:
- kind: ServiceAccount
  name: maestro
  namespace: maestro
but the service account has a different name:
metadata:
  name: runner
  namespace: maestro
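In other words, all three have to agree. One way to line them up (a sketch), keeping the maestro account that the StatefulSet already runs as:

# service_account.yaml — rename so it matches both the RoleBinding subject
# and the StatefulSet's serviceAccountName
apiVersion: v1
kind: ServiceAccount
metadata:
  name: maestro
  namespace: maestro

(The alternative is to keep the ServiceAccount named runner and change the RoleBinding subject and the serviceAccountName to runner instead.)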
k
Thanks, trying the above changes.
Still the same issue.
kubectl get role runner -n maestro -o yaml
output:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"maestro"},"name":"runner","namespace":"maestro"},"rules":[{"apiGroups":[""],"resources":["events","pods","pods/log","pods/status"],"verbs":["get","watch","list"]},{"apiGroups":["batch"],"resources":["jobs"],"verbs":["get","list","watch","create","update","patch","delete"]}]}
  creationTimestamp: "2024-07-12T15:13:39Z"
  labels:
    app.kubernetes.io/instance: maestro
  name: runner
  namespace: maestro
  resourceVersion: "1167285778"
  uid: 06d5ef39-1dc0-4a19-9773-fbd8e2a1e916
rules:
- apiGroups:
  - ""
  resources:
  - events
  - pods
  - pods/log
  - pods/status
  verbs:
  - get
  - watch
  - list
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
kubectl get serviceaccount maestro -n maestro -o yaml
Output:
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"labels":{"account":"maestro","app.kubernetes.io/instance":"maestro","datadog":"consumer","dmzegress":"consumer","dmzproxy":"consumer","netpolset":"app-default","pgproxy":"consumer","provider":"maestro","vertica":"consumer"},"name":"maestro","namespace":"maestro"}}
  creationTimestamp: "2024-06-25T20:40:55Z"
  labels:
    account: maestro
    app.kubernetes.io/instance: maestro
    datadog: consumer
    dmzegress: consumer
    dmzproxy: consumer
    netpolset: app-default
    pgproxy: consumer
    provider: maestro
    vertica: consumer
  name: maestro
  namespace: maestro
  resourceVersion: "1148806698"
  uid: 13fdad24-a527-4373-99ab-45a28032e630
kubectl get rolebindings -n maestro -o yaml
Output:
apiVersion: v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"RoleBinding","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"namespaces","sand-security":"open"},"name":"application-core:teleport-application-role","namespace":"maestro"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"teleport-application-role"},"subjects":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"application-core"},{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:authenticated"}]}
    creationTimestamp: "2024-06-25T20:41:17Z"
    labels:
      app.kubernetes.io/instance: namespaces
      sand-security: open
    name: application-core:teleport-application-role
    namespace: maestro
    resourceVersion: "1146528188"
    uid: da570fc7-9966-48e1-a5f5-38646dafdc7a
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: teleport-application-role
  subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: application-core
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:authenticated
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"RoleBinding","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"maestro"},"name":"runner","namespace":"maestro"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"runner"},"subjects":[{"kind":"ServiceAccount","name":"maestro","namespace":"maestro"}]}
    creationTimestamp: "2024-07-12T15:13:39Z"
    labels:
      app.kubernetes.io/instance: maestro
    name: runner
    namespace: maestro
    resourceVersion: "1167594198"
    uid: f658589d-b219-4b79-80e7-61b284745353
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: runner
  subjects:
  - kind: ServiceAccount
    name: maestro
    namespace: maestro
kind: List
metadata:
  resourceVersion: ""
I analyzed the following, and I'm not sure what I'm missing:
1. Role "runner":
   ◦ has appropriate permissions for pods and jobs
   ◦ allows creating, updating, and deleting jobs
2. ServiceAccount "maestro":
   ◦ exists in the "maestro" namespace
   ◦ has various labels attached
3. RoleBinding "runner":
   ◦ correctly binds the "runner" Role to the "maestro" ServiceAccount
   ◦ is in the correct namespace (maestro)
The RBAC (Role-Based Access Control) setup appears to be correct, and the "maestro" ServiceAccount should have the necessary permissions to create and manage jobs in the "maestro" namespace. @Kevin Grismore, can you please point out what I'm missing here? Appreciate your help on this. Thanks!!
Below is the egress permission we have added. Do you think we need to add ingress as well?
metadata:
  name: egress-routes
spec:
  routes:
    - host: api.prefect.cloud
      port: 443
k
The error you initially reported happens when the worker tries to create a job inside the designated namespace, so I don't think it's related to egress. However, it is true that egress on 443 is required. No ingress is required for a worker.
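For reference, in a cluster that uses plain Kubernetes NetworkPolicies (rather than the custom egress routes shown above), an outbound-443 rule for the worker pod might look roughly like the following. This is only a generic sketch, not specific to your setup:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-worker-egress-443
  namespace: maestro
spec:
  # select the worker pod by its existing labels
  podSelector:
    matchLabels:
      process: runner
  policyTypes:
    - Egress
  egress:
    # allow outbound HTTPS to any destination (e.g. Prefect Cloud)
    - ports:
        - protocol: TCP
          port: 443
  # note: once a pod is selected by an egress policy, anything not listed is
  # blocked, so a real policy would usually also allow DNS (port 53)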
k
Got it, we have enabled egress, so that's not causing the issue.
We've tried all the role and role binding permissions for the service account; not sure where we are going wrong.