Marwan Sarieddine

05/20/2020, 9:37 PM
Hello everyone - I have been struggling with DaskKubernetesEnvironment and setting imagePullSecrets to pull images from our private GitLab container registry. Note I am using a Kubernetes Agent polling from Prefect Cloud, on Prefect version 0.11.2. A couple of questions:
1. I thought that if I created the docker-registry secret with kubernetes, added imagePullSecrets to the podSpec, and passed a custom scheduler spec file and worker spec file to DaskKubernetesEnvironment, this wouldn’t rely on Prefect secrets and should work fine - but I am getting an empty dict for imagePullSecrets when inspecting the job and pod specs. I thought the Kubernetes agent might be overwriting them (I saw the replace_job_spec_yaml method), so I also specified the secret name in the Kubernetes agent manifest under the IMAGE_PULL_SECRETS environment variable - but I am still getting an empty dict for imagePullSecrets. Any idea why?
2. The other approach seems to be to skip the worker and scheduler spec files, set private_registry to True and docker_secret to the name of a Prefect Secret, then use the client to create that Prefect Secret with a name and value. I don’t see where in the code the Prefect Secret’s value is read and a Kubernetes secret created. I tried both a dictionary of docker-server, docker-username, docker-password, and docker-email, and just setting the value to the name of the k8s secret I created - neither worked, and I am still getting an empty dict for imagePullSecrets.
Any idea what might be going on here? And what is the best practice for setting k8s imagePullSecrets for DaskKubernetesEnvironment?
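For reference, here is roughly what my flow setup looks like (a minimal sketch against the Prefect 0.11.x API - the file names and the Prefect Secret name are placeholders):
Copy code
from prefect import Client, Flow
from prefect.environments import DaskKubernetesEnvironment

# Approach 1: custom specs whose podSpec carries imagePullSecrets
env = DaskKubernetesEnvironment(
    scheduler_spec_file="scheduler.yaml",
    worker_spec_file="worker.yaml",
)

# Approach 2: let the environment handle registry auth from a Prefect Secret
# env = DaskKubernetesEnvironment(
#     private_registry=True,
#     docker_secret="GITLAB_REGISTRY_CREDS",  # placeholder Prefect Secret name
# )
# Client().set_secret(
#     name="GITLAB_REGISTRY_CREDS",
#     value={"docker-server": "...", "docker-username": "...",
#            "docker-password": "...", "docker-email": "..."},
# )

flow = Flow("dask-k8s-flow", environment=env)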
josh

05/20/2020, 9:45 PM
Hey @Marwan Sarieddine I actually think the private_registry option on that environment may be outdated/needs to be deprecated 🤔 will look into that. I feel like the environment should instead take the name of an image pull secret rather than trying to create one from the contents of a Prefect secret. On your first point, when you say you get the empty dict for imagePullSecrets - I’m not sure where that would be overwritten, since the environment uses the yaml you provide. The image pull secrets option on the Kubernetes agent applies only to the first prefect-job that is created and isn’t propagated down to the dask scheduler/workers. Let’s try to troubleshoot. First question: are you seeing the initial prefect-job being created, and then the following dask scheduler job/pod?
Marwan Sarieddine

05/20/2020, 9:56 PM
Hi @josh thank you for getting back to me so promptly. Noted about the private_registry option. Yes, I am seeing a prefect-job being created. To be specific, here is how my k8s resources look:
Copy code
$ kubectl get pods,jobs
NAME                                 READY   STATUS         RESTARTS   AGE
pod/prefect-agent-5f6458886d-z4btq   2/2     Running        0          35m
pod/prefect-job-995c4982-5jzzf       0/1     ErrImagePull   0          6s

NAME                             COMPLETIONS   DURATION   AGE
job.batch/prefect-job-995c4982   0/1           6s         6s
josh

05/20/2020, 9:58 PM
If you both add the name of your image pull secret to the Kubernetes agent under IMAGE_PULL_SECRETS and keep the imagePullSecrets entry in the yaml of your environment’s scheduler and worker, does it start working?
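In the meantime, a quick sanity check you could run against the spec files you hand to the environment (illustrative - adjust the paths to your actual file names):
Copy code
import sys

# Confirm the specs passed to DaskKubernetesEnvironment actually
# mention imagePullSecrets before (re)registering the flow
for path in ("scheduler.yaml", "worker.yaml"):
    if "imagePullSecrets" not in open(path).read():
        sys.exit(f"{path} is missing imagePullSecrets")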
Marwan Sarieddine

05/20/2020, 10:02 PM
that’s what I am currently doing - but it doesn’t seem to be working …
Copy code
$ kubectl get pod prefect-job-995c4982-5jzzf -o yaml | yq r - "spec.imagePullSecrets"
- {}
josh

05/20/2020, 10:02 PM
Can I see the part of your agent deployment yaml that has IMAGE_PULL_SECRETS set?
Marwan Sarieddine

05/20/2020, 10:03 PM
sure
Copy code
containers:
      - args:
        - prefect agent start kubernetes
        command:
        - /bin/bash
        - -c
        env:
        - name: PREFECT__CLOUD__API
          value: https://api.prefect.io
        - name: NAMESPACE
          value: default
        - name: IMAGE_PULL_SECRETS
          value: gitlab-secret
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '[]'
        - name: JOB_MEM_REQUEST
          value: 256Mi
        - name: JOB_MEM_LIMIT
          value: 512Mi
        - name: JOB_CPU_REQUEST
          value: 500m
        - name: JOB_CPU_LIMIT
          value: 1000m
        - name: PREFECT__BACKEND
          value: cloud
        - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
          value: http://:8080
        image: prefecthq/prefect:0.11.2-python3.6
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /api/health
            port: 8080
          initialDelaySeconds: 40
          periodSeconds: 40
        name: agent
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
(took out the auth token and other stuff that didn’t seem relevant)
josh

05/20/2020, 10:05 PM
And could you also show me the output from doing a kubectl describe on the prefect job? Also removing any auth info of course
Marwan Sarieddine

05/20/2020, 10:07 PM
Copy code
$ kubectl describe pod prefect-job-995c4982-5jzzf 
Name:           prefect-job-995c4982-5jzzf
Namespace:      default
Priority:       0
Node:           ip-192-168-79-244.us-west-2.compute.internal/192.168.79.244
Start Time:     Wed, 20 May 2020 17:55:39 -0400
Labels:         app=prefect-job-995c4982
                controller-uid=5af44c8e-3301-44d3-9890-ffae403ad426
                flow_run_id=868e1b00-8fd5-4394-a3d2-5a2412e2a373
                identifier=995c4982
                job-name=prefect-job-995c4982
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
IP:             192.168.80.39
IPs:            <none>
Controlled By:  Job/prefect-job-995c4982
Containers:
  flow:
    Container ID:  
    Image:         registry.gitlab.com/ifm-data-science/kubeflow-pipelines/mlops-examples/dask-k8s-flow:0.1.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      prefect execute cloud-flow
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:     500m
      memory:  256Mi
    Environment:
      PREFECT__CLOUD__API:                          https://api.prefect.io
      PREFECT__CONTEXT__FLOW_RUN_ID:                868e1b00-8fd5-4394-a3d2-5a2412e2a373
      PREFECT__CONTEXT__FLOW_ID:                    156ce5ef-53c3-4f61-9dcc-004cc890e141
      PREFECT__CONTEXT__NAMESPACE:                  default
      PREFECT__CLOUD__AGENT__LABELS:                []
      PREFECT__LOGGING__LOG_TO_CLOUD:               true
      PREFECT__CLOUD__USE_LOCAL_SECRETS:            false
      PREFECT__LOGGING__LEVEL:                      DEBUG
      PREFECT__ENGINE__FLOW_RUNNER__DEFAULT_CLASS:  prefect.engine.cloud.CloudFlowRunner
      PREFECT__ENGINE__TASK_RUNNER__DEFAULT_CLASS:  prefect.engine.cloud.CloudTaskRunner
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-s27ks (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-s27ks:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-s27ks
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                                                   Message
  ----     ------     ----                 ----                                                   -------
  Normal   Scheduled  11m                  default-scheduler                                      Successfully assigned default/prefect-job-995c4982-5jzzf to ip-192-168-79-244.us-west-2.compute.internal
  Normal   Pulling    9m25s (x4 over 10m)  kubelet, ip-192-168-79-244.us-west-2.compute.internal  Pulling image "registry.gitlab.com/xxxx"
  Warning  Failed     9m24s (x4 over 10m)  kubelet, ip-192-168-79-244.us-west-2.compute.internal  Failed to pull image "registry.gitlab.com/xxx": denied: access forbidden
  Warning  Failed     9m24s (x4 over 10m)  kubelet, ip-192-168-79-244.us-west-2.compute.internal  Error: ErrImagePull
  Warning  Failed     8m57s (x7 over 10m)  kubelet, ip-192-168-79-244.us-west-2.compute.internal  Error: ImagePullBackOff
  Normal   BackOff    54s (x40 over 10m)   kubelet, ip-192-168-79-244.us-west-2.compute.internal  Back-off pulling image "registry.gitlab.com/xxxx"
josh

05/20/2020, 10:10 PM
Could I also see the output of
Copy code
kubectl describe job prefect-job-995c4982
Marwan Sarieddine

05/20/2020, 10:11 PM
ah sorry
Copy code
$ kubectl describe job prefect-job-995c4982 
Name:           prefect-job-995c4982
Namespace:      default
Selector:       controller-uid=5af44c8e-3301-44d3-9890-ffae403ad426
Labels:         app=prefect-job-995c4982
                flow_id=156ce5ef-53c3-4f61-9dcc-004cc890e141
                flow_run_id=868e1b00-8fd5-4394-a3d2-5a2412e2a373
                identifier=995c4982
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Wed, 20 May 2020 17:55:39 -0400
Pods Statuses:  1 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=prefect-job-995c4982
           controller-uid=5af44c8e-3301-44d3-9890-ffae403ad426
           flow_run_id=868e1b00-8fd5-4394-a3d2-5a2412e2a373
           identifier=995c4982
           job-name=prefect-job-995c4982
  Containers:
   flow:
    Image:      registry.gitlab.com/xxxx
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
    Args:
      prefect execute cloud-flow
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:     500m
      memory:  256Mi
    Environment:
      PREFECT__CLOUD__API:                          https://api.prefect.io
      PREFECT__CONTEXT__FLOW_RUN_ID:                868e1b00-8fd5-4394-a3d2-5a2412e2a373
      PREFECT__CONTEXT__FLOW_ID:                    156ce5ef-53c3-4f61-9dcc-004cc890e141
      PREFECT__CONTEXT__NAMESPACE:                  default
      PREFECT__CLOUD__AGENT__LABELS:                []
      PREFECT__LOGGING__LOG_TO_CLOUD:               true
      PREFECT__CLOUD__USE_LOCAL_SECRETS:            false
      PREFECT__LOGGING__LEVEL:                      DEBUG
      PREFECT__ENGINE__FLOW_RUNNER__DEFAULT_CLASS:  prefect.engine.cloud.CloudFlowRunner
      PREFECT__ENGINE__TASK_RUNNER__DEFAULT_CLASS:  prefect.engine.cloud.CloudTaskRunner
    Mounts:                                         <none>
  Volumes:                                          <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  15m   job-controller  Created pod: prefect-job-995c4982-5jzzf
josh

05/20/2020, 10:13 PM
All good! One more thing, could I see the describe of the agent deployment? Same thing with no auth
Marwan Sarieddine

05/20/2020, 10:13 PM
Copy code
$ kubectl describe deployments prefect-agent 
Name:                   prefect-agent
Namespace:              default
CreationTimestamp:      Wed, 20 May 2020 13:13:47 -0400
Labels:                 app=prefect-agent
Annotations:            deployment.kubernetes.io/revision: 6
Selector:               app=prefect-agent
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=prefect-agent
  Containers:
   agent:
    Image:      prefecthq/prefect:0.11.2-python3.6
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
    Args:
      prefect agent start kubernetes
    Limits:
      cpu:     100m
      memory:  128Mi
    Liveness:  http-get http://:8080/api/health delay=40s timeout=1s period=40s #success=1 #failure=2
    Environment:
      PREFECT__CLOUD__API:                   https://api.prefect.io
      NAMESPACE:                             default
      IMAGE_PULL_SECRETS:                    gitlab-secret
      PREFECT__CLOUD__AGENT__LABELS:         []
      JOB_MEM_REQUEST:                       256Mi
      JOB_MEM_LIMIT:                         512Mi
      JOB_CPU_REQUEST:                       500m
      JOB_CPU_LIMIT:                         1000m
      PREFECT__BACKEND:                      cloud
      PREFECT__CLOUD__AGENT__AGENT_ADDRESS:  http://:8080
    Mounts:                                  <none>
   resource-manager:
    Image:      prefecthq/prefect:0.11.2-python3.6
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
    Args:
      python -c 'from prefect.agent.kubernetes import ResourceManager; ResourceManager().start()'
    Limits:
      cpu:     100m
      memory:  128Mi
    Environment:
      PREFECT__CLOUD__API:                                     https://api.prefect.io
      PREFECT__CLOUD__AGENT__RESOURCE_MANAGER__LOOP_INTERVAL:  60
      NAMESPACE:                                               default
    Mounts:                                                    <none>
  Volumes:                                                     <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   prefect-agent-67b846cc76 (1/1 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  53m    deployment-controller  Scaled up replica set prefect-agent-5f6458886d to 1
  Normal  ScalingReplicaSet  53m    deployment-controller  Scaled down replica set prefect-agent-587c5d8cdd to 0
  Normal  ScalingReplicaSet  11m    deployment-controller  Scaled up replica set prefect-agent-d6668bb6d to 1
  Normal  ScalingReplicaSet  11m    deployment-controller  Scaled down replica set prefect-agent-5f6458886d to 0
  Normal  ScalingReplicaSet  7m57s  deployment-controller  Scaled up replica set prefect-agent-67b846cc76 to 1
  Normal  ScalingReplicaSet  7m52s  deployment-controller  Scaled down replica set prefect-agent-d6668bb6d to 0
josh

05/20/2020, 10:17 PM
And just to be certain, the gitlab-secret exists, correct? 😄
Marwan Sarieddine

05/20/2020, 10:17 PM
lol yep
Copy code
$ kubectl get secret -o wide
NAME                  TYPE                                  DATA   AGE
default-token-s27ks   kubernetes.io/service-account-token   3      8h
gitlab-secret         kubernetes.io/dockerconfigjson        1      5h7m
josh

05/20/2020, 10:22 PM
Weird! Not sure how, but it looks like it isn’t attaching that image pull secret. The thing that makes it weird is that the code is pretty simple haha - it’s only:
Copy code
# Use image pull secrets if provided
job["spec"]["template"]["spec"]["imagePullSecrets"][0]["name"] = os.getenv("IMAGE_PULL_SECRETS", "")
And it looks like your agent does have that env var set. My last test would be to check whether you can create a pod that uses your image pull secret, with something like:
Copy code
apiVersion: v1
kind: Pod
metadata:
  name: private-reg
spec:
  containers:
  - name: private-reg-container
    image: <your-private-image>
  imagePullSecrets:
  - name: your-secret
If that pull works then there is a bug in the agent code!
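You could also inspect what the agent actually set on the job straight from the API (a sketch using the official kubernetes Python client - job name and namespace taken from your output above):
Copy code
from kubernetes import client, config

config.load_kube_config()
job = client.BatchV1Api().read_namespaced_job(
    "prefect-job-995c4982", "default"  # name/namespace from the describe above
)
# Expect something like [{'name': 'gitlab-secret'}];
# an empty entry means the agent never filled the name in
print(job.spec.template.spec.image_pull_secrets)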
Marwan Sarieddine

05/20/2020, 10:24 PM
ok will check that - one other thing I want to check is
Copy code
- name: IMAGE_PULL_SECRETS
          value: "gitlab-secret"
instead of
Copy code
- name: IMAGE_PULL_SECRETS
          value: gitlab-secret
just to be sure it’s not because of missing quotes
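(Though strictly speaking the quotes shouldn’t matter - in YAML a quoted and an unquoted scalar parse to the same string, e.g. with PyYAML:)
Copy code
import yaml

# Both forms yield {'value': 'gitlab-secret'}
assert yaml.safe_load('value: "gitlab-secret"') == yaml.safe_load("value: gitlab-secret")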
josh

05/20/2020, 10:25 PM
Awesome, if that is able to pull then could you open an issue with this bug?
Marwan Sarieddine

05/20/2020, 10:27 PM
sure - will do - thanks for the help debugging
@josh not sure what it was - but it is working now 🤷‍♂️ it is not the quotes - at least I don’t think so, because it says:
Copy code
deployment.apps/prefect-agent unchanged
when I add the quotes
josh

05/20/2020, 10:45 PM
Awesome!
Marwan Sarieddine

05/20/2020, 10:45 PM
(thanks again for the help - glad it is not a bug)
josh

05/20/2020, 10:46 PM
Anytime! Haha I'm glad it's not a bug as well 😉