
Gabriel Milan

02/24/2022, 9:41 PM
Hi everyone! I've got two agents on my k8s cluster, both of them are in the `prefect` namespace and work fine. I needed to add one more agent to the cluster, but this one should be in another namespace, say `prefect-agent-xxxx`. When I do this, I can successfully submit runs to it and they do get deployed, but they don't seem to actually run and no logs are shown. I've tried configuring the Apollo URL to `http://prefect-apollo.prefect.svc.cluster.local:4200` and also setting an `ExternalName` to it in the `prefect-agent-xxxx` namespace and using that, but neither works. Any ideas on how I could debug this?
:discourse: 1
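[Editor's note: one quick way to narrow down a cross-namespace issue like this is to verify that pods in the new namespace can reach the Apollo endpoint at all. The commands below are an illustrative sketch, assuming the service name from the message above; the `curlimages/curl` probe image is an assumption, any image with curl works.]

```shell
# Launch a throwaway pod in the new namespace and probe the Apollo service
# across namespaces using its fully qualified in-cluster DNS name
kubectl run apollo-probe --rm -it --restart=Never \
  -n prefect-agent-xxxx \
  --image=curlimages/curl -- \
  curl -sv http://prefect-apollo.prefect.svc.cluster.local:4200/graphql
```

If the probe hangs or the connection is refused, the problem is networking or policy between the namespaces rather than the agent configuration itself.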

Kevin Kho

02/24/2022, 9:53 PM
Maybe put debug level logs on the agent?

Gabriel Milan

02/24/2022, 9:55 PM
agent logs seem to be fine
it's just the jobs that keep "running" forever, and then I eventually get this in the UI

Kevin Kho

02/24/2022, 9:57 PM
Oh, I think debug level logs might give more insight, but that does look like it's just unable to get compute. Not sure how to debug off the top of my head; let's see if the community chimes in

Matthias

02/24/2022, 10:52 PM
Maybe try spinning up an agent with `--show-flow-logs`. It could give you more insight into the issue
:upvote: 1

Gabriel Milan

02/24/2022, 11:54 PM
this option doesn't work here, unfortunately; I'm launching the agent with
`prefect agent kubernetes start --job-template {{ .Values.agent.jobTemplateFilePath }}`

Matthias

02/25/2022, 6:11 AM
And why can't you add the additional argument? The job template only references the manifest of the job that the agent is supposed to submit.

Gabriel Milan

02/25/2022, 11:40 AM
because this is not actually an option for the Kubernetes agent

Matthias

02/25/2022, 7:00 PM
What you could do is change the deployment spec manually and deploy that one (just for debugging)

Gabriel Milan

02/25/2022, 7:56 PM
That's what I'm trying to do: I'm changing the deployment command to add the option you mentioned, but it doesn't work

Matthias

02/25/2022, 8:00 PM
Oh yeah, my mistake! It is not part of the agent code

Anna Geller

02/25/2022, 10:08 PM
I haven’t understood what exactly the issue is here - your flow run pods die instantly because they can’t talk to your Server API? It seems that you have successfully deployed your third agent to a separate namespace. Some questions:
Q1: What `label` did you assign to that agent? Your flow runs are correctly deployed and we can see that in the agent logs, so the label shouldn’t be the issue, but it's still worth sharing for debugging.
Q2: Can you inspect the flow run pods and check the logs there? You could check the pods in this namespace and inspect the Kubernetes jobs and pods deployed there. Are flow run and task run states getting updated in your Server backend? You could potentially check that in your Server logs somewhere.
Q3: You wrote “it doesn't seem to actually run and no logs are shown” - what doesn’t run? Do you mean you don’t see the flow run logs and updates being reflected in your Server UI?
Q4: Where did you configure your Server Apollo endpoint - did you set it in the agent manifest as an env variable, as shown here?
env:
  - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
    value: ''
  - name: PREFECT__CLOUD__API
    value: "http://prefect-apollo.prefect.svc.cluster.local:4200/graphql" # paste your GraphQL Server endpoint
  - name: PREFECT__BACKEND
    value: server
Q5: How did you configure the flow runs that got deployed to this agent (`KubernetesRun`)?
Q6: Didn’t you explicitly set the namespace when deploying the YAML file of the `KubernetesAgent`?
Some immediate ideas to check/inspect or try: I would recommend creating a manifest file using:
`prefect agent kubernetes install --rbac > third_agent.yaml`
Then adjusting the env variables as above and deploying it to the desired namespace this way:
`kubectl apply -f third_agent.yaml -n yournamespace`
Then all the flow run Kubernetes jobs should also be deployed to this namespace. After that, only networking and permission issues remain so that your flow run pods can talk to your Server in a separate namespace, and your Service with `ExternalName` seems like the right solution.
kind: Service
apiVersion: v1
metadata:
  name: server-third-agent
  namespace: yournamespace
spec:
  type: ExternalName
  externalName: prefect-apollo.prefect.svc.cluster.local
  ports:
  - port: 80
  - port: 443
  - port: 4200
I’m particularly guessing here when it comes to ports - I have no idea exactly which ports would need to be open, but I would like to open this up for discussion - it may be an issue with ports. Can you then also check the logs of your Server components and the service above to see whether there are any errors or missing permissions? Finally, I would check RBAC on your third agent. It may also be an issue of a missing `RoleBinding` to bind your third agent’s permissions to your Server’s namespace (or both namespaces):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: default # add your prefect-agent-xxxx here
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
:upvote: 1
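[Editor's note: the Q2 inspection steps above could look roughly like the following. Namespace, pod, and deployment names are placeholders inferred from this thread, not verified values.]

```shell
# List the flow-run jobs and pods the agent created in the new namespace
kubectl get jobs,pods -n prefect-agent-xxxx

# Inspect one stuck flow-run pod: the events section often reveals
# image-pull, env, scheduling, or network problems
kubectl describe pod <flow-run-pod-name> -n prefect-agent-xxxx
kubectl logs <flow-run-pod-name> -n prefect-agent-xxxx --all-containers

# Check the Server side as well, e.g. the Apollo component
# (deployment name assumed from the service name used above)
kubectl logs deploy/prefect-apollo -n prefect
```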

Gabriel Milan

02/25/2022, 11:02 PM
Before I proceed to the questions: no, the flow run pods don't die. They're there "running" forever.
Q1: the `datario` label
Q2: there are no logs whatsoever in the flow run pods. I can only see a change of state in the UI when the run is actually submitted, but nothing else. Where could I get Server logs?
Q3: yes, and there are also no logs on the pods themselves
Q4: the only env var you've shown that is not set on my agent is `PREFECT__CLOUD__AGENT__AUTH_TOKEN`, could that be a problem? All of my other agents work without it
Q5: I've set them using `flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value)`. This `constants.DOCKER_IMAGE.value` is a valid Docker image, the same one I'm using for other agents
Q6: I've deployed it by doing `helm upgrade --install prefect-agent -n <namespace> <mychart> -f values.yaml`. The chart I'm using is this one, and my `values.yaml` file looks like this:
agent:
  apollo_url: http://prefect-apollo.prefect.svc.cluster.local:4200/
  env: []
  image:
    name: prefecthq/prefect
    tag: 0.15.9
  job:
    resources:
      limits:
        cpu: ''
        memory: ''
      requests:
        cpu: ''
        memory: ''
  jobTemplateFilePath: myjobtemplateurl.yaml
  name: prefect-agent
  prefectLabels:
  - datario
  replicas: 1
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
  serviceAccountName: prefect-agent
the job template looks like this
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: flow
          envFrom:
            - secretRef:
                name: gcp-credentials
            - secretRef:
                name: vault-credentials
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /mnt/creds.json
          volumeMounts:
            - name: gcp-sa
              mountPath: /mnt/
              readOnly: true
      volumes:
        - name: gcp-sa
          secret:
            secretName: gcp-sa
and all of those secrets are properly configured. Finally, I just wanted to add that I'll check those steps you've mentioned and get back asap
Alright, I found it out. Turns out the "issue" was our Docker image for runs: as we're using linkerd for deploying agents on multiple k8s clusters, our image uses `linkerd-await` to block on linkerd readiness. This third agent was deployed in a non-linkerd-injected namespace, thus "awaiting" forever on readiness. That's why our run pods would never die, show logs, or update their state. After I injected the namespace with linkerd, everything works. Thank you so much for the effort in understanding our scenario and all the help!
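[Editor's note: for readers hitting the same problem, enabling linkerd injection for a namespace can be sketched as follows. The namespace name is a placeholder from this thread; `linkerd.io/inject: enabled` is the standard linkerd proxy-injector annotation.]

```shell
# Mark the namespace so the linkerd proxy injector adds the sidecar
# to any newly created pods (including agent-submitted flow-run pods)
kubectl annotate namespace prefect-agent-xxxx linkerd.io/inject=enabled

# Existing workloads must be restarted to pick up the sidecar
kubectl rollout restart deployment -n prefect-agent-xxxx
```

Without the sidecar present, `linkerd-await` in the flow-run image blocks forever waiting for proxy readiness, which matches the "running forever with no logs" symptom above.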

Anna Geller

02/26/2022, 1:19 AM
I'm glad you found it out, since I've never heard of linkerd 😅 It sounds like you spent a lot of time configuring all this, and you know a lot about managing Prefect Server with Kubernetes and Helm. Did you think about writing this up into a GitHub repo, README, or blog post? :) Absolutely no pressure, but if you'd like to share your setup, I'm sure many users could benefit from your knowledge in whichever form you'd choose. Even opening a topic on discourse.prefect.io with a couple of bullet points and code snippets might be insightful.
👍 2

Gabriel Milan

02/26/2022, 3:50 PM
This is something we're planning to do in the near future for the city hall. I'll be glad to translate it and share it with you then!
:upvote: 1
🙌 2