
Gabriel Milan

02/24/2022, 9:41 PM
Hi everyone! I've got two agents on my k8s cluster, both of them are in the `prefect` namespace and work fine. I needed to add one more agent to the cluster, but this one should be in another namespace, say `prefect-agent-xxxx`. When I do this, I can successfully submit runs to it and they do get deployed, but they don't seem to actually run and no logs are shown. I've tried configuring the Apollo URL to `http://prefect-apollo.prefect.svc.cluster.local:4200` and also setting an `ExternalName` to it in the `prefect-agent-xxxx` namespace and using that, but neither works. Any ideas on how I could debug this?
:discourse: 1
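[Editor's note: one quick way to narrow down a cross-namespace issue like this is to verify that pods in the new namespace can reach the Apollo endpoint at all. The commands below are an illustrative sketch, assuming the service name from the message above; the `curlimages/curl` probe image is an assumption, any image with curl works.]

```shell
# Launch a throwaway pod in the new namespace and probe the Apollo service
# across namespaces using its fully qualified in-cluster DNS name
kubectl run apollo-probe --rm -it --restart=Never \
  -n prefect-agent-xxxx \
  --image=curlimages/curl -- \
  curl -sv http://prefect-apollo.prefect.svc.cluster.local:4200/graphql
```

If the probe hangs or the connection is refused, the problem is networking or policy between the namespaces rather than the agent configuration itself.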

Kevin Kho

02/24/2022, 9:53 PM
Maybe put debug level logs on the agent?

Gabriel Milan

02/24/2022, 9:55 PM
agent logs seem to be fine
it's just the jobs that keep "running" forever, and then I eventually get this in the UI

Kevin Kho

02/24/2022, 9:57 PM
Oh, I think debug level logs might give more insight, but that does look like it's just unable to get compute. Not sure how to debug off the top of my head; let's see if the community chimes in

Matthias

02/24/2022, 10:52 PM
Maybe try spinning up an agent with `--show-flow-logs`. It could give you more insight into the issue
:upvote: 1

Gabriel Milan

02/24/2022, 11:54 PM
this option doesn't work here, unfortunately; I'm launching the agent with
`prefect agent kubernetes start --job-template {{ .Values.agent.jobTemplateFilePath }}`

Matthias

02/25/2022, 6:11 AM
And why can't you add the additional argument? The job template only references the manifest of the job that the agent is supposed to submit.

Gabriel Milan

02/25/2022, 11:40 AM
because this is not actually an option for the Kubernetes agent

Matthias

02/25/2022, 7:00 PM
What you could do is change the deployment spec manually and deploy that one (just for debugging)

Gabriel Milan

02/25/2022, 7:56 PM
That's what I'm trying to do: I'm changing the deployment command to add the option you mentioned, but it doesn't work

Matthias

02/25/2022, 8:00 PM
Oh yeah, my mistake! It is not part of the agent code

Anna Geller

02/25/2022, 10:08 PM
I haven’t understood what exactly the issue is here - your flow run pods die instantly because they can’t talk to your Server API? It seems that you have successfully deployed your third agent to a separate namespace. Some questions:
Q1: What `label` did you assign to that agent? Your flow runs are correctly deployed and we can see that in the agent logs, so the label shouldn’t be the issue, but it's still worth sharing for debugging.
Q2: Can you inspect the flow run pods and check the logs there? You could check the pods in this namespace and inspect the Kubernetes jobs and pods deployed there. Are flow run and task run states getting updated in your Server backend? You could potentially check that in your Server logs somewhere.
Q3: You wrote “it doesn't seem to actually run and no logs are shown” - what doesn’t run? Do you mean you don’t see the flow run logs and updates being reflected in your Server UI?
Q4: Where did you configure your Server Apollo endpoint - did you set it in the agent manifest as an env variable, as shown here?
env:
  - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
    value: ''
  - name: PREFECT__CLOUD__API
    value: "http://prefect-apollo.prefect.svc.cluster.local:4200/graphql" # paste your GraphQL Server endpoint
  - name: PREFECT__BACKEND
    value: server
Q5: How did you configure the flow runs that got deployed to this agent (`KubernetesRun`)?
Q6: Didn’t you explicitly set the namespace when deploying the YAML file of the `KubernetesAgent`?
Some immediate ideas to check/inspect or try: I would recommend creating a manifest file using:
`prefect agent kubernetes install --rbac > third_agent.yaml`
Then adjusting the env variables as above and deploying it to the desired namespace this way:
`kubectl apply -f third_agent.yaml -n yournamespace`
Then all the flow run Kubernetes jobs should also be deployed to this namespace. After that, only networking and permission issues remain so that your flow run pods can talk to your Server in a separate namespace, and your Service with `ExternalName` seems like the right solution.
kind: Service
apiVersion: v1
metadata:
  name: server-third-agent
  namespace: yournamespace
spec:
  type: ExternalName
  externalName: prefect-apollo.prefect.svc.cluster.local
  ports:
  - port: 80
  - port: 443
  - port: 4200
I’m particularly guessing here when it comes to ports - I have no idea exactly which ports would need to be open, but I would like to open this up for discussion - it may be an issue with ports. Can you then also check the logs of your Server components and the service above to see whether there are any errors or missing permissions? Finally, I would check RBAC on your third agent. It may also be an issue of a missing `RoleBinding` to bind your third agent’s permissions to your Server’s namespace (or both namespaces):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prefect-agent-rbac
  namespace: default # add your prefect-agent-xxxx here
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prefect-agent-rbac
subjects:
- kind: ServiceAccount
  name: default
:upvote: 1
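[Editor's note: the Q2 inspection steps above could look roughly like the following. Namespace, pod, and deployment names are placeholders inferred from this thread, not verified values.]

```shell
# List the flow-run jobs and pods the agent created in the new namespace
kubectl get jobs,pods -n prefect-agent-xxxx

# Inspect one stuck flow-run pod: the events section often reveals
# image-pull, env, scheduling, or network problems
kubectl describe pod <flow-run-pod-name> -n prefect-agent-xxxx
kubectl logs <flow-run-pod-name> -n prefect-agent-xxxx --all-containers

# Check the Server side as well, e.g. the Apollo component
# (deployment name assumed from the service name used above)
kubectl logs deploy/prefect-apollo -n prefect
```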

Gabriel Milan

02/25/2022, 11:02 PM
Before I proceed to the questions: no, the flow run pods don't die. They're there "running" forever.
Q1: the `datario` label
Q2: there are no logs whatsoever in the flow run pods. I can only see a change of state in the UI when the run is actually submitted, but nothing else. Where could I get Server logs?
Q3: yes, and there are also no logs on the pods themselves
Q4: the only env var you've shown that is not set on my agent is `PREFECT__CLOUD__AGENT__AUTH_TOKEN`, could that be a problem? All of my other agents work without it
Q5: I've set them using `flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value)`. This `constants.DOCKER_IMAGE.value` is a valid Docker image, the same one I'm using for other agents
Q6: I've deployed it by doing `helm upgrade --install prefect-agent -n <namespace> <mychart> -f values.yaml`. The chart I'm using is this one, and my `values.yaml` file looks like this:
agent:
  apollo_url: http://prefect-apollo.prefect.svc.cluster.local:4200/
  env: []
  image:
    name: prefecthq/prefect
    tag: 0.15.9
  job:
    resources:
      limits:
        cpu: ''
        memory: ''
      requests:
        cpu: ''
        memory: ''
  jobTemplateFilePath: myjobtemplateurl.yaml
  name: prefect-agent
  prefectLabels:
  - datario
  replicas: 1
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
  serviceAccountName: prefect-agent
the job template looks like this
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: flow
          envFrom:
            - secretRef:
                name: gcp-credentials
            - secretRef:
                name: vault-credentials
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /mnt/creds.json
          volumeMounts:
            - name: gcp-sa
              mountPath: /mnt/
              readOnly: true
      volumes:
        - name: gcp-sa
          secret:
            secretName: gcp-sa
and all of those secrets are properly configured. Finally, I just wanted to add that I'll check those steps you've mentioned and get back asap
Alright, I found it out. Turns out the "issue" was our Docker image for runs: as we're using linkerd for deploying agents on multiple k8s clusters, our image uses `linkerd-await` to block on linkerd readiness. This third agent was deployed in a non-linkerd-injected namespace, thus "awaiting" forever on readiness. That's why our run pods would never die, show logs, or update their state. After I injected the namespace with linkerd, everything works. Thank you so much for the effort in understanding our scenario and all the help!
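[Editor's note: for readers hitting the same problem, enabling linkerd injection for a namespace can be sketched as follows. The namespace name is a placeholder from this thread; `linkerd.io/inject: enabled` is the standard linkerd proxy-injector annotation.]

```shell
# Mark the namespace so the linkerd proxy injector adds the sidecar
# to any newly created pods (including agent-submitted flow-run pods)
kubectl annotate namespace prefect-agent-xxxx linkerd.io/inject=enabled

# Existing workloads must be restarted to pick up the sidecar
kubectl rollout restart deployment -n prefect-agent-xxxx
```

Without the sidecar present, `linkerd-await` in the flow-run image blocks forever waiting for proxy readiness, which matches the "running forever with no logs" symptom above.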

Anna Geller

02/26/2022, 1:19 AM
I'm glad you found it out, since I've never heard of linkerd 😅 It sounds like you spent a lot of time configuring all this, and you know a lot about managing Prefect Server with Kubernetes and Helm. Did you think about writing this up into a GitHub repo, README, or blog post? :) Absolutely no pressure, but if you'd like to share your setup, I'm sure many users could benefit from your knowledge in whichever form you'd choose. Even opening a topic on discourse.prefect.io with a couple of bullet points and code snippets might be insightful.
👍 2

Gabriel Milan

02/26/2022, 3:50 PM
This is something we're planning to do in the near future for the city hall. I'll be glad to translate it and share it with you then!
:upvote: 1
🙌 2