Prefect Community #ask-community

Nick Coy

09/15/2022, 5:34 PM

Hello, I have been testing out Prefect 2.0. We have been using 1.0 for over a year and love it. I have a GKE Autopilot cluster set up on GCP with an agent running in a workload. I have been testing a very simple flow, and everything is working well. However, when I look at the logs from the agent workload, it picks up the flow run, says the pod is starting, and then says the pod never started. The pods do start, the flow runs, and I see it's successful in the UI. I am just wondering why the agent says the pod was never started. This is my first time using Kubernetes, btw, so bear with me.

Nate

09/15/2022, 6:08 PM

Hi @Nick Coy 👋 how did you deploy your agent?

Nick Coy

09/15/2022, 6:15 PM

Hi @Nate, I created a workload that uses a Prefect image with our dependencies and a start command of prefect agent start <name of queue>. I also added the api_url and api_key to the workload.

Nate

09/15/2022, 6:25 PM

gotcha, do you mind sharing agent logs for such an example?

Nick Coy

09/15/2022, 6:33 PM

Here is an example. The pod started and ran successfully

Nate

09/15/2022, 6:46 PM

I've had a similar issue in the past when my kubernetes-job block was pointing at the wrong k8s service account. My first guess, based on the info I have, is that there's something strange about the IAM permissions on the workload running the agent process. Another thought: are you on prefect >= 2.3.0? (2.4.0 is the most recent, in case a version mismatch could be the culprit.)

Nick Coy

09/15/2022, 6:48 PM

ah, I did upgrade to 2.4.0 yesterday, but the agent is running 2.3, I believe

Nick Coy

09/15/2022, 6:49 PM

I'll try restarting the agent workload to see if that fixes the issue

Ilya Galperin

09/15/2022, 10:01 PM

I think we fixed this by specifying a longer pod timeout in the Kubernetes infrastructure definition in the deployment. It defaults to 60 seconds.
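
For illustration, a rough sketch of what that might look like in the infrastructure section of a Prefect 2 deployment YAML. This assumes the timeout in question is the kubernetes-job block's pod_watch_timeout_seconds field; the message above does not name the field, and the value 300 is an arbitrary example.

```yaml
infrastructure:
  type: kubernetes-job
  # Assumed field name; the default is 60 seconds per the message above.
  # 300 is an illustrative value, not a recommendation.
  pod_watch_timeout_seconds: 300
```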

Nick Coy

09/15/2022, 11:26 PM

I found the issue: it was with my cluster role. I had the rules below. Once I added "pods/status" to the resources, that fixed the issue.

rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
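
For reference, a sketch of the rules with "pods/status" added, as described in the message above. Placing it in the same resources list as "pods" and "pods/log" is an assumption; the original message does not show the final file.

```yaml
rules:
- apiGroups: [""]
  # "pods/status" added here (assumed placement); it grants read access
  # to the pods/status subresource alongside pods and pod logs.
  resources: ["pods", "pods/log", "pods/status"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```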
