# prefect-community
i
Hi all - is there a way to enable verbose debugging on an agent in Prefect 2, or another method to troubleshoot an agent seemingly not connecting to a workspace?
👀 1
j
Hello @Ilya Galperin, we don't have a way to enable more verbose logs on the agent. When the agent is started, does it say that it's looking for work from the right work queue? e.g.
Agent started! Looking for work from queue(s): workque-chosen...
i
Hi Jean — it is pointing to the correct queue and the correct API URL, but we can’t tell if it’s actually succeeding in authenticating
j
Where is your agent running?
i
It is running on an EKS cluster
c
Hi @Ilya Galperin - this can be verified pretty quickly - if you are passing in a secret for your API key, you can modify the secret to something you know to be incorrect - the agent will then fail to connect, with a 401/403
if you aren’t seeing a 401/403, then it should be successfully connected. You can also review the API/agent logs individually
Why do you suspect it’s incorrect?
i
Thanks Christopher. I’m not entirely sure if it’s the API key or something else. We are not seeing a 401/403 error. The agent log looks normal:
Copy code
prefect-agent-8b76465df-sq4g7 agent Starting v2.3.2 agent connected to 
prefect-agent-8b76465df-sq4g7 agent https://api.prefect.cloud/api/accounts/ACCOUNTID/work
prefect-agent-8b76465df-sq4g7 agent spaces/WORKSPACEID...
prefect-agent-8b76465df-sq4g7 agent 
prefect-agent-8b76465df-sq4g7 agent   ___ ___ ___ ___ ___ ___ _____     _   ___ ___ _  _ _____
prefect-agent-8b76465df-sq4g7 agent  | _ \ _ \ __| __| __/ __|_   _|   /_\ / __| __| \| |_   _|
prefect-agent-8b76465df-sq4g7 agent  |  _/   / _|| _|| _| (__  | |    / _ \ (_ | _|| .` | | |
prefect-agent-8b76465df-sq4g7 agent  |_| |_|_\___|_| |___\___| |_|   /_/ \_\___|___|_|\_| |_|
prefect-agent-8b76465df-sq4g7 agent 
prefect-agent-8b76465df-sq4g7 agent 
prefect-agent-8b76465df-sq4g7 agent Agent started! Looking for work from queue(s): main...
However, no flows from our “main” queue are being picked up by this agent, which makes me think there might be a communication problem between the agent and the server, whether it’s an incorrect API key or something else. Are there other logs or mechanisms we can use to validate that it is connecting successfully… you mentioned API logs?
One reason I thought it might be the API key or something with our configuration is that we didn’t see any errors being thrown when pointing the agent to an entirely incorrect URL (e.g. google.com), so it doesn’t seem like it cares whether there’s a real connection?
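For reference, that "does the agent actually authenticate?" question can be checked by hand from any machine with the same env vars. This is a minimal stdlib sketch, not Prefect's client code: the endpoint path (`/work_queues/name/<queue>`) and the Bearer auth scheme are assumptions based on the error URLs quoted in this thread.

```python
import os
import urllib.error
import urllib.request


def queue_url(api_url: str, queue: str) -> str:
    """Build the work-queue lookup URL (path shape is an assumption)."""
    return f"{api_url.rstrip('/')}/work_queues/name/{queue}"


def check_agent_auth(api_url: str, api_key: str, queue: str = "main") -> int:
    """Return the HTTP status of an authenticated work-queue lookup."""
    req = urllib.request.Request(
        queue_url(api_url, queue),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 401/403 suggests a bad key; 404 suggests a bad URL or missing queue
        return exc.code


if __name__ == "__main__" and "PREFECT_API_URL" in os.environ:
    status = check_agent_auth(
        os.environ["PREFECT_API_URL"], os.environ["PREFECT_API_KEY"]
    )
    print(f"work queue lookup returned HTTP {status}")
```

A 200 here with the same URL and key the agent uses would point the finger away from credentials.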
c
How did you deploy your agent? Is this in Kubernetes, a VM, or local?
It should definitely care if it's not a valid URL or it's an invalid API key, but I can't tell much more without seeing the configuration, unfortunately
i
It is in Kubernetes. We’re doing some more troubleshooting right now to see if this is a networking configuration issue with the cluster this agent is running on. If that were to be the case, would we still see an error? And if so, what error?
We definitely did not see an error when pointing to google.com
c
How are you viewing / retrieving logs ?
i
Are we sure there is logging for this on 2.3.2? If so, it might be missing some edge case that we’re running into?
We are using kubectl to see the logs of the pod the agent is running on
c
Can you share the deployment manifest you used to deploy this agent?
You can redact api key / workspace
i
Copy code
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-agent
  namespace: prefect2
spec:
  selector:
    matchLabels:
      app: prefect-agent
  replicas: 1
  template:
    metadata:
      labels:
        app: prefect-agent
    spec:
      serviceAccountName: default
      containers:
      - name: agent
        image: prefecthq/prefect:${var.prefect_tag}
        command: ["prefect", "agent", "start", "-q", "main"]
        imagePullPolicy: "IfNotPresent"
        env:
          - name: PREFECT_API_URL
            value: ${var.prefect_api_url}
          - name: PREFECT_API_KEY
            valueFrom:
              secretKeyRef:
                name: prefect-cloud 
                key: api_key
`${var.prefect_tag}` in this case resolves to “2.3.2-python3.10”, and if you plug in prefect_api_url = “http://google.com” in this configuration we don’t get any error on the agent pod
c
Ok give me a few minutes
👍 1
i
Thank you
c
I’m deploying a fresh instance currently with the following:
Copy code
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orion
  namespace: prefect2
spec:
  selector:
    matchLabels:
      app: orion
  replicas: 1  # We're using SQLite, so we should only run 1 pod
  template:
    metadata:
      labels:
        app: orion
    spec:
      containers:
      - name: api
        image: prefecthq/prefect:2.3.2-python3.10
        command: ["prefect", "orion", "start", "--host", "0.0.0.0", "--log-level", "WARNING"]
        imagePullPolicy: "IfNotPresent"
        ports:
        - containerPort: 4200
      - name: agent
        image: prefecthq/prefect:2.3.2-python3.10
        command: ["prefect", "agent", "start", "kubernetes"]
        imagePullPolicy: "IfNotPresent"
        env:
          - name: PREFECT_API_URL
            value: https://api.prefect.cloud/api/accounts/<accountid>/workspaces/<workspaceid>
          - name: PREFECT_API_KEY
            value: bad_value

---
apiVersion: v1
kind: Service
metadata:
  name: orion
  namespace: prefect2
  labels:
    app: orion
spec:
  ports:
    - port: 4200
      protocol: TCP
  selector:
    app: orion
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: prefect2
  name: flow-runner
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "pods/status"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: [ "get", "list", "watch", "create", "update", "patch", "delete" ]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flow-runner-role-binding
  namespace: prefect2
subjects:
- kind: ServiceAccount
  name: default
  namespace: prefect2
roleRef:
  kind: Role
  name: flow-runner
  apiGroup: rbac.authorization.k8s.io
once this comes up, I’ll update with a bad API URL (e.g., like you said, google), and verify with a good API key afterwards for both
I received the following for the agent:
Copy code
prefect.exceptions.PrefectHTTPStatusError: Client error '401 Unauthorized' for url 'https://api.prefect.cloud/api/accounts/<redact>/workspaces/<redact>/work_queues/name/kubernetes'
Response: {'detail': 'Invalid authentication credentials'}
For more information check: https://httpstatuses.com/401
An exception occurred.
The following for the api:
Copy code
Configure Prefect to communicate with the server with:

    prefect config set PREFECT_API_URL=http://0.0.0.0:4200/api

View the API reference documentation at http://0.0.0.0:4200/docs

Check out the dashboard at http://0.0.0.0:4200



INFO:     Started server process [9]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:4200 (Press CTRL+C to quit)
updating to a good API key first, then to a bad API URL
with a good API key (same manifest, just modified the runtime API key):
Copy code
Agents now support multiple work queues. Instead of passing a single argument, 
provide work queue names with the `-q` or `--work-queue` flag: `prefect agent 
start -q kubernetes`

Starting v2.3.2 agent connected to 
https://api.prefect.cloud/api/accounts/redact/work
spaces/redact...

  ___ ___ ___ ___ ___ ___ _____     _   ___ ___ _  _ _____
 | _ \ _ \ __| __| __/ __|_   _|   /_\ / __| __| \| |_   _|
 |  _/   / _|| _|| _| (__  | |    / _ \ (_ | _|| .` | | |
 |_| |_|_\___|_| |___\___| |_|   /_/ \_\___|___|_|\_| |_|


Agent started! Looking for work from queue(s): kubernetes...
with a bad URL:
Copy code
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 88, in get_work_queues
    work_queue = await self.client.create_work_queue(name=name)
  File "/usr/local/lib/python3.10/site-packages/prefect/client.py", line 835, in create_work_queue
    response = await self._client.post("/work_queues/", json=data)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1842, in post
    return await self.request(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/usr/local/lib/python3.10/site-packages/prefect/client.py", line 279, in send
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/prefect/client.py", line 225, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Client error '404 Not Found' for url 'http://google.com/work_queues/'
For more information check: https://httpstatuses.com/404
Agent communications should be outbound only: the agent sleeps, then checks in with the API for work before sleeping again, so an inbound connection or ingress isn’t required
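To make that outbound-only pattern concrete, here is a sketch of the loop being described. This is not Prefect's actual agent code; the function and parameter names are hypothetical.

```python
import time
from typing import Callable, Iterable, Optional


def poll_for_work(
    fetch: Callable[[], Iterable[str]],
    handle: Callable[[str], None],
    interval: float = 5.0,
    max_polls: Optional[int] = None,
) -> int:
    """Outbound-only polling: ask the API for work, process it, sleep, repeat.

    Nothing ever connects *in* to this process, so no ingress is needed.
    Returns the number of poll cycles performed (bounded by max_polls, if set).
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        for item in fetch():  # outbound request to the API
            handle(item)
        polls += 1
        time.sleep(interval)
    return polls
```

Because every request originates from the pod, only egress (and working DNS) matters for this kind of agent.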
i
Sorry, I’m not 100% following which test was which. Are you receiving an error when passing in a bad URL but a valid-looking key?
c
I receive a successful configuration with a good key + good URL. I receive a 401 with a bad key and a good URL. I receive a 404 with a bad URL and a good key.
i
Also, would the `Service` manifest here have any impact on connectivity between Cloud and the agent? My understanding is that this would only be for communication between pods spawned by the agent and the agent itself, right?
That is really strange, our bad URL log looked like this…
Copy code
Starting v2.3.0 agent connected to http://google.com...

  ___ ___ ___ ___ ___ ___ _____     _   ___ ___ _  _ _____
 | _ \ _ \ __| __| __/ __|_   _|   /_\ / __| __| \| |_   _|
 |  _/   / _|| _|| _| (__  | |    / _ \ (_ | _|| .` | | |
 |_| |_|_\___|_| |___\___| |_|   /_/ \_\___|___|_|\_| |_|


Agent started! Looking for work from queue(s): main...
(this was on 2.3.0 not sure if that makes a difference)
c
the Service manifest should have no relevance here; that’s just how the service is being exposed to reach from the outside in
i
Gotcha ok
c
and yes, I receive the same top-of-stack message with google:
Copy code
Agents now support multiple work queues. Instead of passing a single argument, 
provide work queue names with the `-q` or `--work-queue` flag: `prefect agent 
start -q kubernetes`

Starting v2.3.2 agent connected to https://google.com...

  ___ ___ ___ ___ ___ ___ _____     _   ___ ___ _  _ _____
 | _ \ _ \ __| __| __/ __|_   _|   /_\ / __| __| \| |_   _|
 |  _/   / _|| _|| _| (__  | |    / _ \ (_ | _|| .` | | |
 |_| |_|_\___|_| |___\___| |_|   /_/ \_\___|___|_|\_| |_|


Agent started! Looking for work from queue(s): kubernetes...
the bottom of the stack is
Copy code
prefect.exceptions.PrefectHTTPStatusError: Client error '404 Not Found' for url 'https://google.com/work_queues/'
For more information check: https://httpstatuses.com/404
I would say if your API key is correct and your URL is correct (you can verify by checking `prefect config view` whether your workspace is set locally in your CLI), then your agent is successfully configured
assuming there are no more detailed logs with errors like above
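As a cheap complement to `prefect config view`, the URL half can be sanity-checked offline, since Cloud workspace URLs have a predictable shape. A small sketch, assuming account and workspace IDs are UUIDs (which is how they appear in the URLs quoted above):

```python
import re

# Prefect Cloud workspace API URLs look like:
#   https://api.prefect.cloud/api/accounts/<uuid>/workspaces/<uuid>
_UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
CLOUD_URL = re.compile(
    rf"^https://api\.prefect\.cloud/api/accounts/{_UUID}/workspaces/{_UUID}$"
)


def looks_like_cloud_url(url: str) -> bool:
    """True if the URL has the Cloud workspace shape (no network needed)."""
    return bool(CLOUD_URL.match(url.rstrip("/")))
```

This would immediately flag a value like `http://google.com` before the agent ever starts.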
i
Ya, we are definitely not seeing that in our log 😕
We are thinking it might be a DNS issue on the cluster
I wonder if there is an `except` somewhere in the agent that is swallowing that output
Will update as we continue troubleshooting
c
what you can probably try out of curiosity
is shell exec into one of the pods
and curl the api / auth login
i
ah thats not a bad idea
c
see if you can reach it manually from a pod shell
i
We will try this as well, thank you Christopher
Will report back
Sorry - do we know whether the Linux distribution included in the agent image has curl?
c
I can check, but if not, it can be added pretty quickly to a new image
I'll check
curl isn't native in the image, but prefect is
you shell in and do `prefect cloud login` with your API key
i
ah ok
thank you Christopher
c
via -
prefect cloud login --key <key>
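Since the image ships Python but not curl, a stdlib-only reachability check also works from a pod shell (e.g. pasted into `python3` via `kubectl exec`). A sketch that distinguishes "the server answered" from DNS and connection failures:

```python
import socket
import urllib.error
import urllib.request


def can_reach(url: str, timeout: float = 5.0) -> str:
    """Classify a connection attempt: HTTP answer, DNS failure, or other failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        return f"HTTP {exc.code}"  # the server answered, even if with an error
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.gaierror):
            return "DNS resolution failed"
        return f"connection failed: {exc.reason}"
```

Any `HTTP ...` result (even a 401) proves the pod can reach the endpoint; "DNS resolution failed" points at the cluster's resolver instead.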
i
Just an update, Chris - we tried installing curl on this pod just to check, but we’re running into a DNS error which is preventing us from doing so.
Copy code
# apt-get install curl
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package curl
# apt-get update
Err:1 http://deb.debian.org/debian bullseye InRelease
  Temporary failure resolving 'deb.debian.org'
Which makes us think it’s possible that something in the agent is “eating” a DNS-related error that would otherwise be surfaced, since we’re definitely not connecting to Cloud or anything else outside the container
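That DNS theory is cheap to test directly from inside the pod without installing anything, since the image has Python. A minimal sketch (the hostnames are just the ones already seen failing in this thread):

```python
import socket


def can_resolve(host: str) -> bool:
    """True if the container's resolver can turn the name into an address."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    for host in ("api.prefect.cloud", "deb.debian.org"):
        print(host, "->", "resolves" if can_resolve(host) else "DNS failure")
```

Run via `kubectl exec` into the agent pod with `python3`; a "DNS failure" for `api.prefect.cloud` would confirm the cluster-DNS explanation.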
c
you can try creating a standalone image (no curl, but you can use pip to install httpie, which is just fancier curl) -
Copy code
cat DOCKERFILE
FROM prefecthq/prefect:2.3.2-python3.10
RUN pip install httpie
then:
Copy code
export IMAGE_REGISTRY="my_image_registry"
export PROJECT_NAME="prefect-httpie-3.10"
export PROJECT_VERSION="latest"

echo "$IMAGE_REGISTRY/$PROJECT_NAME:$PROJECT_VERSION"
docker build --platform=linux/amd64 -t "$IMAGE_REGISTRY/$PROJECT_NAME:$PROJECT_VERSION" -f ./DOCKERFILE .
docker push "$IMAGE_REGISTRY/$PROJECT_NAME:$PROJECT_VERSION"
then update your image tags with my_image_registry/prefect-httpie-3.10:latest
👍 1
i
I think we’ll probably run into another dns error here either way, but we will check, thanks!
c
no problem - this would all be local on your system
unless the issue is pulling the image
http usage is functionally the same as curl, just with better presentation:
Copy code
http https://api.prefect.io
HTTP/1.1 400 Bad Request
Access-Control-Allow-Origin: *
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Content-Length: 18
Content-Type: text/html; charset=utf-8
Date: Mon, 12 Sep 2022 19:24:13 GMT
ETag: W/"12-7JEJwpG8g89ii7CR/6hhfN27Q+k"
Set-Cookie: route=1663010654.252.42.320963|0e12785838f6d88c8042f8a55371e727; Expires=Wed, 14-Sep-22 19:24:13 GMT; Max-Age=172800; Path=/; Secure; HttpOnly
Strict-Transport-Security: max-age=15724800; includeSubDomains
Via: 1.1 google
X-Powered-By: Express

GET query missing.
i
ya I think the issue here would be pulling the image
c
then yea, that definitely sounds like dns
i
since it doesn’t seem like we can connect to anything outside the container running the agent