# prefect-community
r
Hey community! I am currently trying to deploy a K8s Orion setup. After quite a few startup problems, I thought I had finally made it. Unfortunately, I now see some very weird behaviour:
• The agent pod keeps restarting in a CrashLoop. The error is a very lengthy HTTP read timeout. The abbreviated message is:
...
File "/usr/local/lib/python3.9/site-packages/prefect/client.py", line 834, in read_work_queue_by_name
...
httpx.ReadTimeout
An exception occurred.
• The agent gives these weird log messages:
MarkLateRuns took 26.306307 seconds to run, which is longer than its loop interval of 5.0 seconds.
FlowRunNotifications took 30.444981 seconds to run, which is longer than its loop interval of 4 seconds.
MarkLateRuns took 30.619028 seconds to run, which is longer than its loop interval of 5.0 seconds.
My guess is that something is really slowing the container down so that it runs into connection timeout issues as it doesn’t reply in time. Does anyone have any idea where to look further? The error message unfortunately gives me zero insights on the matter 😞 Thanks!
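To put numbers on the warnings quoted above, here is a minimal sketch that parses log lines of that shape and reports the worst overrun per service. The regular expression is inferred only from the three lines pasted in this thread, and `summarize_overruns` is a hypothetical helper name, not part of Prefect.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official Prefect log format specification).
LOOP_WARNING = re.compile(
    r"(?P<service>\w+) took (?P<took>[\d.]+) seconds to run, "
    r"which is longer than its loop interval of (?P<interval>[\d.]+) seconds"
)


def summarize_overruns(log_text):
    """Return {service: (worst_seconds, interval_seconds, worst_ratio)}."""
    worst = {}
    for match in LOOP_WARNING.finditer(log_text):
        took = float(match["took"])
        interval = float(match["interval"])
        previous = worst.get(match["service"])
        # Keep only the slowest run seen so far for each service.
        if previous is None or took > previous[0]:
            worst[match["service"]] = (took, interval, took / interval)
    return worst


sample = """\
MarkLateRuns took 26.306307 seconds to run, which is longer than its loop interval of 5.0 seconds.
FlowRunNotifications took 30.444981 seconds to run, which is longer than its loop interval of 4 seconds.
MarkLateRuns took 30.619028 seconds to run, which is longer than its loop interval of 5.0 seconds.
"""

for service, (took, interval, ratio) in summarize_overruns(sample).items():
    print(f"{service}: worst run {took:.1f}s vs {interval:g}s interval ({ratio:.1f}x)")
```

For the lines above, both services take roughly six to seven times their loop interval, which fits the guess that something is slowing the whole process down rather than one service misbehaving.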
a
A CrashLoop often happens when the image can't be pulled. To point you to some resources you can try:
• Helm chart for Orion: https://github.com/PrefectHQ/prefect-helm/tree/main/charts/prefect-orion
• Helm chart for the agent: https://github.com/PrefectHQ/prefect-helm/tree/main/charts/prefect-agent
• List of self-hosted resources: https://discourse.prefect.io/t/how-to-self-host-prefect-2-0-orchestration-layer-list-of-resources-to-get-started/952
Generally, for self-hosted deployments, we would like to see community contributions. You could share your setup as a tutorial, and perhaps that way a user who has already done this can point you to what's missing?
r
Thanks for your reply! I think the images must have been pulled correctly because the Prefect containers are all up and running. The crash only happens after a few minutes. I can see the logs of the API and the agent before it breaks. Any clue as to what the super long times for `MarkLateRuns` and `FlowRunNotifications` could mean? 🤔
a
Maybe some late flow runs? Can you inspect the state of your work queue?
r
Unfortunately, the Prefect CLI also keeps throwing very weird and extremely long HTTP error stack traces 😅
j
Possibly a network connectivity issue between the agent and the orion server? Did the orion server pod start correctly/is it marked ready, can you hit the API from your local machine, do you have network policies or anything else in the cluster that might interfere with traffic?
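One way to check whether the API is reachable but intermittently slow is to time repeated requests from wherever the agent runs. The sketch below uses only the standard library. The URL in the usage note is an assumption (the default Orion port 4200 and an `/api/health` route), so substitute whatever address your agent actually uses. `time_request` and `probe` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def time_request(url, timeout=5.0):
    """Send one GET request and return (ok, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Refused connections, DNS failures, timeouts, and HTTP error
        # statuses all count as a failed probe.
        ok = False
    return ok, time.monotonic() - start


def probe(url, attempts=10, pause=1.0, timeout=5.0):
    """Repeat time_request and collect the results in order."""
    results = []
    for _ in range(attempts):
        results.append(time_request(url, timeout))
        time.sleep(pause)
    return results
```

For example, `probe("http://localhost:4200/api/health", attempts=30)` run over a few minutes would show whether latency spikes or failures line up with the moments the agent crashes. If every probe is fast and successful while the agent still times out, basic connectivity is less likely to be the cause.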
r
Hey! The really interesting thing is that it’s not consistent. It works most of the time, but every few minutes this error causes the Pod to crash and restart
So I’m fairly sure basic connectivity works since the whole setup works in general, just not reliably over time
j
Based on your initial thought that something is slowing the container down (possibly CPU throttling), have you tried increasing the CPU requests, so that k8s guarantees more minimum capacity?
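Before changing resource settings, it may help to confirm whether the container is being throttled at all. On Linux, the kernel exposes throttling counters in the cgroup `cpu.stat` file. The sketch below reads them from inside the container and reports the fraction of scheduler periods that were throttled. The candidate paths are the usual cgroup v2 and v1 locations, but which one exists depends on the node, so treat them as assumptions. The helper names are hypothetical.

```python
from pathlib import Path

# Usual cpu.stat locations: cgroup v2 first, then common cgroup v1 mounts.
# Which path exists depends on the node setup (an assumption to verify).
CANDIDATE_PATHS = (
    "/sys/fs/cgroup/cpu.stat",
    "/sys/fs/cgroup/cpu/cpu.stat",
    "/sys/fs/cgroup/cpu,cpuacct/cpu.stat",
)


def parse_cpu_stat(text):
    """Parse 'key value' lines from cpu.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Return nr_throttled / nr_periods, or None if no periods are recorded."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return None
    return stats.get("nr_throttled", 0) / periods


def read_cpu_stat():
    """Return parsed stats from the first existing candidate path, else None."""
    for candidate in CANDIDATE_PATHS:
        path = Path(candidate)
        if path.is_file():
            return parse_cpu_stat(path.read_text())
    return None


stats = read_cpu_stat()
if stats is None:
    print("No cpu.stat found (not Linux, or cgroup files are not mounted here).")
else:
    fraction = throttled_fraction(stats)
    if fraction is None:
        print("No CPU quota periods recorded, so no throttling data is available.")
    else:
        print(f"Throttled in {fraction:.1%} of scheduler periods.")
```

A high throttled fraction would suggest the container is hitting its CPU limit, in which case raising the limit (and the request) is worth trying. A fraction near zero would point away from throttling and back toward the database or the network.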