Prefect Community #ask-community

Robin Weiß

07/04/2022, 1:20 PM

Hey community! I am currently trying to deploy a K8s Orion setup. After quite a few startup problems, I thought I had finally made it. Unfortunately, I now see very weird behaviour:

• The agent pod keeps restarting in a CrashLoop. The error message is a very lengthy HTTP read timeout. The abbreviated message is:

...
File "/usr/local/lib/python3.9/site-packages/prefect/client.py", line 834, in read_work_queue_by_name
...
httpx.ReadTimeout
An exception occurred.

• The agent gives these weird log messages:

MarkLateRuns took 26.306307 seconds to run, which is longer than its loop interval of 5.0 seconds.
FlowRunNotifications took 30.444981 seconds to run, which is longer than its loop interval of 4 seconds.
MarkLateRuns took 30.619028 seconds to run, which is longer than its loop interval of 5.0 seconds.

My guess is that something is really slowing the container down so that it runs into connection timeout issues as it doesn’t reply in time. Does anyone have any idea where to look further? The error message unfortunately gives me zero insights on the matter 😞 Thanks!
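To put numbers on how far those services fall behind, the overrun lines quoted above can be parsed and compared against their loop intervals. This is only a minimal sketch: the regular expression is written against the three lines pasted in this thread, and the helper name `parse_overruns` is made up for illustration, not part of Prefect.

```python
import re

# Matches lines like the ones pasted above, e.g.
# "MarkLateRuns took 26.306307 seconds to run, which is longer than
#  its loop interval of 5.0 seconds."
OVERRUN = re.compile(
    r"^(?P<service>\w+) took (?P<took>\d+(?:\.\d+)?) seconds to run, "
    r"which is longer than its loop interval of "
    r"(?P<interval>\d+(?:\.\d+)?) seconds\.$"
)


def parse_overruns(lines):
    """Return (service, seconds_taken, loop_interval, ratio) per matching line."""
    results = []
    for line in lines:
        match = OVERRUN.match(line.strip())
        if match is None:
            continue  # ignore anything that is not an overrun message
        took = float(match["took"])
        interval = float(match["interval"])
        results.append((match["service"], took, interval, took / interval))
    return results


# The three log lines from the message above.
LOG = """\
MarkLateRuns took 26.306307 seconds to run, which is longer than its loop interval of 5.0 seconds.
FlowRunNotifications took 30.444981 seconds to run, which is longer than its loop interval of 4 seconds.
MarkLateRuns took 30.619028 seconds to run, which is longer than its loop interval of 5.0 seconds.
"""

for service, took, interval, ratio in parse_overruns(LOG.splitlines()):
    print(f"{service}: {took:.1f}s vs {interval:.1f}s interval ({ratio:.1f}x over)")
```

A ratio that stays far above 1 across many lines suggests the service loop is consistently starved or blocked rather than hitting an occasional slow run.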

Anna Geller

07/04/2022, 1:28 PM

A CrashLoop often happens when the image can't be pulled. To point you to some resources, you can try:
• helm chart for Orion https://github.com/PrefectHQ/prefect-helm/tree/main/charts/prefect-orion
• helm chart for agent https://github.com/PrefectHQ/prefect-helm/tree/main/charts/prefect-agent
• list of self-hosted resources https://discourse.prefect.io/t/how-to-self-host-prefect-2-0-orchestration-layer-list-of-resources-to-get-started/952

Generally, for self-hosted deployments, we would like to see community contributions. You could share your setup as a tutorial, and perhaps that way a user who has done this can point out what's missing?

Robin Weiß

07/04/2022, 1:30 PM

Thanks for your reply! I think the images must have been pulled correctly because the Prefect containers are all up and running. The crash happens after a few minutes only. I see the logs of the API and agent before it breaks. Any clue as to what the super long times for MarkLateRuns and FlowRunNotifications could mean? 🤔

Anna Geller

07/04/2022, 2:44 PM

Maybe some late flow runs? Can you inspect the state of your work queue?

Robin Weiß

07/04/2022, 3:19 PM

Unfortunately, the Prefect CLI also keeps throwing very weird and extremely long HTTP error stack traces 😅

jawnsy

07/04/2022, 4:06 PM

Possibly a network connectivity issue between the agent and the Orion server? Did the Orion server pod start correctly, and is it marked ready? Can you hit the API from your local machine? Do you have network policies or anything else in the cluster that might interfere with traffic?
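One way to check whether the API is reachable and how quickly it answers is to time a few repeated requests against it. The sketch below uses only the Python standard library. The function name `probe` and its `opener` parameter are hypothetical helpers introduced here, and the URL you pass in is an assumption that depends on how your Orion service is exposed.

```python
import time
import urllib.request


def probe(url, attempts=5, timeout=5.0, opener=urllib.request.urlopen):
    """Send repeated GET requests and return a list of (ok, seconds) samples.

    `opener` is injectable so the function can be exercised without a network.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with opener(url, timeout=timeout) as response:
                ok = 200 <= response.status < 300
        except OSError:
            # URLError, HTTPError, socket timeouts, and refused connections
            # all derive from OSError.
            ok = False
        samples.append((ok, time.perf_counter() - start))
    return samples


def summarize(samples):
    """Return (number of failed requests, slowest request in seconds)."""
    failures = sum(1 for ok, _ in samples if not ok)
    slowest = max((seconds for _, seconds in samples), default=0.0)
    return failures, slowest
```

For example, `summarize(probe("http://localhost:4200/api/health"))` would report failures and the slowest response, assuming the API has been port-forwarded to local port 4200 and serves a health route at that path. Running it in a loop for a few minutes may show whether responses slow down shortly before the agent crashes.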

Robin Weiß

07/04/2022, 4:19 PM

Hey! The really interesting thing is that it’s not consistent. It will work most of the time, but every few minutes, this error causes the Pod to crash and restart

Robin Weiß

07/04/2022, 4:19 PM

So I’m fairly sure basic connectivity works since the whole setup works in general, just not reliably over time

jawnsy

07/04/2022, 4:22 PM

Based on your initial thought of a CPU throttling problem, have you tried increasing the CPU requests so that k8s guarantees more minimum capacity?
