# prefect-community
o
Hi everyone, we're using Prefect Server v1.2.2 on premises, hosted on Kubernetes, and everything is working perfectly. However, every now and then we get the occasional error:
Error during execution of task: ConnectTimeout(MaxRetryError("HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f32b0d48910>, 'Connection to prefect-apollo.prefect timed out. (connect timeout=60)'))"))
This seems to affect all tasks running at the time: for example, if I have 2 or 3 tasks running, they all fail at the exact same moment with this error.
k
I think this is a sign your graphql pod is resource-constrained. You can try increasing its resources.
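For a Helm-based install, bumping the graphql component's resources might look roughly like the values override below. This is a sketch, not the chart's exact schema: the `graphql.resources` key layout and the numbers are assumptions, so verify against `helm show values` for your chart version.

```yaml
# values.yaml override -- a sketch, not the chart's exact schema.
# The graphql/resources key layout is an assumption; confirm it with
# `helm show values` for the Prefect Server chart version you run.
graphql:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
```

Applied with something like `helm upgrade <release> <chart> -f values.yaml`.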
o
Hi Kevin, would adding more replicas help? I monitored the container resource usage and they don't seem to be anywhere near the limits of the node they're running on, so I was thinking of adding additional replicas via the Helm config. What do you think?
k
Do you have a Flow with a lot of tasks? Or maybe the database has too much data?
o
The biggest flow has around 10-12 tasks; some of them run other flows, so it's acting more like an orchestrator. I tried deleting the data completely from the database, so that there is no history at all, but the issue is still happening.
k
That’s very weird if there’s no data and requests are still taking 60 seconds. Does it happen when you run one flow at a time, or only with concurrency?
o
Most of the time it's when there are multiple flows running, but it also happened when there was only one flow running. Do you think replicas for the Apollo and GraphQL pods could help?
Hi Kevin, just wanted to check if you have any thoughts on the replicas approach
k
Oh, sorry for my late response here. I’ve been hesitating because yes, you can turn up the number of replicas, but I don’t think they are used automatically. You may still have to do additional work, and my Kubernetes isn't good enough to guide you there.
But yes, turning on the replicas may be a good place to start.
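A sketch of what the replica override might look like, assuming the chart exposes per-component `replicas` keys (the key names and counts here are assumptions; confirm against `helm show values` for your chart version):

```yaml
# values.yaml override -- assumes the chart exposes per-component
# replica counts; confirm the exact keys with `helm show values`.
apollo:
  replicas: 3
graphql:
  replicas: 3
```

Extra replicas only help if the Kubernetes Service in front of them actually spreads traffic across the pods.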
o
Thanks a lot Kevin, I will give it a go
Adding more replicas and having the K8s Service load-balance between them reduced the failures but did not eliminate them. Still investigating; it's a very annoying issue, but it's not impacting our production environment. Will let you know if I manage to find out what the cause is 🙂 Thanks for your help
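While the root cause is unknown, one generic client-side mitigation is to wrap the call that hits the GraphQL endpoint in a retry with exponential backoff, so a transient ConnectTimeout doesn't fail the whole task. The sketch below is not Prefect's API: `with_retries` and `flaky_query` are invented names, and the simulated failure merely stands in for a transient connect timeout.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.

    Generic sketch for papering over transient connect timeouts; in real
    code you would catch only the connection-error types you expect
    (e.g. requests.exceptions.ConnectTimeout) rather than Exception.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))


# Demo: a call that fails twice with a simulated timeout, then succeeds.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated connect timeout")
    return "ok"


result = with_retries(flaky_query, attempts=5, base_delay=0)
print(result)  # prints "ok" after two simulated timeouts
```

The backoff doubles the wait between attempts (`base_delay`, then 2x, 4x, ...), which gives a briefly overloaded endpoint time to recover instead of hammering it with immediate retries.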