
Omar Sultan

06/15/2022, 5:57 PM
Hi Everyone, we're using Prefect Server on premise V1.2.2 hosted on Kubernetes, and everything is working perfectly. However, every now and then we get the occasional error:
Error during execution of task: ConnectTimeout(MaxRetryError("HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f32b0d48910>, 'Connection to prefect-apollo.prefect timed out. (connect timeout=60)'))"))
This seems to affect all tasks running at the time: for example, if I have 2 or 3 tasks running, they all fail at the exact same time with this error.

Kevin Kho

06/15/2022, 6:00 PM
I think this is a sign your graphql pod is resource constrained. You can try increasing the resources
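For context, the prefect-server Helm chart lets you set a resources block per component; a hedged sketch of what raising the Apollo and GraphQL limits might look like in a values override (the key paths and numbers here are illustrative assumptions, not verified against this chart version — check `helm show values` for the chart in use):

```yaml
# values.yaml override -- key paths are assumptions for illustration
apollo:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
graphql:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
```

This would then be applied with something like `helm upgrade <release> prefecthq/prefect-server -f values.yaml` in the release's namespace.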

Omar Sultan

06/15/2022, 10:31 PM
Hi Kevin, would adding more replicas help? I monitored the container resource usage and they don't seem to be anywhere near the limits of the node they are running on. So I was thinking of maybe adding additional replicas using the Helm config. What do you think?

Kevin Kho

06/16/2022, 3:00 AM
Do you have a Flow with a lot of tasks? Or maybe the database has too much data?

Omar Sultan

06/16/2022, 5:58 AM
The biggest flow has around 10-12 tasks, some of which are tasks that run other flows, so it's acting more like an orchestrator. I tried deleting the data completely from the database so that there is no history at all, but the issue is still happening.

Kevin Kho

06/16/2022, 2:24 PM
That’s very weird if there’s already no data and requests are taking 60 seconds. Does it happen when you run one flow at a time, or only with concurrency?

Omar Sultan

06/16/2022, 9:00 PM
Most of the time it's when there are multiple flows running, but it has also happened with just one flow running. Do you think replicas for the Apollo and GraphQL pods could help?
Hi Kevin, just wanted to check if you have any thoughts on the replicas approach.

Kevin Kho

06/22/2022, 5:24 PM
Oh, sorry for my late response here. I’ve been hesitating because yes, you can increase the number of replicas, but I don’t think they are used automatically. I think you may still have to do additional work, but my Kubernetes isn’t good enough to guide you there.
But yes, turning on the replicas may be a good place to start.
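A hedged sketch of what enabling replicas in the Helm values might look like (the exact key paths are assumptions for illustration; once the extra pods are up, the chart's ClusterIP Services should spread connections across them, which matches the load-balancing approach described later in the thread):

```yaml
# values.yaml override -- key paths are assumptions, verify against the chart's values
apollo:
  replicas: 3
graphql:
  replicas: 2
```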

Omar Sultan

06/25/2022, 3:46 AM
Thanks a lot Kevin, I will give it a go
Adding more replicas and having the Kubernetes Service load-balance between them reduced the failures but did not eliminate them. Still trying to investigate; it's a very annoying issue, but it's not impacting our production environment. Will let you know if I manage to find out what the issue is 🙂 Thanks for your help!