# prefect-community
o
Hi everyone, we're using Prefect Server v1.2.2 on premises, hosted on Kubernetes, and everything is working perfectly. However, every now and then we get the occasional error:
Error during execution of task: ConnectTimeout(MaxRetryError("HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f32b0d48910>, 'Connection to prefect-apollo.prefect timed out. (connect timeout=60)'))"))
This seems to affect all tasks running at the time: for example, if I have 2 or 3 tasks running, they all fail at the exact same moment with this error.
k
I think this is a sign your graphql pod is resource-constrained. You can try increasing its resources.
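For a Helm-based install, bumping the graphql component's resources might look roughly like the values override below. This is a sketch, not the chart's exact schema: the `graphql.resources` key layout and the numbers are assumptions, so verify against `helm show values` for your chart version.

```yaml
# values.yaml override -- a sketch, not the chart's exact schema.
# The graphql/resources key layout is an assumption; confirm it with
# `helm show values` for the Prefect Server chart version you run.
graphql:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
```

Applied with something like `helm upgrade <release> <chart> -f values.yaml`.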
o
Hi Kevin, would adding more replicas help? I monitored the container resource usage and they don't seem to be anywhere near the limits of the node they're running on, so I was thinking of adding additional replicas via the Helm config. What do you think?
k
Do you have a Flow with a lot of tasks? Or maybe the database has too much data?
o
The biggest flow has around 10-12 tasks; some of them run other flows, so it's acting more like an orchestrator. I tried deleting the data completely from the database, so that there is no history at all, but the issue is still happening.
k
That’s very weird if there’s no data and requests are still taking 60 seconds. Does it happen when you run one flow at a time, or only with concurrency?
o
Most of the time it's when there are multiple flows running, but it also happened when there was only one flow running. Do you think replicas for the Apollo and GraphQL pods could help?
Hi Kevin, just wanted to check if you have any thoughts on the replicas approach
k
Oh, sorry for my late response here. I’ve been hesitating because yes, you can turn up the number of replicas, but I don’t think they are used automatically. You may still have to do additional work, and my Kubernetes isn't good enough to guide you there.
But yes, turning on the replicas may be a good place to start.
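A sketch of what the replica override might look like, assuming the chart exposes per-component `replicas` keys (the key names and counts here are assumptions; confirm against `helm show values` for your chart version):

```yaml
# values.yaml override -- assumes the chart exposes per-component
# replica counts; confirm the exact keys with `helm show values`.
apollo:
  replicas: 3
graphql:
  replicas: 3
```

Extra replicas only help if the Kubernetes Service in front of them actually spreads traffic across the pods.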
o
Thanks a lot Kevin, I will give it a go
Adding more replicas and having the K8s Service load-balance between them reduced the failures but did not eliminate them. Still investigating; it's a very annoying issue, but it's not impacting our production environment. Will let you know if I manage to find out what the cause is 🙂 Thanks for your help
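While the root cause is unknown, one generic client-side mitigation is to wrap the call that hits the GraphQL endpoint in a retry with exponential backoff, so a transient ConnectTimeout doesn't fail the whole task. The sketch below is not Prefect's API: `with_retries` and `flaky_query` are invented names, and the simulated failure merely stands in for a transient connect timeout.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.

    Generic sketch for papering over transient connect timeouts; in real
    code you would catch only the connection-error types you expect
    (e.g. requests.exceptions.ConnectTimeout) rather than Exception.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))


# Demo: a call that fails twice with a simulated timeout, then succeeds.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated connect timeout")
    return "ok"


result = with_retries(flaky_query, attempts=5, base_delay=0)
print(result)  # prints "ok" after two simulated timeouts
```

The backoff doubles the wait between attempts (`base_delay`, then 2x, 4x, ...), which gives a briefly overloaded endpoint time to recover instead of hammering it with immediate retries.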