# ask-community
Thomas Nyegaard-Signori:
Hey community! I am seeing weird connection issues on Prefect. We are running relatively large flows (~2000 tasks), all of which are `CreateNameSpacedJob` tasks spawning jobs on our Kubernetes cluster. We are hosting our own Prefect Server using the docker-compose command directly on a moderately large VM (a Standard E8ds v5 VM from Azure). It seems as if the flow pod, the pod responsible for orchestrating the tasks, is losing its connection to the backend, specifically the `apollo` service, as can be seen in the first screenshot. All of a sudden, all `CreateNameSpacedJob` tasks fail at the same time when the `CloudTaskRunner` goes to update the state of the task. I did a bit of digging with `netstat` and it seems that quite a few TCP connections are being created in the `apollo` container; however, I am not entirely sure if that is "business as usual" or a bit on the heavy side for this kind of setup. Has anyone else experienced these kinds of hiccups, or are you using a similar setup and might have ideas? I don't know whether the second screenshot is relevant, but it has started to pop up quite a lot and I can't seem to figure out what's causing it.
Another screenshot of the netstat dump
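As a rough way to judge whether that connection count is unusual, the connections can be tallied by TCP state instead of scanning raw netstat output. This is only a sketch, assuming psutil is installed on the VM running docker-compose; it is not taken from the thread itself.

```python
# Sketch: tally TCP connections by state on the host running docker-compose,
# to get an overview of how many connections the apollo container is holding.
# Assumes psutil is installed (pip install psutil).
from collections import Counter

import psutil


def tcp_connections_by_state():
    """Return a Counter of TCP connection states (ESTABLISHED, TIME_WAIT, ...)."""
    conns = psutil.net_connections(kind="tcp")
    return Counter(conn.status for conn in conns)


if __name__ == "__main__":
    for state, count in tcp_connections_by_state().most_common():
        print(f"{state:15s} {count}")
```

A large and steadily growing TIME_WAIT or CLOSE_WAIT count would suggest connections are being churned or dropped rather than reused.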
Anna Geller:
@Thomas Nyegaard-Signori Your issue may be caused by the Azure Load Balancer dropping connections in the AKS Kubernetes cluster, as described here. The issue was already reported by the community in this thread - sharing in case you want to have a look. Main take-aways:
• Azure's advice was to increase the idle timeout on the load balancer, but according to our community member, this solution did not fix the problem and the connection reset still appeared after roughly 4 minutes.
• PR #5066 addressed the problem and was released in 0.15.8 - what Prefect version do you use?
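For reference, one quick way to confirm which Prefect version the flow-run image actually ships (and therefore whether the fix from PR #5066 is included) is a one-liner; this is just an illustration, not something from the original exchange.

```python
# Print the Prefect (1.x) version available inside the flow-run image,
# to confirm it is >= 0.15.8 and includes the fix from PR #5066.
import prefect

print(prefect.__version__)
```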
Thomas Nyegaard-Signori:
Hey Anna, nice to have some reading material to dig into! We are currently on `0.15.9`, so that fix should already be included. I'll try to bump the timeout anyway, just for safety's sake. I have also experienced other networking issues, mainly heartbeats losing connection and marking running tasks as failed. Have you heard anything about which network type to use on AKS? Currently we are using `Azure CNI`, but we have heard about some bad experiences internally when there is a lot of pod-to-pod traffic.
Anna Geller:
Sorry, not too familiar with Azure networking (yet). Regarding heartbeats losing connection: this usually happens when not enough memory is allocated to the Kubernetes jobs, or with long-running jobs. Switching to threads instead of processes often helps; you could try adding this env variable to your run config:
```python
from prefect.run_configs import UniversalRun

flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
```
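If the flow is already using a KubernetesRun run config, the same environment variable can be set there, together with explicit resource requests for the job. A minimal sketch assuming Prefect 1.x; the resource values and label below are placeholders, not recommendations from the thread.

```python
# Sketch (Prefect 1.x): set the heartbeat mode on a KubernetesRun run config
# together with explicit resource requests for the flow-run job.
# The resource values and label are placeholders, not values from the thread.
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
    memory_request="4Gi",
    memory_limit="8Gi",
    cpu_request="1",
    labels=["aks"],
)
```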
Thomas Nyegaard-Signori:
Okay, I'll see if changing to `thread` could help alleviate it. The k8s job is quite heavily over-allocated because of some errors we saw earlier with Dask executors, so it gets around 25 GB of RAM and 2 cores.
Anna Geller:
I see. LMK once you tried it 👍
Thomas Nyegaard-Signori:
Hey @Anna Geller, sadly the `thread` option didn't alleviate the problem. I'm wondering if it's just too much network traffic between pods for our setup to handle. I got some advice about using the `calico` network policy instead of the default on AKS, let's see if that helps.
Anna Geller:
Thanks for getting back on this. To be honest, I'm not sure if changing the network policy to `calico` can help here, because network policies are more about improving security than performance, right? If anything, this would restrict traffic for security reasons? But I'm not that familiar with that. From previous community members it seemed that the AKS load balancer was the culprit, so I'd definitely cross-check those settings. Btw, we have an internal issue open regarding lost flow heartbeats causing the flow run to be left in a Running state, so this is on our radar.
Thomas Nyegaard-Signori:
Cool, thanks for the update. Yeah, I can't say exactly why this should be helpful; our internal consultant said something about it being more lightweight when allocating network resources, but that's getting a little bit above my pay grade. I'll experiment with it.