# ask-community
Thomas Nyegaard-Signori:
Hey community! I am seeing weird connection issues on Prefect. We are running relatively large flows (~2000 tasks), all of which are `CreateNameSpacedJob` tasks spawning jobs on our Kubernetes cluster. We are hosting our own Prefect Server using the docker-compose command directly on a moderately large VM (a Standard E8ds v5 VM from Azure). It seems as if the flow pod, the pod responsible for orchestrating the tasks, is losing its connection to the backend, specifically the `apollo` service, as can be seen in the first screenshot. All of a sudden, all `CreateNameSpacedJob` tasks fail at the same time when the `CloudTaskRunner` goes to update the state of the task. I did a bit of digging with `netstat` and it seems that quite a few TCP connections are being created in the `apollo` container; however, I am not entirely sure if that is "business as usual" or a bit on the heavy side for this kind of setup. Has anyone else experienced these kinds of hiccups, or are you using a similar setup and might have ideas? I don't know whether the second screenshot is relevant, but it has started to pop up quite a lot and I can't seem to figure out what's causing it.
Another screenshot of the netstat dump
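As a rough way to judge whether that connection count is unusual, the connections can be tallied by TCP state instead of scanning raw netstat output. This is only a sketch, assuming psutil is installed on the VM running docker-compose; it is not taken from the thread itself.

```python
# Sketch: tally TCP connections by state on the host running docker-compose,
# to get an overview of how many connections the apollo container is holding.
# Assumes psutil is installed (pip install psutil).
from collections import Counter

import psutil


def tcp_connections_by_state():
    """Return a Counter of TCP connection states (ESTABLISHED, TIME_WAIT, ...)."""
    conns = psutil.net_connections(kind="tcp")
    return Counter(conn.status for conn in conns)


if __name__ == "__main__":
    for state, count in tcp_connections_by_state().most_common():
        print(f"{state:15s} {count}")
```

A large and steadily growing TIME_WAIT or CLOSE_WAIT count would suggest connections are being churned or dropped rather than reused.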
Anna Geller:
@Thomas Nyegaard-Signori Your issue may be caused by the Azure Load Balancer dropping connections in the AKS Kubernetes cluster, as described here. The issue was already reported by the community in this thread - sharing in case you want to have a look. Main take-aways:
• Azure's advice was to increase the idle timeout on the load balancer, but according to our community member, this solution did not fix the problem and the connection reset still appeared after roughly 4 minutes.
• PR #5066 addressed the problem and was released in 0.15.8 - what Prefect version do you use?
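For reference, one quick way to confirm which Prefect version the flow-run image actually ships (and therefore whether the fix from PR #5066 is included) is a one-liner; this is just an illustration, not something from the original exchange.

```python
# Print the Prefect (1.x) version available inside the flow-run image,
# to confirm it is >= 0.15.8 and includes the fix from PR #5066.
import prefect

print(prefect.__version__)
```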
Thomas Nyegaard-Signori:
Hey Anna, nice to have some reading material to dig into! We are currently on `0.15.9`, so that fix should already be included. I'll try to bump the timeout anyway, just for safety's sake. I have also experienced other networking issues, mainly heartbeats losing connection and marking running tasks as failed. Have you heard anything about which network type to use on AKS? Currently we are using `Azure CNI`, but we have heard about some bad experiences internally when there is a lot of pod-to-pod traffic.
Anna Geller:
Sorry, not too familiar with Azure networking (yet). Regarding heartbeats losing connection: this usually happens when not enough memory is allocated to the Kubernetes jobs, or with long-running jobs. Switching to threads instead of processes often helps; you could try adding this env variable to your run config:
```python
from prefect.run_configs import UniversalRun

flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
```
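If the flow is already using a KubernetesRun run config, the same environment variable can be set there, together with explicit resource requests for the job. A minimal sketch assuming Prefect 1.x; the resource values and label below are placeholders, not recommendations from the thread.

```python
# Sketch (Prefect 1.x): set the heartbeat mode on a KubernetesRun run config
# together with explicit resource requests for the flow-run job.
# The resource values and label are placeholders, not values from the thread.
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
    memory_request="4Gi",
    memory_limit="8Gi",
    cpu_request="1",
    labels=["aks"],
)
```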
Thomas Nyegaard-Signori:
Okay, I'll see if changing to `thread` could help alleviate it. The k8s job is quite heavily over-allocated because of some errors we saw earlier with Dask executors, so it gets around 25 GB of RAM and 2 cores.
Anna Geller:
I see. LMK once you tried it 👍
Thomas Nyegaard-Signori:
Hey @Anna Geller, sadly the `thread` option didn't alleviate the problem. I'm wondering if it's just too much network traffic between pods for our setup to handle. I got some advice about using the `calico` network policy instead of the default on AKS, let's see if that helps.
Anna Geller:
Thanks for getting back on this. To be honest, I'm not sure if changing the network policy to `calico` can help here, because network policies are more about improving security than performance, right? If anything, this would restrict traffic for security reasons? But I'm not that familiar with that. From previous community members it seemed that the AKS load balancer was the culprit, so I'd definitely cross-check those settings. Btw, we have an internal issue open regarding lost flow heartbeats causing the flow run to be left in a Running state, so this is on our radar.
Thomas Nyegaard-Signori:
Cool, thanks for the update. Yeah, I can't say exactly why this should be helpful; our internal consultant said something about it being more lightweight when allocating network resources, but that's getting a little bit above my pay grade. I'll experiment with it.