Hey community! I am seeing weird connection issues on Prefect. We are running relatively large flows (~2000 tasks), all of which are `CreateNamespacedJob` tasks that spawn jobs on our Kubernetes cluster. We are hosting our own Prefect Server using the docker-compose command directly on a moderately large VM (Azure Standard E8ds v5).
It seems as if the flow pod, the pod responsible for orchestrating the tasks, is losing its connection to the backend, specifically the `apollo` service, as can be seen in the first screenshot. All of a sudden, all `CreateNamespacedJob` tasks fail at the same time when the `CloudTaskRunner` goes to update the state of the task. I did a bit of digging with `netstat`, and it seems that quite a few TCP connections are being created in the `apollo` container. However, I am not sure whether that is "business as usual" or a bit on the heavy side for this kind of setup. Has anyone else experienced these kinds of hiccups, or is anyone running a similar setup who might have ideas? I don't know whether the second screenshot is relevant, but it has started to pop up quite a lot and I can't figure out what's causing it.
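
To make the `netstat` numbers easier to compare between runs, a minimal sketch like the one below could tally TCP connections by state. It assumes Linux net-tools output as produced by `netstat -tan`, captured as text from inside the container. The function name `count_tcp_states` and the sample rows are hypothetical and are not real output from our setup.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally TCP connections by state from `netstat -tan`-style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol (tcp/tcp6); the state is the sixth column.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical sample for illustration only, not captured from the apollo container.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:4200            0.0.0.0:*               LISTEN
tcp        0      0 172.18.0.5:4200         172.18.0.1:51234        ESTABLISHED
tcp        0      0 172.18.0.5:4200         172.18.0.1:51236        TIME_WAIT
tcp        0      0 172.18.0.5:4200         172.18.0.1:51238        TIME_WAIT
"""

if __name__ == "__main__":
    # Print states from most to least frequent.
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(f"{state}: {count}")
```

Comparing these counts while the flow is healthy and again right before the tasks fail might show whether the connection count is actually unusual, though that is only a guess on my part.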