# prefect-server
a
Hi Prefecters, I'm getting some flows stuck in the Submitted state, perhaps one in ten or so. I can't see a solution here in Slack. I'm running Prefect Server in Kubernetes with a Kubernetes agent; more details in the thread.
I can see a bunch of exceptions in the agent like this:
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='XX.XXX.XX.XXX', port=443): Read timed out. (read timeout=None)
The IP there is actually the IP of the Kubernetes service itself in the default namespace.
The original line in the stack trace is:
Copy code
File "/usr/local/.venv/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 433, in deploy_flow
    self.batch_client.create_namespaced_job(
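A rough way to see how many runs are sitting like this is to query the Server GraphQL API for flow runs still in the Submitted state; the endpoint and the Hasura-style filter below are assumptions and may need tweaking for your Server version.
Copy code
import requests

# Assumed default Apollo endpoint for a local Prefect Server; adjust for your deployment.
GRAPHQL_URL = "http://localhost:4200/graphql"

# Hasura-style query: list flow runs whose state is still "Submitted".
QUERY = """
query {
  flow_run(where: {state: {_eq: "Submitted"}}) {
    id
    name
    state
    scheduled_start_time
  }
}
"""

resp = requests.post(GRAPHQL_URL, json={"query": QUERY}, timeout=10)
resp.raise_for_status()
for run in resp.json()["data"]["flow_run"]:
    print(run["id"], run["name"], run["scheduled_start_time"])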
k
Oh man, I was just searching and came across this. Do feel free to ping me if I don’t get back to you in a day. Did you get past this?
Is that an Apollo timeout or a k8s API timeout?
a
I still see them intermittently; the last one was a few hours ago. How can I tell if it's Apollo or if it's k8s?
k
Does your error look like his here?
a
Hmmm, not like that; stack trace below:
Copy code
[2021-10-12 01:05:00,139] WARNING - Prefect-Kubed | Error submitting job prefect-job-f48950af, retrying...
Traceback (most recent call last):
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/.venv/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 433, in deploy_flow
    self.batch_client.create_namespaced_job(
  File "/usr/local/.venv/lib/python3.8/site-packages/kubernetes/client/api/batch_v1_api.py", line 66, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/.venv/lib/python3.8/site-packages/kubernetes/client/api/batch_v1_api.py", line 161, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/.venv/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/.venv/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/.venv/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/.venv/lib/python3.8/site-packages/kubernetes/client/rest.py", line 274, in POST
    return self.request("POST", url,
  File "/usr/local/.venv/lib/python3.8/site-packages/kubernetes/client/rest.py", line 167, in request
    r = self.pool_manager.request(
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
k
Looks like the k8s job is failing. Hard to say why. Do you have enough resources? Does the pod start, and then this happens? It could be a situation like this? Is it the same flow that errors out?
a
I'm using Prefect Server, so all the comms between components happen within the one namespace in Kubernetes. My reading of the error you posted is that it comes from Prefect Cloud trying to reach back into a K8s cluster; is that right?
k
The guy I linked, whose logs I asked if yours matched, is on Server as well. If you’re talking about this Stack Overflow post, I was more asking whether something was potentially being killed due to going idle.
I’ll ping a heavy k8s user on the Prefect team and try to get some insight
👍 1
j
Hi @Aiden Price, I stumbled across the exact same issue in the last few days: some tasks were stuck in the Submitted state for 15 minutes until they were finally picked up by Lazarus. Are you running the Kubernetes cluster on AKS? Then it could be related to the Azure Load Balancer dropping connections, as described here: https://github.com/PrefectHQ/prefect/pull/3344#issuecomment-696643851
a
Yes, I am running on AKS. So does that apply to internal cluster comms between different services in the same namespace? The AKS issue referred to in the Prefect issue says it's fixed, but I'm still seeing the problem in my cluster.
j
Azure's advice was to increase the idle timeout on the load balancer (4 minutes by default, can be increased up to 30 minutes); however, I couldn't see any effect: the connection reset always appeared after 4 minutes and a few seconds (I inspected the packets with tcpdump).
Copy code
import logging
import socket
import time

from kubernetes import config
from urllib3.connection import HTTPConnection, HTTPSConnection

# socket.setdefaulttimeout(10)

logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    level=logging.DEBUG)

socket_options = [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable TCP keep-alive probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),  # seconds between probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6),     # failed probes before dropping the connection
    (socket.IPPROTO_TCP, 0x10, 30),                  # 0x10 = TCP_KEEPALIVE on macOS (idle seconds before probing)
]

try:
    # TCP_KEEPIDLE (idle seconds before the first probe) is only defined on some platforms, e.g. Linux
    socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120))
except AttributeError:
    pass

HTTPSConnection.default_socket_options = HTTPSConnection.default_socket_options + socket_options
HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + socket_options

config.load_kube_config()

from kubernetes import client

v1 = client.CoreV1Api()

while True:
    logging.info('Listing config maps')
    v1.list_namespaced_config_map('default', _request_timeout=15)
    logging.info('OK')
    logging.info('Sleeping 6 minutes')
    time.sleep(360)
This is the script I used for testing. Without the socket_options I can see a retry on the next request due to the ReadTimeoutError. With the socket_options, a keep-alive connection is established, so the connection won't be closed by the remote server. I just created a PR to fix this: https://github.com/PrefectHQ/prefect/pull/5066
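Until that lands, here is a minimal sketch of applying the same workaround process-wide before the agent builds its Kubernetes clients (not the PR's actual code; the KubernetesAgent import path and arguments are assumptions for Prefect <= 1.x):
Copy code
import socket

from urllib3.connection import HTTPConnection, HTTPSConnection

# Same keep-alive options as in the test script above.
keepalive_options = [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6),
]
if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-only constant
    keepalive_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120))

# Patch urllib3's defaults so every new connection (including the ones the
# kubernetes client opens to the API server) sends TCP keep-alive probes.
HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + keepalive_options
HTTPSConnection.default_socket_options = HTTPSConnection.default_socket_options + keepalive_options

# Then start the agent as usual (import path and namespace assumed, Prefect <= 1.x).
from prefect.agent.kubernetes import KubernetesAgent

KubernetesAgent(namespace="default").start()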
k
Wow, thanks for the PR. The core team will look at it.
a
Nice work, very community-focused, well done.
👍 2