# prefect-server
c
I have a flow run that I think has failed-ish - hop in the thread to check out the Prefect Cloud log message - can you help me understand what the message is telling me?
04:24:37 WARNING - CloudFlowRunner | Error getting flow run info
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connection.py", line 416, in connect
    self.sock = ssl_wrap_socket(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/home/ec2-user/SageMaker/.pyenv/versions/3.8.6/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/home/ec2-user/SageMaker/.pyenv/versions/3.8.6/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/home/ec2-user/SageMaker/.pyenv/versions/3.8.6/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:1124)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  [Previous line repeated 3 more times]
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: /graphql (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1124)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/engine/cloud/flow_runner.py", line 188, in interrupt_if_cancelling
    flow_run_info = self.client.get_flow_run_info(flow_run_id)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 1145, in get_flow_run_info
    result = self.graphql(query).data.flow_run_by_pk  # type: ignore
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 298, in graphql
    result = self.post(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 213, in post
    response = self._request(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 459, in _request
    response = self._send_request(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 351, in _send_request
    response = session.post(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/sessions.py", line 590, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: /graphql (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1124)')))
k
Hey, I believe this is related to a slight outage that we had on Oct 7 5:20-something ET that lasted for maybe 30 seconds. The certificates have been changed and this is not expected going forward.
m
What’s the reason this could happen in general? I’m getting a similar error; it doesn’t make my flow finish, it just stays stuck running forever. Here’s the log:
[2021-10-10 19:10:51+0000] WARNING - prefect.CloudFlowRunner | Error getting flow run info
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/http/client.py", line 1373, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/flow_runner.py", line 188, in interrupt_if_cancelling
    flow_run_info = self.client.get_flow_run_info(flow_run_id)
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1562, in get_flow_run_info
    result = self.graphql(query).data.flow_run_by_pk  # type: ignore
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 554, in graphql
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 458, in post
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 738, in _request
    session=session, method=method, url=url, params=params, headers=headers
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 606, in _send_request
    timeout=prefect.context.config.cloud.request_timeout,
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 590, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
I have started having situations like these quite often. The flow executes only 3
RunNamespacedJob
tasks in parallel. The first two finish, but from time to time the last one doesn’t get confirmation that it is done and gets stuck, even though the underlying k8s job finished. Maybe I need to increase this read timeout?
k
This is Server, right? We can try increasing the read timeout like you suggested. Have you seen how to do it? I can look if you’re not familiar
m
Yep, this is on Server 🙂 I haven’t checked how to do it yet, though 🙂 So if you already know where to look, that would be great 🙂
k
Yeah one sec
m
🙏
k
Check this. You need to set
prefect.context.config.cloud.request_timeout
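For reference, that setting lives under the `[cloud]` section of Prefect's config, so one way to raise it is in `~/.prefect/config.toml`. A rough sketch only; the value is just an example (the traceback above shows the default of 15 seconds):
```toml
# Sketch: raise the client's request (read) timeout, in seconds.
[cloud]
request_timeout = 60
```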
m
Thanks! So I guess I can set this as an env var on the k8s job definition for the k8s run config? For that it would then be
PREFECT__CONTEXT__CONFIG__CLOUD__REQUEST_TIMEOUT
? Sorry, I haven’t set any of these so far 🙂 I’m not sure if I should include
CONFIG
; this doc is a bit ambiguous 🙂
k
that looks right to me
m
I’ll try it tomorrow and let you know if it worked 🙂 Thanks again!
👍 1
I looked at how the env vars are set in the official Prefect helm chart and deduced that this one should be
PREFECT__CLOUD__REQUEST_TIMEOUT
🙂 I tested it and the value is set correctly. I’ll see if it helps to mitigate the problem 🙂
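Roughly what that ends up looking like on the KubernetesRun run config, as a sketch only; the flow name and the timeout value are placeholders:
```python
# Sketch: pass the override through the KubernetesRun run config so it lands
# as an env var on the flow-run job. Values below are placeholders.
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("my-flow") as flow:  # hypothetical flow
    pass

flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__REQUEST_TIMEOUT": "60"},  # 60s instead of the default 15s
)
```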
k
Oh ok
m
In the end, upon inspecting the code, I concluded that this timeout is not the cause of the problem. The logs above are just warnings and don’t cause the flow to be stuck. In this particular scenario, the flow was stuck in a running state because the
RunNamespacedJob
was stuck in a running state even though the underlying pod finished a long time ago. And these warnings only started coming about 5 hours after that, so they’re not the cause of the issue.
Any other ideas why this task might get stuck in the running state?
k
2 ideas. First, I saw one person had an issue with unclosed connections. Second, does it fail on a task mapped over Dask?
m
We use
LocalDaskExecutor("processes")
for executing these 3
RunNamespacedJob
tasks in parallel 🙂 Do you know how I could check this (unclosed connections)? Thanks 🙂
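For context, here’s a simplified sketch of that kind of setup; the job manifests and names are placeholders and the RunNamespacedJob arguments are assumed, not copied from the real flow:
```python
# Sketch: three RunNamespacedJob tasks running in parallel under a
# process-based LocalDaskExecutor (Prefect Server / 0.x-style flow).
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.tasks.kubernetes.job import RunNamespacedJob


def job_spec(name: str) -> dict:
    # Bare-bones placeholder Kubernetes Job manifest.
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": "busybox", "command": ["true"]}
                    ],
                    "restartPolicy": "Never",
                }
            },
            "backoffLimit": 0,
        },
    }


run_job = RunNamespacedJob()  # assumes cluster credentials are configured

with Flow("three-parallel-jobs") as flow:
    for name in ["job-a", "job-b", "job-c"]:
        run_job(body=job_spec(name), namespace="default")

flow.executor = LocalDaskExecutor(scheduler="processes")
```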
k
I do not. But there was someone in the community who said he had containers not shutting down, and he then found out it was a Snowflake connector that wasn’t closed in the code base. So I guess you need to check for something like that?
m
Hmmm 🙂 I’ll check it out. We do access Snowflake in these containers. But on the other hand, the k8s jobs finish (and the pods get deleted, meaning the containers shut down); it’s just the Prefect flow runner k8s job that doesn’t (and it still sees one task as running)…
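If it helps with that check, this is the kind of pattern to look for in the task code. Purely a sketch, with placeholder credentials:
```python
# Sketch: make sure Snowflake connections opened in task code are always
# closed, even on errors, so nothing lingers after the work is done.
# All connection parameters below are placeholders.
import snowflake.connector


def query_snowflake(sql: str):
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="my_password",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()  # runs even if execute() raises
```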