# prefect-server
c
I have a flow run that I think has failed-ish - hop in the thread to check out the Prefect Cloud log message - can you help me understand what the message is telling me?
04:24:37 WARNING - CloudFlowRunner | Error getting flow run info
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connection.py", line 416, in connect
    self.sock = ssl_wrap_socket(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/home/ec2-user/SageMaker/.pyenv/versions/3.8.6/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/home/ec2-user/SageMaker/.pyenv/versions/3.8.6/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/home/ec2-user/SageMaker/.pyenv/versions/3.8.6/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:1124)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  [Previous line repeated 3 more times]
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: /graphql (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1124)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/engine/cloud/flow_runner.py", line 188, in interrupt_if_cancelling
    flow_run_info = self.client.get_flow_run_info(flow_run_id)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 1145, in get_flow_run_info
    result = self.graphql(query).data.flow_run_by_pk  # type: ignore
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 298, in graphql
    result = self.post(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 213, in post
    response = self._request(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 459, in _request
    response = self._send_request(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/prefect/client/client.py", line 351, in _send_request
    response = session.post(
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/sessions.py", line 590, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/home/ec2-user/SageMaker/PGE-Dx-Risk/.venv/lib/python3.8/site-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: /graphql (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1124)')))
k
Hey, I believe this is related to a slight outage that we had on Oct 7 5:20-something ET that lasted for maybe 30 seconds. The certificates have been changed and this is not expected going forward.
m
What’s the reason this could happen in general? I’m getting a similar error; it doesn’t make my flow finish, it just stays stuck running forever. Here’s the log:
[2021-10-10 19:10:51+0000] WARNING - prefect.CloudFlowRunner | Error getting flow run info
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/http/client.py", line 1373, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/flow_runner.py", line 188, in interrupt_if_cancelling
    flow_run_info = self.client.get_flow_run_info(flow_run_id)
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1562, in get_flow_run_info
    result = self.graphql(query).data.flow_run_by_pk  # type: ignore
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 554, in graphql
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 458, in post
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 738, in _request
    session=session, method=method, url=url, params=params, headers=headers
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 606, in _send_request
    timeout=prefect.context.config.cloud.request_timeout,
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 590, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
I have started having situations like these quite often. The flow executes only 3
RunNamespacedJob
tasks in parallel. The first two finish, but from time to time the last one doesn’t get confirmation that it is done and gets stuck, even though the underlying k8s job finished. Maybe I need to increase this read timeout?
k
This is Server, right? We can try increasing the read timeout like you suggested. Have you seen how to do it? I can look if you’re not familiar
m
Yep, this is on Server 🙂 I haven’t checked how to do it yet, though 🙂 So if you already know where to look, that would be great 🙂
k
Yeah one sec
m
🙏
k
Check this. You need to set
prefect.context.config.cloud.request_timeout
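For reference, that setting lives under the `[cloud]` section of Prefect's config, so one way to raise it is in `~/.prefect/config.toml`. A rough sketch only; the value is just an example (the traceback above shows the default of 15 seconds):
```toml
# Sketch: raise the client's request (read) timeout, in seconds.
[cloud]
request_timeout = 60
```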
m
Thanks! So I guess I can set this as an env var on the k8s job definition for the k8s run config? For that it would then be
PREFECT__CONTEXT__CONFIG__CLOUD__REQUEST_TIMEOUT
? Sorry, I haven’t set any of these so far 🙂 I’m not sure if I should include
CONFIG
; this doc is a bit ambiguous 🙂
k
that looks right to me
m
I’ll try it tomorrow and let you know if it worked 🙂 Thanks again!
👍 1
I looked at how the env vars are set in the official Prefect helm chart and deduced that this one should be
PREFECT__CLOUD__REQUEST_TIMEOUT
🙂 I tested it and the value is set correctly. I’ll see if it helps to mitigate the problem 🙂
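Roughly what that ends up looking like on the KubernetesRun run config, as a sketch only; the flow name and the timeout value are placeholders:
```python
# Sketch: pass the override through the KubernetesRun run config so it lands
# as an env var on the flow-run job. Values below are placeholders.
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("my-flow") as flow:  # hypothetical flow
    pass

flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__REQUEST_TIMEOUT": "60"},  # 60s instead of the default 15s
)
```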
k
Oh ok
m
In the end, upon inspecting the code, I concluded that this timeout is not the cause of the problem. The logs above are just warnings and don’t cause the flow to be stuck. In this particular scenario, the flow was stuck in a running state because the
RunNamespacedJob
was stuck in a running state even though the underlying pod finished a long time ago. And these warnings only started coming about 5 hours after that, so they’re not the cause of the issue.
Any other ideas why this task might get stuck in the running state?
k
2 ideas. First, I saw one person had an issue with unclosed connections. Second, does it fail on a task mapped over Dask?
m
We use
LocalDaskExecutor("processes")
for executing these 3
RunNamespacedJob
tasks in parallel 🙂 Do you know how I could check this (unclosed connections)? Thanks 🙂
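For context, here’s a simplified sketch of that kind of setup; the job manifests and names are placeholders and the RunNamespacedJob arguments are assumed, not copied from the real flow:
```python
# Sketch: three RunNamespacedJob tasks running in parallel under a
# process-based LocalDaskExecutor (Prefect Server / 0.x-style flow).
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.tasks.kubernetes.job import RunNamespacedJob


def job_spec(name: str) -> dict:
    # Bare-bones placeholder Kubernetes Job manifest.
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": "busybox", "command": ["true"]}
                    ],
                    "restartPolicy": "Never",
                }
            },
            "backoffLimit": 0,
        },
    }


run_job = RunNamespacedJob()  # assumes cluster credentials are configured

with Flow("three-parallel-jobs") as flow:
    for name in ["job-a", "job-b", "job-c"]:
        run_job(body=job_spec(name), namespace="default")

flow.executor = LocalDaskExecutor(scheduler="processes")
```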
k
I do not. But there was someone in the community who said he had containers not shutting down, and he then found out it was a Snowflake connector that wasn’t closed in the code base. So I guess you need to check for something like that?
m
Hmmm 🙂 I’ll check it out. We do access Snowflake in these containers. But on the other hand, the k8s jobs finish (and the pods get deleted, meaning the containers shut down); it’s just the Prefect flow runner k8s job that doesn’t (and it still sees one task as running)…
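If it helps with that check, this is the kind of pattern to look for in the task code. Purely a sketch, with placeholder credentials:
```python
# Sketch: make sure Snowflake connections opened in task code are always
# closed, even on errors, so nothing lingers after the work is done.
# All connection parameters below are placeholders.
import snowflake.connector


def query_snowflake(sql: str):
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="my_password",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()  # runs even if execute() raises
```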