Jason Bertman
03/23/2022, 5:26 PM
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1922, in set_task_run_state
    result = self.graphql(
  ...
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefecthq-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f12486259d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
This error would seem to indicate that the service can't even be resolved, but the DNS name resolves fine from the Dask pods, and many tasks succeed before a couple fail. I see signs of distress from the Apollo service in the form of:
BadRequestError: request aborted
    at IncomingMessage.onAborted (/apollo/node_modules/raw-body/index.js:231:10)
    at IncomingMessage.emit (events.js:315:20)
    at abortIncoming (_http_server.js:561:9)
    at socketOnClose (_http_server.js:554:3)
    at Socket.emit (events.js:327:22)
    at TCP.<anonymous> (net.js:673:12)
repeated pretty much as many times as the Dask pod tries it. Light research on this case points to the service being overwhelmed. Is this case covered by the request retrier here? https://github.com/PrefectHQ/prefect/blob/a2041c7ff1a619e611f614ed3394fccd05bb2005/src/prefect/client/client.py#L633
If not, what's the best way to handle this case? Any configuration changes I could be making to the Apollo pod?
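For context on the linked retrier: the usual requests/urllib3 pattern, which the linked client code appears to build on, is to construct a `Retry` policy and mount it on a session through an `HTTPAdapter`. A minimal sketch of that general pattern (illustrative values only, not the actual Prefect client configuration) looks like this:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Illustrative values only; not the actual Prefect client settings.
retry = Retry(
    total=6,
    backoff_factor=1,                       # exponential backoff between attempts
    status_forcelist=[429, 502, 503, 504],  # retry these HTTP status codes
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))
```

Status-based retries only cover requests that actually receive a response; how connection-level failures are handled depends on other `Retry` parameters such as `connect` and `read`, which is what the rest of the thread digs into.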
[Kevin Kho's reply and Jason Bertman's follow-up at 5:48 PM were not captured in this log.]
Jason Bertman
03/23/2022, 5:53 PM
resources:
  requests:
    cpu: "0.1"
    memory: "512Mi"
  limits:
    cpu: "2"
    memory: "4Gi"
[Kevin Kho's reply and the subsequent back-and-forth between 5:55 PM and 6:16 PM were not captured in this log.]
Jason Bertman
03/23/2022, 6:55 PM
…the `connect` option:
    connect (int) – How many connection-related errors to retry on. These are errors raised before the request is sent to the remote server, which we assume has not triggered the server to process the request. Set to 0 to fail on the first retry of this type.
This appears to be set to 0 by default, so perhaps adding a `connect` kwarg here would help to resolve the issue (I have not tested this yet, but may give it a shot).
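A sketch of what that untested change could look like, assuming the `Retry` object is the right place to hook in (the counts below are made up for illustration):

```python
from urllib3.util.retry import Retry

# Hypothetical values: allow a few retries on connection-related errors,
# i.e. failures raised before the request ever reaches the server.
retry = Retry(
    total=6,
    connect=3,
    backoff_factor=1,
)
```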
Kevin Kho
[reply not captured]

Jason Bertman
03/24/2022, 12:40 PM
…the `connect` kwarg. I figured it was still an issue, or it wasn't retrying, so I dug into the urllib3 code a bit more.
Starting from the initial stack trace, the exception being raised is a `ConnectionError`. I (incorrectly) assumed this was covered by the `connect` case. A bit counterintuitively, perhaps, the `read` kwarg catches `ProtocolError`: https://github.com/urllib3/urllib3/blob/37a67739ba9a7835886555744e9dabe3009e538a/src/urllib3/util/retry.py#L366
which I found to be aliased to `ConnectionError`: https://github.com/urllib3/urllib3/blob/9fbab29644bfc8068eb2082d0f25726c91813337/src/urllib3/exceptions.py#L84
meaning we should be using `read`, not `connect`. The flow is still running and has gotten past where it failed before. Hoping it's not just dumb luck 🤞
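To make the aliasing concrete, here is a small sketch (it assumes the urllib3 1.x line, where the `ConnectionError` alias still exists; the retry counts are illustrative):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.exceptions import ConnectionError as Urllib3ConnectionError, ProtocolError
from urllib3.util.retry import Retry

# In urllib3 1.x, ConnectionError is simply an alias of ProtocolError,
# and ProtocolError is retried under the "read" category rather than "connect".
assert Urllib3ConnectionError is ProtocolError

retry = Retry(
    total=6,
    read=3,     # covers ProtocolError (a.k.a. urllib3's ConnectionError)
    connect=3,  # covers errors raised before the request is sent at all
    backoff_factor=1,
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))
```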
Kevin Kho
…the `read` kwarg helps here? I can open the issue in a bit to document this. This sounds hard to know if it will take effect.
[The rest of the exchange, from 3:23 PM to 3:34 PM on 03/24/2022, was not captured in this log.]