# prefect-community
j
Hey Community, Having an issue with the Apollo service in an EKS cluster. It works pretty well, until ~500-1000 mapped tasks are trying to update their task_run_state. The problem is that if even one fails, it gets stuck in a pending state and the flow just hangs. Eventually, the prefect-job pod exits and things continue to go sideways from there.
Copy code
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1922, in set_task_run_state
    result = self.graphql(
...
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefecthq-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f12486259d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
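As an aside, the `[Errno -2] Name or service not known` at the end of that traceback is the errno for a failed DNS lookup. A minimal way to reproduce or rule that out from inside a pod, using only the standard library (a hypothetical check, not part of the original report):
```python
# Hypothetical DNS sanity check for the Apollo service name; a socket.gaierror
# with [Errno -2] here matches the "Name or service not known" in the traceback.
import socket

try:
    addrs = socket.getaddrinfo("prefecthq-apollo.prefect", 4200)
    print("resolved to:", sorted({a[4][0] for a in addrs}))
except socket.gaierror as exc:
    print("DNS lookup failed:", exc)
```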
This error would seem to indicate that the service can't even be resolved, but this DNS name is perfectly resolvable from the dask pods, and many tasks succeed before a couple fail. I see signs of distress from the apollo service in the form of:
Copy code
BadRequestError: request aborted
    at IncomingMessage.onAborted (/apollo/node_modules/raw-body/index.js:231:10)
    at IncomingMessage.emit (events.js:315:20)
    at abortIncoming (_http_server.js:561:9)
    at socketOnClose (_http_server.js:554:3)
    at Socket.emit (events.js:327:22)
    at TCP.<anonymous> (net.js:673:12)
repeated pretty much as many times as the dask pod tries it. Light research on this case points to the service being overwhelmed. Is this case covered by the request retrier here? https://github.com/PrefectHQ/prefect/blob/a2041c7ff1a619e611f614ed3394fccd05bb2005/src/prefect/client/client.py#L633 If not, what's the best way to handle this case? Are there any configuration changes I could be making to the Apollo pod?
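For reference, retry handling in `requests`-based clients like this typically takes the form of a `urllib3` `Retry` policy mounted on a `requests.Session`. A rough sketch of that pattern (parameter values are illustrative, not necessarily what the linked Prefect code passes):
```python
# Sketch of a requests Session wrapped with a urllib3 Retry policy; the values
# here are illustrative, not necessarily the ones used by the linked client code.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=6,
    backoff_factor=1,                       # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],  # retried based on the HTTP status code
)
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))

# status_forcelist only matters once a response actually comes back; failures
# while connecting or reading are governed by Retry's connect/read counters.
```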
k
Hey @Jason Bertman, this does sound like an overload of the apollo service. You might need to bump up the resources of that pod so it can handle this burst of requests
j
Any recommendations in terms of resource alloc?
k
I don’t know immediately. But can you see the current pod specs? Also, could you trim the traceback a bit when you get the chance, so we can keep the main channel tidier?
j
These are the current resource requests set in the helm chart:
Copy code
resources:
    requests:
      cpu: "0.1"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "4Gi"
Sure, will clean it up, one sec
k
Hard to gauge honestly. Unless someone who knows more than me chimes in, I would naively either double it or lessen the burst of the workload
j
rgr, any chance you know if the set task run state handler has a retrier on it? I'd imagine a simple solution would be to just retry/backoff a bit on the state change.
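To make the "retry/backoff a bit on the state change" idea concrete, here is a hypothetical call-site wrapper (nothing Prefect ships; `update_state` just stands in for whatever actually performs the GraphQL call):
```python
# Hypothetical retry-with-backoff wrapper around a state update call;
# `update_state` is a stand-in for the function that actually hits /graphql.
import time
import requests

def set_state_with_backoff(update_state, attempts=5, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return update_state()
        except requests.exceptions.ConnectionError:
            if attempt == attempts - 1:
                raise                              # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... between tries
```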
k
Ah yeah, I saw that link you posted. It’s the first time I’ve seen it; it seems like there is a retrier, but this literally exceeded the max value
j
maybe I'm misreading it, but it seemed like it only retried for particular error codes, I wasn't certain if this one was being retried or not. I'd ... hope it would be, considering the operation's sensitivity
k
You are right if the request gets sent out, for sure. Is that the case here, though? Because this seems to happen when trying to make a connection to begin with
j
Which ultimately is just a graphql -> POST -> _request call, which is where that initial retry code is
k
I know what you are saying, but I mean this error is like “can’t even connect”, so I don’t know if this even gets a status code, because the API request did not go through. Does that make sense?
j
Right exactly - if there's no status code, I'm not certain it gets retried
k
Ah ok yeah, we are on the same page. I don’t know though if this lack of a connection can be retried natively with the requests module. I think we might be interested in accepting a PR if you have ideas
j
I suppose my argument, if there is to be one, is that maybe we should retry it if we can. Though it's fair to say that if you run into a case like this (no response code due to network error) you'd reason that the service is unreachable and shouldn't be retried...
k
Yeah I understand. If it’s retry-able, we probably could, up to a reasonable max. I can make an open-ended issue for this in a bit on the server repo
j
Seems reasonable to me. Thanks for making the issue, I appreciate it! Obviously you know best, but would the issue more than likely live on the client side (the prefect proper repo) since that is where the retries would take place?
I've got a Prometheus monitor up right now, and am trying to trigger the state again, hopefully I can capture something meaningful
k
You might be right on that. I’ll make one, and one of the engineers can move it if need be. I will open this as a request. No promises though on whether it gets picked up
j
Fair enough, thanks for the help! I will report back if I can reason anything out with the resourcing
👍 1
k
Thanks for your understanding!
j
@Kevin Kho worth noting, a little digging into the urllib3 retrier shows that there is a `connect` option:
Copy code
connect (int) –

How many connection-related errors to retry on.

These are errors raised before the request is sent to the remote server, which we assume has not triggered the server to process the request.

Set to 0 to fail on the first retry of this type.
this appears to be set to 0 by default, so perhaps adding a `connect` kwarg here would help to resolve the issue (I have not tested this yet, but may give it a shot)
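Sketched out, that untested idea would look roughly like the following (values are illustrative, and this is not the actual Prefect code):
```python
# Illustrative only: give connection-establishment errors their own retry budget
# via `connect`, per the urllib3 docs quoted above. Not the actual Prefect code.
from urllib3.util.retry import Retry

retry = Retry(
    total=6,
    connect=3,    # retry errors raised before the request reaches the server
    backoff_factor=1,
    status_forcelist=[500, 502, 503, 504],
)
```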
k
Yeah it would be good if you could give it a shot
👍 1
j
An update this morning after trying out the above: There was no change after using the `connect` kwarg. I figured it was still an issue, or it wasn't retrying, so I dug into the urllib3 code a bit more. Starting at the initial stack trace, the exception being raised is a `ConnectionError`. I (incorrectly) assumed this was covered by the `connect` case. A bit counterintuitively perhaps, the `read` kwarg catches `ProtocolError`: https://github.com/urllib3/urllib3/blob/37a67739ba9a7835886555744e9dabe3009e538a/src/urllib3/util/retry.py#L366 which I found to be aliased to `ConnectionError`: https://github.com/urllib3/urllib3/blob/9fbab29644bfc8068eb2082d0f25726c91813337/src/urllib3/exceptions.py#L84 meaning we should be using `read`, not `connect`. The flow is still running, and has gotten past where it has failed before. Hoping it's not just dumb luck 🤞
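For concreteness, the switch described above would look something like this (illustrative values again, not the exact one-line diff):
```python
# Illustrative version of the change described above: `read` covers errors such
# as urllib3's ProtocolError, which the urllib3 ConnectionError is an alias of.
from urllib3.util.retry import Retry

retry = Retry(
    total=6,
    read=3,       # retry the ProtocolError / aliased ConnectionError case
    backoff_factor=1,
    status_forcelist=[500, 502, 503, 504],
)
```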
k
So confirming that adding the `read` kwarg helps here? I can open the issue in a bit to document this. It does sound hard to know whether it will take effect
j
The flow just finished without any issues with >15K state lifecycles, so I'm thinking this fix will work for us. I can open up a PR; it's just a one-line change.
k
Oh yeah that would work, and then could you detail the issue so the core engineers understand? Cuz I don’t handle those
j
No problem, I will also link this thread 👍
k
Thank you! The core team will see it
👍 1