# prefect-community
j
Hey Community, Having an issue with the Apollo service in an EKS cluster. It works pretty well, until ~500-1000 mapped tasks are trying to update their task_run_state. The problem is that if even one fails, it gets stuck in a pending state and the flow just hangs. Eventually, the prefect-job pod exits and things continue to go sideways from there.
Copy code
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1922, in set_task_run_state
    result = self.graphql(
...
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefecthq-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f12486259d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
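As an aside, the `[Errno -2] Name or service not known` at the end of that traceback is the errno for a failed DNS lookup. A minimal way to reproduce or rule that out from inside a pod, using only the standard library (a hypothetical check, not part of the original report):
```python
# Hypothetical DNS sanity check for the Apollo service name; a socket.gaierror
# with [Errno -2] here matches the "Name or service not known" in the traceback.
import socket

try:
    addrs = socket.getaddrinfo("prefecthq-apollo.prefect", 4200)
    print("resolved to:", sorted({a[4][0] for a in addrs}))
except socket.gaierror as exc:
    print("DNS lookup failed:", exc)
```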
This error would seem to indicate that the service can't even be resolved, but this DNS name is perfectly resolvable from the dask pods, and many tasks succeed before a couple fail. I see signs of distress from the apollo service in the form of:
Copy code
BadRequestError: request aborted
    at IncomingMessage.onAborted (/apollo/node_modules/raw-body/index.js:231:10)
    at IncomingMessage.emit (events.js:315:20)
    at abortIncoming (_http_server.js:561:9)
    at socketOnClose (_http_server.js:554:3)
    at Socket.emit (events.js:327:22)
    at TCP.<anonymous> (net.js:673:12)
repeated pretty much as many times as the dask pod tries it. Light research on this case points to the service being overwhelmed. Is this case covered by the request retrier here? https://github.com/PrefectHQ/prefect/blob/a2041c7ff1a619e611f614ed3394fccd05bb2005/src/prefect/client/client.py#L633 If not, what's the best way to handle this case? Are there any configuration changes I could be making to the Apollo pod?
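For reference, retry handling in `requests`-based clients like this typically takes the form of a `urllib3` `Retry` policy mounted on a `requests.Session`. A rough sketch of that pattern (parameter values are illustrative, not necessarily what the linked Prefect code passes):
```python
# Sketch of a requests Session wrapped with a urllib3 Retry policy; the values
# here are illustrative, not necessarily the ones used by the linked client code.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=6,
    backoff_factor=1,                       # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],  # retried based on the HTTP status code
)
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))

# status_forcelist only matters once a response actually comes back; failures
# while connecting or reading are governed by Retry's connect/read counters.
```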
k
Hey @Jason Bertman, this does sound like an overload of the apollo service. You might need to bump up the resources of that pod so it can handle this burst of requests
j
Any recommendations in terms of resource alloc?
k
I don’t know immediately. But can you see the current pod specs? Also, could you trim the traceback a bit when you get the chance, so we can keep the main channel tidier?
j
These are the current resource requests set in the helm chart:
Copy code
resources:
    requests:
      cpu: "0.1"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "4Gi"
Sure, will clean it up, one sec
k
Hard to gauge honestly. Unless someone who knows more than me chimes in, I would naively either double it or lessen the burst of the workload
j
rgr, any chance you know if the set task run state handler has a retrier on it? I'd imagine a simple solution would be to just retry/backoff a bit on the state change.
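To make the "retry/backoff a bit on the state change" idea concrete, here is a hypothetical call-site wrapper (nothing Prefect ships; `update_state` just stands in for whatever actually performs the GraphQL call):
```python
# Hypothetical retry-with-backoff wrapper around a state update call;
# `update_state` is a stand-in for the function that actually hits /graphql.
import time
import requests

def set_state_with_backoff(update_state, attempts=5, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return update_state()
        except requests.exceptions.ConnectionError:
            if attempt == attempts - 1:
                raise                              # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... between tries
```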
k
Ah yeah, I saw that link you posted. It’s the first time I’ve seen it; it seems like there is a retrier, but this literally exceeded the max value
j
maybe I'm misreading it, but it seemed like it only retried for particular error codes, I wasn't certain if this one was being retried or not. I'd ... hope it would be, considering the operation's sensitivity
k
You are right if the request gets sent out, for sure. Is that the case here, though? Because this seems to happen when trying to make a connection to begin with
j
Which ultimately is just a graphql -> POST -> _request call, which is where that initial retry code is
k
I know what you are saying, but I mean this error is like “can’t even connect”, so I don’t know if this even gets a status code, because the API request did not go through. Does that make sense?
j
Right exactly - if there's no status code, I'm not certain it gets retried
k
Ah ok yeah, we are on the same page. I don’t know though if this lack of a connection can be retried natively with the requests module. I think we might be interested in accepting a PR if you have ideas
j
I suppose my argument, if there is to be one, is that maybe we should retry it if we can. Though it's fair to say that if you run into a case like this (no response code due to network error) you'd reason that the service is unreachable and shouldn't be retried...
k
Yeah I understand. If it’s retry-able, we probably could, up to a reasonable max. I can make an open-ended issue for this in a bit on the server repo
j
Seems reasonable to me. Thanks for making the issue, I appreciate it! Obviously you know best, but would the issue more than likely live on the client side (the prefect proper repo) since that is where the retries would take place?
I've got a Prometheus monitor up right now, and am trying to trigger the state again, hopefully I can capture something meaningful
k
You might be right on that. I’ll make one, and one of the engineers can move it if need be. I will open this as a request. No promises though on whether it gets picked up
j
Fair enough, thanks for the help! I will report back if I can reason anything out with the resourcing
👍 1
k
Thanks for your understanding!
j
@Kevin Kho worth noting, a little digging into the urllib3 retrier shows that there is a `connect` option:
Copy code
connect (int) –

How many connection-related errors to retry on.

These are errors raised before the request is sent to the remote server, which we assume has not triggered the server to process the request.

Set to 0 to fail on the first retry of this type.
this appears to be set to 0 by default, so perhaps adding a `connect` kwarg here would help to resolve the issue (I have not tested this yet, but may give it a shot)
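Sketched out, that untested idea would look roughly like the following (values are illustrative, and this is not the actual Prefect code):
```python
# Illustrative only: give connection-establishment errors their own retry budget
# via `connect`, per the urllib3 docs quoted above. Not the actual Prefect code.
from urllib3.util.retry import Retry

retry = Retry(
    total=6,
    connect=3,    # retry errors raised before the request reaches the server
    backoff_factor=1,
    status_forcelist=[500, 502, 503, 504],
)
```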
k
Yeah it would be good if you could give it a shot
👍 1
j
An update this morning after trying out the above: There was no change after using the `connect` kwarg. I figured it was still an issue, or it wasn't retrying, so I dug into the urllib3 code a bit more. Starting at the initial stack trace, the exception being raised is a `ConnectionError`. I (incorrectly) assumed this was covered by the `connect` case. A bit counterintuitively perhaps, the `read` kwarg catches `ProtocolError`: https://github.com/urllib3/urllib3/blob/37a67739ba9a7835886555744e9dabe3009e538a/src/urllib3/util/retry.py#L366 which I found to be aliased to `ConnectionError`: https://github.com/urllib3/urllib3/blob/9fbab29644bfc8068eb2082d0f25726c91813337/src/urllib3/exceptions.py#L84 meaning we should be using `read`, not `connect`. The flow is still running, and has gotten past where it has failed before. Hoping it's not just dumb luck 🤞
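For concreteness, the switch described above would look something like this (illustrative values again, not the exact one-line diff):
```python
# Illustrative version of the change described above: `read` covers errors such
# as urllib3's ProtocolError, which the urllib3 ConnectionError is an alias of.
from urllib3.util.retry import Retry

retry = Retry(
    total=6,
    read=3,       # retry the ProtocolError / aliased ConnectionError case
    backoff_factor=1,
    status_forcelist=[500, 502, 503, 504],
)
```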
k
So confirming that adding the `read` kwarg helps here? I can open the issue in a bit to document this. It does sound hard to know whether it will take effect
j
The flow just finished without any issues with >15K state lifecycles, so I'm thinking this fix will work for us. I can open up a PR; it's just a one-line change.
k
Oh yeah that would work, and then could you detail the issue so the core engineers understand? Cuz I don’t handle those
j
No problem, I will also link this thread 👍
k
Thank you! The core team will see it
👍 1