# prefect-server
j
Could anybody give me some idea of sensible default resource limits for the helm chart? They are all unspecified. My apollo server seems to be having trouble with connections and I wonder if it's because the deployment is not requesting enough resources? I have a few mapped tasks feeding into each other and all goes well until towards the end, when some of the mapped tasks finish in a pending state - this is completely random behaviour; sometimes everything works as expected. The particular error I get in the logs is this (in the thread):
```
Failed to set task state with error: ConnectionError(MaxRetryError("HTTPConnectionPool(host='34.105.133.228', port=4200): Max retries exceeded with url: /graphql/graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd4b2539df0>: Failed to establish a new connection: [Errno 104] Connection reset by peer'))"))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.8/http/client.py", line 950, in send
    self.connect()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fd4b2539df0>: Failed to establish a new connection: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='34.105.133.228', port=4200): Max retries exceeded with url: /graphql/graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd4b2539df0>: Failed to establish a new connection: [Errno 104] Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 98, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1518, in set_task_run_state
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 298, in graphql
    result = self.post(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 213, in post
    response = self._request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 459, in _request
    response = self._send_request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 351, in _send_request
    response = session.post(
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 590, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='34.105.133.228', port=4200): Max retries exceeded with url: /graphql/graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd4b2539df0>: Failed to establish a new connection: [Errno 104] Connection reset by peer'))
```
n
Hi @Josh Greenhalgh - that error is suspicious, particularly this line:
`Max retries exceeded with url: /graphql/graphql`
`/graphql/graphql` is not a valid endpoint as far as I know, which would explain the retries being exceeded.
j
Yeah absolutely, that line stood out to me too - but as I said, the previous tens of mapped tasks all run fine. Have actually just noticed something weird going on with apollo's memory usage - it just seems to keep increasing (I killed the pod at the end)?
n
What's the scale of the mapped runs you're generating?
j
Not huge - like 70.
But I would like it to get quite big.
n
Hm got it - well tbh that's not a lot of memory usage for the application; the limits you give will really depend on the workloads you're expecting, so I can't recommend any hard cap. However, I wouldn't be surprised if certain queries were leading to extended connections, which could lead to the error you're seeing. Setting task run states in particular can be extremely taxing on the server/db, since so many parts of the system need to be accessed in a short amount of time.
j
Ok so it could be database problem?
n
Definitely could be; I'd investigate the number of active connections to the DB
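As a concrete way to do that check, here is a minimal sketch using psycopg2 against the chart's Postgres; the connection details below are placeholders for your own values, and it assumes you have port-forwarded the database service locally first:
```python
# Count connections on the Prefect Server database, grouped by state.
# Placeholder credentials - substitute your own. One way to reach the DB from
# outside the cluster: kubectl port-forward svc/prefect-server-postgresql 5432:5432
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="prefect",       # placeholder database name
    user="prefect",         # placeholder user
    password="<password>",  # placeholder password
)
try:
    with conn.cursor() as cur:
        # pg_stat_activity has one row per server connection
        cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state;")
        for state, count in cur.fetchall():
            print(state, count)
finally:
    conn.close()
```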
j
Any guidance re: the size of db to use?
I am on GCP.
So I bumped the apollo CPU request and limit to 0.25 and 0.5 respectively and all seems good!? Would be great if anyone has any guidance on how to right-size the various deployments? Not necessarily anything prescriptive, but some ideas perhaps?
Oh well nope that has not helped... šŸ˜­
a
@Josh Greenhalgh I have seen the same thing... For me, reducing the health check frequency a ton and giving the server a little more resources fixed it
I am using an external PG database, so that had nothing at all to do with stability
j
So I have set all services to have a request of 0.25 and a limit of 0.5 on CPU - I have left memory alone (should I set this too?) - when you say health check frequency, for which service? Apollo?
Can I modify the timeout on requests to graphql from within the flow?
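For reference, a hedged sketch of one way this might be done: the Prefect client reads its HTTP timeout from configuration (a `cloud.request_timeout` setting exists in recent 0.14.x releases, but confirm the key against your version's config.toml), and any config key can be overridden with an environment variable on the flow's run config:
```python
# Sketch: override the Prefect client's request timeout for this flow's runs.
# Assumes the cloud.request_timeout config key exists in your Prefect version;
# PREFECT__SECTION__KEY is Prefect's standard env-var override pattern.
from prefect import Flow
from prefect.run_configs import KubernetesRun

flow = Flow("weather_data_etl")  # stand-in; attach to your real flow object
flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__REQUEST_TIMEOUT": "60"}  # seconds
)
```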
I am really at my wits' end now - I am sitting here massaging ~100 flow runs to succeed by finding the tasks where the timeout happens and the task is left in a pending state, cancelling them and restarting the flow! If this is the level of manual intervention that is going to be required for very simple computations then prefect is close to unusable... Can someone please help šŸ˜­ I know the solution is to use cloud but I just cannot do that right now... In the meantime, how can I make the api server robust to these timeout issues - how can I increase the timeout, for example (it seems to come from urllib3)? Can I set the default somehow? I have added random jitter to my tasks (sleep for some time after finishing) in the hope that this will stop the api server getting swamped with requests in lockstep as my tasks finish; this has improved matters but problems still arise... and my sub-1s tasks now take up to 5s (maybe the problem is simply that my tasks are too small? Is there a recommendation on what the minimal task runtime should be? The power of prefect for me is that I can write small, easily testable functions and leave prefect to do the orchestration - maybe this is the wrong pattern!?). I have tried adding another replica of apollo; this didn't help...
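As a concrete illustration of the jitter workaround described above, here is a minimal sketch; the task name and body are hypothetical stand-ins, and the sleep bounds are arbitrary rather than a recommendation:
```python
import random
import time

from prefect import task

@task
def my_mapped_task(item):
    result = item * 2  # stand-in for the real work
    # Random jitter so mapped tasks don't all report back to the API server in lockstep
    time.sleep(random.uniform(0.5, 3.0))
    return result
```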
May be speaking too soon yet again but... the combination of sleeping in my tasks and having 2 replicas of both the apollo and graphql services seems to have solved it... fingers tightly crossed!
nope...
it did improve things significantly though - trying three replicas...
booo! nope!
m
Hello Josh! Do you use any other monitoring tools besides GCP? If not, can you describe the pods and check whether the terminated status was caused by OOMKilled? If you don't allocate enough memory to the pods, Kubernetes will kill any pod that violates its memory limits.
j
Which pods? They aren't restarting - apollo/graphql, that is...
"Do you use any other monitoring tools besides GCP?" like what?
m
You can deploy the kube-prometheus-stack helm chart, Datadog, etc. That would help you monitor resource utilization by your pods.
j
But I really don't think it's a resource problem - I have 4 replicas of both apollo and graphql now and the issue still happens, granted much less frequently...
None of the services are restarting - I'm just getting the timeout error randomly in some tasks.
apollo resource usage (screenshot);
graphql resource usage (screenshot);
usage is significantly below the requested values and limits
m
I'm glad you are seeing fewer issues, and it looks like your apollo and graphql resource usage is under the limits - that narrows the scope for identifying the problem.
j
No pod restarts;
```
prefect-agent-b48754c7-phlpf              1/1     Running   0          27h
prefect-server-apollo-664847899-bps5g     1/1     Running   0          4h15m
prefect-server-apollo-664847899-dsmzx     1/1     Running   0          4h16m
prefect-server-apollo-664847899-l5q6r     1/1     Running   0          3h22m
prefect-server-apollo-664847899-p88zt     1/1     Running   0          3h56m
prefect-server-graphql-54c7877659-7mmz6   1/1     Running   0          3h22m
prefect-server-graphql-54c7877659-gf7f5   1/1     Running   0          4h15m
prefect-server-graphql-54c7877659-x82lz   1/1     Running   0          4h16m
prefect-server-graphql-54c7877659-x947p   1/1     Running   0          3h56m
prefect-server-hasura-768f675d98-jt9qt    1/1     Running   0          23h
prefect-server-postgresql-0               1/1     Running   0          27h
prefect-server-towel-6f9b7bd98c-jr5d5     1/1     Running   0          23h
prefect-server-ui-5785c45564-5l9cq        1/1     Running   0          23h
```
I really want to see some logs in apollo that could help diagnose this, but even at debug level I just have;
```
Checking GraphQL service at http://prefect-server-graphql.prefect:4201/health ...
{"status":"ok","version":"2021.02.22"}
GraphQL service healthy!

> @ serve /apollo
> node dist/index.js

Building schema...
Building schema complete!
Server ready at http://0.0.0.0:4200 šŸš€ (version: 2021.02.22)
```
This is what my flow code looks like;
```python
from prefect import Flow, Parameter, unmapped

# task definitions (parameterise_sql, get_locations, extract_raw_from_api,
# process_daily_data, process_hourly_data, merge_daily_hourly, load_result)
# are defined elsewhere with @task

with Flow("weather_data_etl") as flow:

    extract_date = Parameter("extract_date", default=None)
    extract_period_days = Parameter("extract_period_days", default=None)

    asset_filters = Parameter("asset_filters", default=None)
    asset_limit = Parameter("asset_limit", default=None)

    sql = parameterise_sql(asset_filters=asset_filters, asset_limit=asset_limit)

    asset_locations = get_locations(sql=sql)

    raw = extract_raw_from_api.map(
        asset_location=asset_locations,
        extract_date=unmapped(extract_date),
        extract_period_days=unmapped(extract_period_days),
    )

    daily_data = process_daily_data.map(raw=raw, asset_location=asset_locations)
    hourly_data = process_hourly_data.map(raw=raw, asset_location=asset_locations)

    merged = merge_daily_hourly.map(daily_data=daily_data, hourly_data=hourly_data)

    df = load_result(data=merged)
```
`asset_locations` is currently of size 70, but I need it to get much bigger.
Also, given no pods are actually falling over, a way to increase whatever timeout/number of retries is involved would more than likely solve my issue, but I cannot see any way of accomplishing this...
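On the retries point, the HTTP retry count inside the client isn't something I can point to a documented knob for, but Prefect does offer task-level retries - a different mechanism that only fires when a task actually fails, yet it can absorb transient errors. A sketch against the mapped task from the flow above:
```python
from datetime import timedelta

from prefect import task

# Task-level retries: if the task raises, Prefect reschedules it after
# retry_delay, up to max_retries times. This is separate from the HTTP
# retries the client performs when talking to apollo.
@task(max_retries=3, retry_delay=timedelta(seconds=30))
def extract_raw_from_api(asset_location, extract_date, extract_period_days):
    ...  # body unchanged from the real task
```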
This is basically the kind of state I always end up in (the cancelled tasks actually end in pending and the job pod terminates - I cancel and retry and all works, but I cannot be doing that continuously);
m
Unfortunately I can't help you solve this problem right now, but would you mind opening an issue with the details you've provided? Right now I'm having difficulty reproducing this error.
j
Yeah absolutely - it's definitely going to be hard to reproduce, but I can provide Terraform IaC if that would help?
Will try to set up a minimal repo.
m
It would be great! Any information will be helpful!
j
So I thought I should come back and update here - the issues seem to have gone now. The steps I took to solve it were:
ā€¢ Use the official helm chart repo, as opposed to the chart I copied from the repo some time in the past
ā€¢ Bump versions to 0.14.12
m
Sounds great! And thank you for your update! šŸ™‚
j
I knew it was a bad idea to be pointing at a chart from a while back - I just didn't have the time to go back and change to the official one until now...