# prefect-server
j
Could anybody give me some idea of sensible default resource limits for the helm chart? They are all unspecified. My apollo server seems to be having trouble with connections and I wonder if it's because the deployment is not requesting enough resources? I have a few mapped tasks feeding into each other and all goes well until towards the end, when some of the mapped tasks finish in a pending state - this is completely random behaviour; sometimes everything works as expected. The particular error I get in the logs is this (in the thread):
```
Failed to set task state with error: ConnectionError(MaxRetryError("HTTPConnectionPool(host='34.105.133.228', port=4200): Max retries exceeded with url: /graphql/graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd4b2539df0>: Failed to establish a new connection: [Errno 104] Connection reset by peer'))"))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.8/http/client.py", line 950, in send
    self.connect()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fd4b2539df0>: Failed to establish a new connection: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='34.105.133.228', port=4200): Max retries exceeded with url: /graphql/graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd4b2539df0>: Failed to establish a new connection: [Errno 104] Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 98, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1518, in set_task_run_state
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 298, in graphql
    result = self.post(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 213, in post
    response = self._request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 459, in _request
    response = self._send_request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 351, in _send_request
    response = session.post(
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 590, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='34.105.133.228', port=4200): Max retries exceeded with url: /graphql/graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd4b2539df0>: Failed to establish a new connection: [Errno 104] Connection reset by peer'))
```
n
Hi @Josh Greenhalgh - that error is suspicious, particularly this line:
`Max retries exceeded with url: /graphql/graphql`
`/graphql/graphql` is not a valid endpoint as far as I know, which would explain the retries being exceeded.
j
Yeah absolutely, that line stood out to me too - but as I said, the previous tens of mapped tasks all run fine. Have actually just noticed something weird going on with apollo's memory usage - it just seems to keep increasing (I killed the pod at the end)?
n
What's the scale of the mapped runs you're generating?
j
Not huge - like 70.
But I would like it to get quite big.
n
Hm got it - well tbh that's not a lot of memory usage for the application; the limits you give will really depend on the workloads you're expecting, so I can't recommend any hard cap. However, I wouldn't be surprised if certain queries were leading to extended connections, which could lead to the error you're seeing. Setting task run states in particular can be extremely taxing on the server/db, since so many parts of the system need to be accessed in a short amount of time.
j
Ok so it could be database problem?
n
Definitely could be; I'd investigate the number of active connections to the DB
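As a concrete way to do that check, here is a minimal sketch using psycopg2 against the chart's Postgres; the connection details below are placeholders for your own values, and it assumes you have port-forwarded the database service locally first:
```python
# Count connections on the Prefect Server database, grouped by state.
# Placeholder credentials - substitute your own. One way to reach the DB from
# outside the cluster: kubectl port-forward svc/prefect-server-postgresql 5432:5432
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="prefect",       # placeholder database name
    user="prefect",         # placeholder user
    password="<password>",  # placeholder password
)
try:
    with conn.cursor() as cur:
        # pg_stat_activity has one row per server connection
        cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state;")
        for state, count in cur.fetchall():
            print(state, count)
finally:
    conn.close()
```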
j
Any guidance re: the size of db to use?
I am on GCP.
So I bumped the apollo CPU request and limit to 0.25 and 0.5 respectively and all seems good!? Would be great if anyone has any guidance on how to right-size the various deployments? Not necessarily anything prescriptive, but some ideas perhaps?
Oh well nope that has not helped... šŸ˜­
a
@Josh Greenhalgh I have seen the same thing... For me, reducing the health check frequency a ton and giving the server a little more resources fixed it
I am using an external PG database, so that had nothing at all to do with stability
j
So I have set all services to have a request of 0.25 and a limit of 0.5 on CPU - I have left memory alone (should I set this too?) - when you say health check frequency, for which service? Apollo?
Can I modify the timeout on requests to graphql from within the flow?
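For reference, a hedged sketch of one way this might be done: the Prefect client reads its HTTP timeout from configuration (a `cloud.request_timeout` setting exists in recent 0.14.x releases, but confirm the key against your version's config.toml), and any config key can be overridden with an environment variable on the flow's run config:
```python
# Sketch: override the Prefect client's request timeout for this flow's runs.
# Assumes the cloud.request_timeout config key exists in your Prefect version;
# PREFECT__SECTION__KEY is Prefect's standard env-var override pattern.
from prefect import Flow
from prefect.run_configs import KubernetesRun

flow = Flow("weather_data_etl")  # stand-in; attach to your real flow object
flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__REQUEST_TIMEOUT": "60"}  # seconds
)
```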
I am really at my wits' end now - I am sitting here massaging ~100 flow runs to succeed by finding the tasks where the timeout happens and the task is left in a pending state, cancelling them and restarting the flow! If this is the level of manual intervention that is going to be required for very simple computations then prefect is close to unusable... Can someone please help šŸ˜­ I know the solution is to use cloud but I just cannot do that right now... In the meantime, how can I make the api server robust to these timeout issues - how can I increase the timeout, for example (it seems to come from urllib3)? Can I set the default somehow? I have added random jitter to my tasks (sleep for some time after finishing) in the hope that this will stop the api server getting swamped with requests in lockstep as my tasks finish; this has improved matters but problems still arise... and my sub-1s tasks now take up to 5s (maybe the problem is simply that my tasks are too small? Is there a recommendation on what the minimal task runtime should be? The power of prefect for me is that I can write small, easily testable functions and leave prefect to do the orchestration - maybe this is the wrong pattern!?). I have tried adding another replica of apollo; this didn't help...
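As a concrete illustration of the jitter workaround described above, here is a minimal sketch; the task name and body are hypothetical stand-ins, and the sleep bounds are arbitrary rather than a recommendation:
```python
import random
import time

from prefect import task

@task
def my_mapped_task(item):
    result = item * 2  # stand-in for the real work
    # Random jitter so mapped tasks don't all report back to the API server in lockstep
    time.sleep(random.uniform(0.5, 3.0))
    return result
```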
May be speaking too soon yet again but... the combination of sleeping in my tasks and having 2 replicas of both the apollo and graphql services seems to have solved it... fingers tightly crossed!
nope...
it did improve things significantly though - trying three replicas...
booo! nope!
m
Hello Josh! Do you use any other monitoring tools besides GCP? If not, can you describe the pods and check whether the terminated status was caused by OOMKilled? If you don't allocate enough memory to the pods, Kubernetes will kill any pod that violates its memory limits.
j
Which pods? They aren't restarting - apollo/graphql, that is...
"Do you use any other monitoring tools besides GCP?" like what?
m
You can deploy the kube-prometheus-stack helm chart, Datadog, etc. That would help you monitor resource utilization by your pods.
j
But I really don't think it's a resource problem - I have 4 replicas of both apollo and graphql now and the issue still happens, granted much less frequently...
None of the services are restarting - I'm just getting the timeout error randomly in some tasks.
apollo resource usage (screenshot);
graphql resource usage (screenshot);
usage is significantly below the requested values and limits
m
I'm glad you are seeing fewer issues, and it looks like your apollo and graphql resource usage is under the limits - that narrows the scope for identifying the problem.
j
No pod restarts;
```
prefect-agent-b48754c7-phlpf              1/1     Running   0          27h
prefect-server-apollo-664847899-bps5g     1/1     Running   0          4h15m
prefect-server-apollo-664847899-dsmzx     1/1     Running   0          4h16m
prefect-server-apollo-664847899-l5q6r     1/1     Running   0          3h22m
prefect-server-apollo-664847899-p88zt     1/1     Running   0          3h56m
prefect-server-graphql-54c7877659-7mmz6   1/1     Running   0          3h22m
prefect-server-graphql-54c7877659-gf7f5   1/1     Running   0          4h15m
prefect-server-graphql-54c7877659-x82lz   1/1     Running   0          4h16m
prefect-server-graphql-54c7877659-x947p   1/1     Running   0          3h56m
prefect-server-hasura-768f675d98-jt9qt    1/1     Running   0          23h
prefect-server-postgresql-0               1/1     Running   0          27h
prefect-server-towel-6f9b7bd98c-jr5d5     1/1     Running   0          23h
prefect-server-ui-5785c45564-5l9cq        1/1     Running   0          23h
```
I really want to see some logs in apollo that could help diagnose this, but even at debug level I just have;
```
Checking GraphQL service at http://prefect-server-graphql.prefect:4201/health ...
{"status":"ok","version":"2021.02.22"}
GraphQL service healthy!

> @ serve /apollo
> node dist/index.js

Building schema...
Building schema complete!
Server ready at http://0.0.0.0:4200 šŸš€ (version: 2021.02.22)
```
This is what my flow code looks like;
```python
from prefect import Flow, Parameter, unmapped

# task definitions (parameterise_sql, get_locations, extract_raw_from_api,
# process_daily_data, process_hourly_data, merge_daily_hourly, load_result)
# are defined elsewhere with @task

with Flow("weather_data_etl") as flow:

    extract_date = Parameter("extract_date", default=None)
    extract_period_days = Parameter("extract_period_days", default=None)

    asset_filters = Parameter("asset_filters", default=None)
    asset_limit = Parameter("asset_limit", default=None)

    sql = parameterise_sql(asset_filters=asset_filters, asset_limit=asset_limit)

    asset_locations = get_locations(sql=sql)

    raw = extract_raw_from_api.map(
        asset_location=asset_locations,
        extract_date=unmapped(extract_date),
        extract_period_days=unmapped(extract_period_days),
    )

    daily_data = process_daily_data.map(raw=raw, asset_location=asset_locations)
    hourly_data = process_hourly_data.map(raw=raw, asset_location=asset_locations)

    merged = merge_daily_hourly.map(daily_data=daily_data, hourly_data=hourly_data)

    df = load_result(data=merged)
```
`asset_locations` is currently of size 70, but I need it to get much bigger.
Also, given no pods are actually falling over, a way to increase whatever timeout/number of retries is involved would more than likely solve my issue, but I cannot see any way of accomplishing this...
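On the retries point, the HTTP retry count inside the client isn't something I can point to a documented knob for, but Prefect does offer task-level retries - a different mechanism that only fires when a task actually fails, yet it can absorb transient errors. A sketch against the mapped task from the flow above:
```python
from datetime import timedelta

from prefect import task

# Task-level retries: if the task raises, Prefect reschedules it after
# retry_delay, up to max_retries times. This is separate from the HTTP
# retries the client performs when talking to apollo.
@task(max_retries=3, retry_delay=timedelta(seconds=30))
def extract_raw_from_api(asset_location, extract_date, extract_period_days):
    ...  # body unchanged from the real task
```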
This is basically the kind of state I always end up in (the cancelled tasks actually end in pending and the job pod terminates - I cancel and retry and all works, but I cannot be doing that continuously);
m
Unfortunately I can't help you solve this problem right now, but would you mind opening an issue with the details you've provided? Right now I'm having difficulty reproducing this error.
j
Yeah absolutely - it's definitely going to be hard to reproduce, but I can provide Terraform IaC if that would help?
Will try to set up a minimal repo.
m
It would be great! Any information will be helpful!
j
So I thought I should come back and update here - the issues seem to have gone now. The steps I took to solve it were:
ā€¢ Use the official helm chart repo, as opposed to the chart I copied from the repo some time in the past
ā€¢ Bump versions to 0.14.12
m
Sounds great! And thank you for your update! šŸ™‚
j
I knew it was a bad idea to be pointing at a chart from a while back - I just didn't have the time to go back and change to the official one until now...