Guna

07/11/2023, 5:25 PM
Hello, I am using Prefect 0.15.11. We have been using it for more than 1.5 years, but recently I have been facing an issue where Prefect Server crashes very frequently. I can't find any pattern in the crashes, and sometimes the agent gets disconnected. I found the following error in the server logs. Can someone help me find the root cause?
{"severity": "ERROR", "name": "prefect-server.Scheduler", "message": "Unexpected error: ReadTimeout(TimeoutError())", "exc_info": "Traceback (most recent call last):
  File "/prefect-server/src/prefect_server/services/loop_service.py", line 60, in run
    await self.run_once()
  File "/prefect-server/src/prefect_server/services/towel/scheduler.py", line 47, in run_once
    offset=500 * iterations,
  File "/prefect-server/src/prefect_server/database/orm.py", line 501, in get
    as_box=not apply_schema,
  File "/prefect-server/src/prefect_server/database/hasura.py", line 85, in execute
    as_box=as_box,
  File "/prefect-server/src/prefect_server/utilities/graphql.py", line 64, in execute
    timeout=30,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1385, in post
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1148, in request
    request, auth=auth, allow_redirects=allow_redirects, timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1169, in send
    request, auth=auth, timeout=timeout, allow_redirects=allow_redirects,
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1196, in send_handling_redirects
    request, auth=auth, timeout=timeout, history=history
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1232, in send_handling_auth
    response = await self.send_single_request(request, timeout)
  File "/usr/local/lib/python3.7/site-packages/httpx/_client.py", line 1269, in send_single_request
    timeout=timeout.as_dict(),
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/connection_pool.py", line 153, in request
    method, url, headers=headers, stream=stream, timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py", line 78, in request
    return await self.connection.request(method, url, headers, stream, timeout)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/http11.py", line 62, in request
    ) = await self._receive_response(timeout)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/http11.py", line 115, in _receive_response
    event = await self._receive_event(timeout)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_async/http11.py", line 145, in _receive_event
    data = await self.socket.read(self.READ_NUM_BYTES, timeout)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_backends/asyncio.py", line 135, in read
    self.stream_reader.read(n), timeout.get("read")
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc) from None
httpcore._exceptions.ReadTimeout"}
Based on the error above, it seems to be caused by the GraphQL connection. Can someone please suggest a solution?

nicholas

07/11/2023, 5:32 PM
Hi @Guna - 1.5 years running - nice! From the error it looks like your server is having trouble reading from the db, likely because your tables have gotten pretty big. I can’t say for certain but it’s possible you should truncate them to improve read performance

Guna

07/11/2023, 5:39 PM
@nicholas, can you please share any docs or links on how to truncate the data in Prefect Server? I started the server with
prefect server start
which starts the server in Docker containers.
Thanks for the quick response.

nicholas

07/11/2023, 6:26 PM
Sorry I should have been more specific with my recommendation - I don’t have any docs to share on this because it’s pretty implementation-specific. You can use any database tools you’re familiar with but by default
prefect server start
spins up a postgres sidecar - by truncating I mean removing data from tables in your database. I’ll caveat this by saying this data is your own and you should make backups of your database before you do any deletion.
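To make the "remove data from tables" idea concrete, here is a minimal sketch in Python that only builds a parameterized DELETE statement for rows older than a retention window. The table name `log`, the column name `timestamp`, and the 30-day window are assumptions for illustration, so check the real names in your own database first, and take a backup (for example with `pg_dump`) before running anything like this.

```python
import re
from datetime import datetime, timedelta, timezone

# Only plain identifiers are accepted, because table and column names
# cannot be passed as query parameters and are interpolated directly.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_prune_statement(table, ts_column, keep_days, now=None):
    """Return (sql, params) that deletes rows older than keep_days.

    The table and column names are supplied by the caller; nothing here
    knows the real Prefect Server schema.
    """
    for name in (table, ts_column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if keep_days < 0:
        raise ValueError("keep_days must be non-negative")

    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)

    # %s is the placeholder style used by psycopg2-compatible drivers.
    sql = f'DELETE FROM "{table}" WHERE "{ts_column}" < %s'
    return sql, (cutoff,)


# Hypothetical usage: "log" and "timestamp" are assumed names.
sql, params = build_prune_statement("log", "timestamp", keep_days=30)
print(sql, params)
```

The returned pair can be passed to a psycopg2-style `cursor.execute(sql, params)`. Keeping the cutoff as a query parameter rather than formatting it into the string avoids quoting mistakes with timestamps.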

Guna

07/12/2023, 5:07 AM
One more thing I should add: we deployed the same flow on a new server last week, and this morning that server crashed. Logs from Apollo:
"PayloadTooLargeError: request entity too large
   ...:     at readStream (/apollo/node_modules/raw-body/index.js:155:17)
   ...:     at getRawBody (/apollo/node_modules/raw-body/index.js:108:12)
   ...:     at read (/apollo/node_modules/body-parser/lib/read.js:77:3)
   ...:     at jsonParser (/apollo/node_modules/body-parser/lib/types/json.js:135:5)
   ...:     at Layer.handle [as handle_request] (/apollo/node_modules/express/lib/router/layer.js:95:5)
   ...:     at trim_prefix (/apollo/node_modules/express/lib/router/index.js:317:13)
   ...:     at /apollo/node_modules/express/lib/router/index.js:284:7
   ...:     at Function.process_params (/apollo/node_modules/express/lib/router/index.js:335:12)
   ...:     at next (/apollo/node_modules/express/lib/router/index.js:275:10)
   ...:     at cors (/apollo/node_modules/cors/lib/index.js:188:7)
   ...: PayloadTooLargeError: request entity too large
And in the server log:
{"type":"http-log","timestamp":"2023-07-12T02:32:23.085+0000","level":"error","detail":{"operation":{"user_vars":{"x-hasura-role":"admin"},"error":{"internal":"could not translate host name "postgres" to address: Name or service not known
","path":"$","error":"connection error","code":"postgres-error"},"request_id":"99afc8bd-aaad-490e-931c-c5e50b2e0641","response_size":159,"query":{"variables":{"insert_objects":[{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"a93fabf5-762f-46c1-9b06-503918684456","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:07:59.306652+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"76a988ef-4fad-45b1-9d1b-eb172694696f","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:08:14.403071+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"4459791b-c10f-45c9-b515-71820f252461","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:08:29.669403+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"5006e050-7094-4152-a42c-e489bd143d32","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:08:44.769727+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"ec53677d-f87b-4c63-9d2c-8b8969c66df4","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks 
failed.">","timestamp":"2023-07-12T02:08:59.867201+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"ff96628b-e4fe-4489-844d-9d5bb63b1789","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:09:14.960960+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"82e07da7-9fb2-42ba-a648-25c385a4480a","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:09:30.191609+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"b1cf2c77-4307-4580-a777-3f59380b9f13","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:09:45.292941+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"25000778-7ab3-400e-a102-f21d2a7dd0cc","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:10:00.401761+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"31a1eca8-dd81-47c2-8d61-bb6928e86951","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks 
failed.">","timestamp":"2023-07-12T02:10:15.513793+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"e89ea0a8-28e9-423c-83d5-ab7e5686eb03","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:10:31.002676+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"8729825a-4b78-4fce-a8a4-2b18dd28d13e","message":"Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">","timestamp":"2023-07-12T02:10:46.095858+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"c0574006-afd1-4f05-ab55-d3b584882eee","message":"Error getting flow run info
Traceback (most recent call last):
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/prefect/engine/cloud/flow_runner.py", line 188, in interrupt_if_cancelling
    flow_run_info = self.client.get_flow_run_info(flow_run_id)
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/prefect/client/client.py", line 1564, in get_flow_run_info
    result = self.graphql(query).data.flow_run_by_pk  # type: ignore
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'connection error', 'locations': [{'line': 2, 'column': 5}], 'path': ['flow_run_by_pk'], 'extensions': {'internal': 'could not translate host name "postgres" to address: Name or service not known\n', 'path': '$', 'code': 'postgres-error', 'exception': {'message': 'connection error'}}}]","timestamp":"2023-07-12T02:11:01.209955+00:00","level":"WARNING","flow_run_id":"c2f8cb02-2012-48bb-ae85-67be141cbed9","info":null},{"tenant_id":null,"task_run_id":null,"name":"prefect.CloudFlowRunner","id":"7ddc3a30-3e09-4f8e-b776-3a696ab38969","message":"Error getting flow run info
Traceback (most recent call last):
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/rh/rh-python38/root/lib64/python3.8/http/client.py", line 1347, in getresponse
    response.begin()
  File "/opt/rh/rh-python38/root/lib64/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/opt/rh/rh-python38/root/lib64/python3.8/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/opt/rh/rh-python38/root/lib64/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/infra/prefect_server/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.40.1.55', port=4200): Read timed out. (read timeout=15)
This server was only deployed last week, and it still failed.

nicholas

07/12/2023, 3:43 PM
Hm, tough to provide a lot of support on this - it _looks_ like you're running into issues with the graphql payload request limit, which is usually because you're passing large parameter sets or something with your flow registration, but I can't tell for sure.
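One way to check the payload-size theory is to measure how large a parameter set becomes once it is serialized to JSON. The sketch below is only an illustration: the helper names and the 1 MB limit are assumptions, and the actual limit depends on how your Apollo container is configured.

```python
import json


def json_payload_size(payload):
    """Size in bytes of the payload once JSON-encoded as UTF-8."""
    return len(json.dumps(payload).encode("utf-8"))


def largest_fields(payload, top=3):
    """Return the top-level keys with the largest serialized values."""
    sizes = [(key, json_payload_size(value)) for key, value in payload.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


def exceeds_limit(payload, limit_bytes):
    """True if the serialized payload is larger than limit_bytes."""
    return json_payload_size(payload) > limit_bytes


# Hypothetical parameter set and limit, for illustration only.
parameters = {"rows": list(range(100_000)), "name": "nightly-load"}
ASSUMED_LIMIT_BYTES = 1_000_000

print(json_payload_size(parameters))
print(largest_fields(parameters))
print(exceeds_limit(parameters, ASSUMED_LIMIT_BYTES))
```

If one field dominates the size, a common workaround is to pass a reference to the data (a file path or key) as the parameter instead of the data itself.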