# prefect-community
a
Hi, we've been having some issues with Prefect. A couple of our flows have been running about 14x as long as they usually take, and I can't seem to cancel or restart them. Logs in thread...
Failed to set task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)"))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1598, in set_task_run_state
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 452, in graphql
    result = self.post(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 407, in post
    response = self._request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 641, in _request
    response = self._send_request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 506, in _send_request
    response = session.post(
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 577, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)
Failed to retrieve task state with error: ClientError([{'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_or_create_task_run_info'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}])
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 154, in initialize_run
    task_run_info = self.client.get_task_run_info(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1479, in get_task_run_info
    result = self.graphql(mutation)  # type: Any
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_or_create_task_run_info'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
Restarting the flow gives this error:
Error: GraphQL error: [{'extensions': {'internal': {'statement': 'WITH "task_run_state__mutation_result_alias" AS (INSERT INTO "public"."task_run_state" ( "state", "created", "tenant_id", "task_run_id", "result", "version", "serialized_state", "start_time", "id", "updated", "message", "timestamp" ) VALUES ((\'Pending\')::varchar, DEFAULT, (\'33424649-439b-4eb5-9741-9513015f8880\')::uuid, (\'d2a6f65b-aab6-4221-93f7-356ebd6561fc\')::uuid, (NULL)::jsonb, (\'4\')::integer, (\'{"context":{},"_result":{"__version__":"1.2.0+10.gafda99411","type":"NoResultType"},"__version__":"1.2.0+10.gafda99411","cached_inputs":{},"type":"Pending","message":"null restarted this flow run"}\')::jsonb, (NULL)::timestamptz, (\'fb9fca86-1b82-420f-b19d-b8299860d5bd\')::uuid, DEFAULT, (\'null restarted this flow run\')::varchar, (\'2022-07-27T17:42:45.772814Z\')::timestamptz) RETURNING * , (\'true\')::boolean AS "check__constraint"), "task_run_state__all_columns_alias" AS (SELECT "id" , "tenant_id" , "task_run_id" , "timestamp" , "state" , "message" , "result" , "start_time" , "serialized_state" , "created" , "updated" , "version" FROM "task_run_state__mutation_result_alias" ) SELECT json_build_object(\'returning\', (SELECT coalesce(json_agg("root" ), \'[]\' ) AS "root" FROM (SELECT row_to_json((SELECT "_1_e" FROM (SELECT "_0_root.base"."id" AS "id" ) AS "_1_e" ) ) AS "root" FROM (SELECT * FROM "task_run_state__all_columns_alias" WHERE (\'true\') ) AS "_0_root.base" ) AS "_2_root" ) ) , (SELECT coalesce(bool_and("check__constraint" ), \'true\' ) FROM "task_run_state__mutation_result_alias" ) ', 'prepared': False, 'error': {'exec_status': 'FatalError', 'hint': 'Check free disk space.', 'message': 'could not extend file "base/17149/17749.2": No space left on device', 'status_code': '53100', 'description': None}, 'arguments': []}, 'path': '$.selectionSet.insert_task_run_state.args.objects', 'code': 'unexpected'}, 'message': 'database query error'}]
Trying to cancel the flows says:
Something went wrong when trying to cancel this flow run, please try again.
And then there's also this error that I am struggling to identify the root of:
Error getting flow run info
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/flow_runner.py", line 188, in interrupt_if_cancelling
    flow_run_info = self.client.get_flow_run_info(flow_run_id)
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1240, in get_flow_run_info
    result = self.graphql(query).data.flow_run_by_pk  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'connection error', 'locations': [{'line': 2, 'column': 5}], 'path': ['flow_run_by_pk'], 'extensions': {'internal': 'FATAL:  the database system is in recovery mode\nFATAL:  the database system is in recovery mode\n', 'path': '$', 'code': 'postgres-error', 'exception': {'message': 'connection error'}}}]
t
Hi Apoorva, what is the prefect version?
a
1.2.0
@Taylor Curran gentle bump as our production is broken
n
Hi @Apoorva Desai - can you give us a bit more info on your system? Are you running in Prefect Cloud or Prefect Server?
a
Prefect Server
n
Got it - it looks like this is an infrastructure problem; your database appears to be down.
Ah yup ok
If you look in the second message, it says that your disk space is full; your database is unable to write
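For anyone reading these payloads later, the disk-space hint is buried several levels deep in the restart error. A minimal Python sketch of digging it out is below. The helper names `iter_dicts` and `find_postgres_error` are made up for illustration and are not part of Prefect; the payload is a trimmed copy of the restart error above with the SQL statement elided. Postgres status code 53100 is the `disk_full` condition.

```python
from typing import Any, Dict, Iterator, List, Optional


def iter_dicts(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict nested anywhere inside node, depth-first."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_dicts(item)


def find_postgres_error(errors: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first nested dict carrying both a status code and a message."""
    for candidate in iter_dicts(errors):
        if "status_code" in candidate and "message" in candidate:
            return candidate
    return None


# Trimmed copy of the restart error from this thread (SQL statement elided).
restart_errors = [
    {
        "extensions": {
            "internal": {
                "statement": "...",
                "prepared": False,
                "error": {
                    "exec_status": "FatalError",
                    "hint": "Check free disk space.",
                    "message": 'could not extend file "base/17149/17749.2": '
                    "No space left on device",
                    "status_code": "53100",
                    "description": None,
                },
                "arguments": [],
            },
            "path": "$.selectionSet.insert_task_run_state.args.objects",
            "code": "unexpected",
        },
        "message": "database query error",
    }
]

pg_error = find_postgres_error(restart_errors)
if pg_error is not None:
    # Print the fields that point at the root cause.
    print(pg_error["status_code"], "|", pg_error["hint"], "|", pg_error["message"])
```

The walk is deliberately generic, so it should not depend on the exact nesting of `extensions` and `internal`, which may differ between error types.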
a
The problems started yesterday at 1:52 PM PST when multiple flows were triggered multiple times at exactly the same time. I was working on syncing Fivetran through Prefect using Fivetransync task. This problem resolved itself automatically a little while after. Then I was able to test and implement Fivetransync through Prefect successfully. It ran a few times successfully and then started failing, and now I am unable to restart or cancel the flows.
n
Right, if your db is unable to handle transactions your flows can’t really run against the API
It’s possible you freed up some disk space with whatever you were doing yesterday but that the issue reappeared because you’re writing a lot
a
Which database are you suggesting is out of space? The database that Prefect uses internally? That database?
n
Sorry I must be misunderstanding - if you’re running Prefect Server then the database is the one that you spun up by running prefect server start
a
Ah okay, thanks. I am sharing this information internally with my team. I'll be back if I have more questions. I appreciate your help, thank you!
n
Understood, happy to help!
a
So my team wants to clarify that we're using the Prefect-provided Helm charts. Does that change anything here?
All of our databases have free disk space
Is there an internal database managed by Prefect that is NOT our RDS? Does prefect server start start that DB? Could it be that that DB is running out of space?
n
Hm either way you have a database pod in your k8s cluster that doesn’t seem to be working quite right; the error message you’re getting from your API hints that whatever database the API is configured to hit isn’t able to write
Yes @Apoorva Desai - check out the Prefect Server helm chart database section
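One quick way to confirm a full volume is to check free space on the mount that holds the Postgres data. This is a rough sketch using only the Python standard library. It assumes you can run Python somewhere that sees that mount; the `disk_report` helper, the `"/"` path, and the 10% threshold are placeholders, not anything defined by Prefect or the Helm chart.

```python
import shutil


def disk_report(path: str, min_free_fraction: float = 0.10) -> dict:
    """Summarize disk usage for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_gib": round(usage.total / 2**30, 2),
        "free_gib": round(usage.free / 2**30, 2),
        "free_fraction": round(free_fraction, 4),
        # True when less than min_free_fraction of the volume is free.
        "low": free_fraction < min_free_fraction,
    }


# "/" is a placeholder; point this at the mount that holds the Postgres data.
report = disk_report("/")
print(report)
```

Running something like this on a schedule and alerting when `low` is true would flag the problem before the database stops accepting writes.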
a
Thanks again, brb 😅
We've identified the problem and you were right, it was a disk space issue. Thank you so much for all your help!
Prefect Community rocks!
Side note, we saw that Prefect ate about 32 GB of memory in 23 hours. What does Prefect store in this backend database?
Is there a way I can configure it to delete history older than x?
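The thread does not record a built-in retention setting, so the following is only a sketch of what manual pruning could look like. It builds a parameterized DELETE and does not execute anything. The table and column names (`"public"."task_run_state"`, `"timestamp"`) are taken from the INSERT statement in the restart error above; whether it is safe to delete those rows directly is an unverified assumption, so try it against a copy of the database first. The `build_prune_statement` helper and the `%s` placeholder style (as used by drivers such as psycopg2) are also assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def build_prune_statement(
    max_age_days: int, now: Optional[datetime] = None
) -> Tuple[str, Tuple[datetime]]:
    """Build a DELETE for task run states older than max_age_days.

    Returns (sql, params) and leaves execution to the caller.
    """
    if max_age_days <= 0:
        raise ValueError("max_age_days must be positive")
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    # Names come from the INSERT in the error above; safety is unverified.
    sql = 'DELETE FROM "public"."task_run_state" WHERE "timestamp" < %s'
    return sql, (cutoff,)


# Example: keep 30 days of history, measured from a fixed reference time.
reference = datetime(2022, 7, 27, tzinfo=timezone.utc)
sql, params = build_prune_statement(30, now=reference)
print(sql)
print(params[0].isoformat())
```

Passing the cutoff as a query parameter rather than formatting it into the string leaves quoting and timezone handling to the database driver.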
a
Thank you!