
Apoorva Desai

07/27/2022, 5:55 PM
Hi, we've been having some issues with Prefect. A couple of hour-long flows have been running for about 14x the time they usually take, and I can't seem to cancel or restart them. Logs in thread...
Failed to set task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)"))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1598, in set_task_run_state
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 452, in graphql
    result = self.post(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 407, in post
    response = self._request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 641, in _request
    response = self._send_request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 506, in _send_request
    response = session.post(
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 577, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)
Failed to retrieve task state with error: ClientError([{'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_or_create_task_run_info'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}])
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 154, in initialize_run
    task_run_info = self.client.get_task_run_info(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1479, in get_task_run_info
    result = self.graphql(mutation)  # type: Any
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_or_create_task_run_info'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
Restarting the flow gives this error:
Error: GraphQL error: [{'extensions': {'internal': {'statement': 'WITH "task_run_state__mutation_result_alias" AS (INSERT INTO "public"."task_run_state" ( "state", "created", "tenant_id", "task_run_id", "result", "version", "serialized_state", "start_time", "id", "updated", "message", "timestamp" ) VALUES ((\'Pending\')::varchar, DEFAULT, (\'33424649-439b-4eb5-9741-9513015f8880\')::uuid, (\'d2a6f65b-aab6-4221-93f7-356ebd6561fc\')::uuid, (NULL)::jsonb, (\'4\')::integer, (\'{"context":{},"_result":{"__version__":"1.2.0+10.gafda99411","type":"NoResultType"},"__version__":"1.2.0+10.gafda99411","cached_inputs":{},"type":"Pending","message":"null restarted this flow run"}\')::jsonb, (NULL)::timestamptz, (\'fb9fca86-1b82-420f-b19d-b8299860d5bd\')::uuid, DEFAULT, (\'null restarted this flow run\')::varchar, (\'2022-07-27T17:42:45.772814Z\')::timestamptz) RETURNING * , (\'true\')::boolean AS "check__constraint"), "task_run_state__all_columns_alias" AS (SELECT "id" , "tenant_id" , "task_run_id" , "timestamp" , "state" , "message" , "result" , "start_time" , "serialized_state" , "created" , "updated" , "version" FROM "task_run_state__mutation_result_alias" ) SELECT json_build_object(\'returning\', (SELECT coalesce(json_agg("root" ), \'[]\' ) AS "root" FROM (SELECT row_to_json((SELECT "_1_e" FROM (SELECT "_0_root.base"."id" AS "id" ) AS "_1_e" ) ) AS "root" FROM (SELECT * FROM "task_run_state__all_columns_alias" WHERE (\'true\') ) AS "_0_root.base" ) AS "_2_root" ) ) , (SELECT coalesce(bool_and("check__constraint" ), \'true\' ) FROM "task_run_state__mutation_result_alias" ) ', 'prepared': False, 'error': {'exec_status': 'FatalError', 'hint': 'Check free disk space.', 'message': 'could not extend file "base/17149/17749.2": No space left on device', 'status_code': '53100', 'description': None}, 'arguments': []}, 'path': '$.selectionSet.insert_task_run_state.args.objects', 'code': 'unexpected'}, 'message': 'database query error'}]
Trying to cancel the flows says:
Something went wrong when trying to cancel this flow run, please try again.
And then there's also this error that I am struggling to identify the root of:
Error getting flow run info
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/flow_runner.py", line 188, in interrupt_if_cancelling
    flow_run_info = self.client.get_flow_run_info(flow_run_id)
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1240, in get_flow_run_info
    result = self.graphql(query).data.flow_run_by_pk  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'connection error', 'locations': [{'line': 2, 'column': 5}], 'path': ['flow_run_by_pk'], 'extensions': {'internal': 'FATAL:  the database system is in recovery mode\nFATAL:  the database system is in recovery mode\n', 'path': '$', 'code': 'postgres-error', 'exception': {'message': 'connection error'}}}]

Taylor Curran

07/27/2022, 6:15 PM
Hi Apoorva, what is the prefect version?

Apoorva Desai

07/27/2022, 6:21 PM
1.2.0
@Taylor Curran gentle bump as our production is broken

nicholas

07/27/2022, 7:35 PM
Hi @Apoorva Desai - can you give us a bit more info on your system? Are you running in Prefect Cloud or Prefect Server?

Apoorva Desai

07/27/2022, 7:37 PM
Prefect Server

nicholas

07/27/2022, 7:37 PM
Got it - it looks like this is an infrastructure problem; your database appears to be down.
Ah yup ok
If you look in the second message, it says that your disk space is full; your database is unable to write
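For reference, one quick way to confirm which tables in the Server's Postgres database are actually taking up the space is to query Postgres directly from Python. A minimal sketch, assuming psycopg2 is available; the connection details below are placeholders, not values from this thread:
```python
# Rough diagnostic sketch: report the total database size and the largest tables
# in the Prefect Server Postgres database. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="prefect-postgresql.data",  # assumed service name; adjust to your cluster
    dbname="prefect_server",
    user="prefect",
    password="...",
)
with conn, conn.cursor() as cur:
    # Overall size of the database the API writes to
    cur.execute("SELECT pg_size_pretty(pg_database_size(current_database()));")
    print("total:", cur.fetchone()[0])

    # Ten largest tables, including indexes and TOAST data
    cur.execute(
        """
        SELECT relname,
               pg_size_pretty(pg_total_relation_size(relid)) AS total_size
        FROM pg_catalog.pg_statio_user_tables
        ORDER BY pg_total_relation_size(relid) DESC
        LIMIT 10;
        """
    )
    for table, size in cur.fetchall():
        print(table, size)
conn.close()
```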

Apoorva Desai

07/27/2022, 7:39 PM
The problems started yesterday at 1:52 PM PST when multiple flows were triggered multiple times at exactly the same time. I was working on syncing Fivetran through Prefect using the FivetranSyncTask. This problem resolved itself automatically a little while later. Then I was able to test and implement the Fivetran sync through Prefect successfully. It ran a few times successfully, then started failing, and now I am unable to restart or cancel the flows.

nicholas

07/27/2022, 7:40 PM
Right, if your db is unable to handle transactions, your flows can’t really run against the API
It’s possible you freed up some disk space with whatever you were doing yesterday, but the issue reappeared because you’re writing a lot

Apoorva Desai

07/27/2022, 7:41 PM
Which database are you suggesting is out of space? The one that Prefect uses internally? That database?

nicholas

07/27/2022, 7:43 PM
Sorry, I must be misunderstanding - if you’re running Prefect Server, then the database is the one that you spun up by running prefect server start

Apoorva Desai

07/27/2022, 7:44 PM
Ah okay, thanks. I am sharing this information internally with my team. I'll be back if I have more questions. I appreciate your help, thank you!

nicholas

07/27/2022, 7:45 PM
Understood, happy to help!

Apoorva Desai

07/27/2022, 7:49 PM
So my team wants to clarify that we're using the Prefect-provided Helm charts. Does that change anything here?
All of our databases have free disk space
Is there an internal database managed by Prefect that is NOT our RDS? Does prefect server start start that DB? Could it be that that DB is running out of space?

nicholas

07/27/2022, 7:55 PM
Hm either way you have a database pod in your k8s cluster that doesn’t seem to be working quite right; the error message you’re getting from your API hints that whatever database the API is configured to hit isn’t able to write
Yes @Apoorva Desai - check out the Prefect Server helm chart database section

Apoorva Desai

07/27/2022, 7:57 PM
Thanks again, brb 😅
We've identified the problem and you were right, it was a disk space issue. Thank you so much for all your help!
Prefect Community rocks!
yay
Side note, we saw that Prefect ate about 32 GB of memory in 23 hours. What does Prefect store in this backend database?
Is there a way I can configure it to delete history older than x?
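On the retention question: Prefect Server 1.x stores flow runs, task runs, states, and logs in that Postgres database, and as far as I know it does not ship a built-in retention setting. One common workaround is to delete old flow runs through the GraphQL API, which should cascade to their task runs and logs. A rough sketch, assuming your Server exposes the standard delete_flow_run mutation (verify against the interactive API schema before running this against production):
```python
# Rough sketch: prune flow runs older than a cutoff via the Server GraphQL API.
# Assumes the standard delete_flow_run mutation; check your schema before use.
import pendulum
from prefect import Client

client = Client()  # uses your configured Server (apollo) endpoint
cutoff = pendulum.now("UTC").subtract(days=30).isoformat()

# Hasura-style filter on end_time; adjust field names to match your schema.
query = """
query {
  flow_run(where: {end_time: {_lt: "%s"}}) {
    id
  }
}
""" % cutoff

old_runs = client.graphql(query).data.flow_run

for run in old_runs:
    mutation = """
    mutation {
      delete_flow_run(input: {flow_run_id: "%s"}) {
        success
      }
    }
    """ % run.id
    client.graphql(mutation)
```
Note that plain deletes free the space for reuse inside Postgres but won't shrink the files on disk unless you run VACUUM FULL afterwards.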

Apoorva Desai

07/28/2022, 3:40 PM
Thank you!