Prefect Community #ask-community

Apoorva Desai

07/27/2022, 5:55 PM

Hi, we've been having some issues with Prefect. A couple of our flows have been running about 14x as long as they usually take, and I can't seem to cancel and restart them. Logs in thread...

✅ 1

Apoorva Desai

07/27/2022, 5:56 PM

Failed to set task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)"))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1598, in set_task_run_state
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 452, in graphql
    result = self.post(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 407, in post
    response = self._request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 641, in _request
    response = self._send_request(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 506, in _send_request
    response = session.post(
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 577, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='prefect-apollo.data', port=4200): Read timed out. (read timeout=15)

Failed to retrieve task state with error: ClientError([{'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_or_create_task_run_info'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}])
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 154, in initialize_run
    task_run_info = self.client.get_task_run_info(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1479, in get_task_run_info
    result = self.graphql(mutation)  # type: Any
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_or_create_task_run_info'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-graphql.data:4201/graphql/ failed, reason: connect ECONNREFUSED 10.26.198.68:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]

Restarting the flow gives this error:

Error: GraphQL error: [{'extensions': {'internal': {'statement': 'WITH "task_run_state__mutation_result_alias" AS (INSERT INTO "public"."task_run_state" ( "state", "created", "tenant_id", "task_run_id", "result", "version", "serialized_state", "start_time", "id", "updated", "message", "timestamp" ) VALUES ((\'Pending\')::varchar, DEFAULT, (\'33424649-439b-4eb5-9741-9513015f8880\')::uuid, (\'d2a6f65b-aab6-4221-93f7-356ebd6561fc\')::uuid, (NULL)::jsonb, (\'4\')::integer, (\'{"context":{},"_result":{"__version__":"1.2.0+10.gafda99411","type":"NoResultType"},"__version__":"1.2.0+10.gafda99411","cached_inputs":{},"type":"Pending","message":"null restarted this flow run"}\')::jsonb, (NULL)::timestamptz, (\'fb9fca86-1b82-420f-b19d-b8299860d5bd\')::uuid, DEFAULT, (\'null restarted this flow run\')::varchar, (\'2022-07-27T17:42:45.772814Z\')::timestamptz) RETURNING * , (\'true\')::boolean AS "check__constraint"), "task_run_state__all_columns_alias" AS (SELECT "id" , "tenant_id" , "task_run_id" , "timestamp" , "state" , "message" , "result" , "start_time" , "serialized_state" , "created" , "updated" , "version" FROM "task_run_state__mutation_result_alias" ) SELECT json_build_object(\'returning\', (SELECT coalesce(json_agg("root" ), \'[]\' ) AS "root" FROM (SELECT row_to_json((SELECT "_1_e" FROM (SELECT "_0_root.base"."id" AS "id" ) AS "_1_e" ) ) AS "root" FROM (SELECT * FROM "task_run_state__all_columns_alias" WHERE (\'true\') ) AS "_0_root.base" ) AS "_2_root" ) ) , (SELECT coalesce(bool_and("check__constraint" ), \'true\' ) FROM "task_run_state__mutation_result_alias" ) ', 'prepared': False, 'error': {'exec_status': 'FatalError', 'hint': 'Check free disk space.', 'message': 'could not extend file "base/17149/17749.2": No space left on device', 'status_code': '53100', 'description': None}, 'arguments': []}, 'path': '$.selectionSet.insert_task_run_state.args.objects', 'code': 'unexpected'}, 'message': 'database query error'}]

Apoorva Desai

07/27/2022, 5:56 PM

Trying to cancel the flows says:

Something went wrong when trying to cancel this flow run, please try again.

Apoorva Desai

07/27/2022, 5:57 PM

And then there's also this error that I am struggling to identify the root of:

Error getting flow run info
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/flow_runner.py", line 188, in interrupt_if_cancelling
    flow_run_info = self.client.get_flow_run_info(flow_run_id)
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1240, in get_flow_run_info
    result = self.graphql(query).data.flow_run_by_pk  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'connection error', 'locations': [{'line': 2, 'column': 5}], 'path': ['flow_run_by_pk'], 'extensions': {'internal': 'FATAL:  the database system is in recovery mode\nFATAL:  the database system is in recovery mode\n', 'path': '$', 'code': 'postgres-error', 'exception': {'message': 'connection error'}}}]

Taylor Curran

07/27/2022, 6:15 PM

Hi Apoorva, what is the prefect version?

Apoorva Desai

07/27/2022, 6:21 PM

1.2.0

Apoorva Desai

07/27/2022, 7:19 PM

@Taylor Curran gentle bump as our production is broken

nicholas

07/27/2022, 7:35 PM

Hi @Apoorva Desai - can you give us a bit more info on your system? Are you running in Prefect Cloud or Prefect Server?

Apoorva Desai

07/27/2022, 7:37 PM

Prefect Server

nicholas

07/27/2022, 7:37 PM

Got it - it looks like this is an infrastructure problem; your database appears to be down.

nicholas

07/27/2022, 7:38 PM

Ah yup ok

nicholas

07/27/2022, 7:38 PM

If you look in the second message, it says that your disk space is full; your database is unable to write
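For readers skimming the logs, here is a minimal sketch of how the underlying Postgres error can be pulled out of a GraphQL error payload like the restart error quoted above. The helper name iter_postgres_errors is hypothetical and not part of Prefect; the sample is a trimmed copy of that payload with the long SQL statement left out. Status code 53100 is the Postgres error code for disk_full.

```python
from typing import Any, Dict, Iterator


def iter_postgres_errors(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every nested dict that looks like a Postgres error record.

    Hypothetical helper: it assumes an error record carries both
    'status_code' and 'message', as in the payload from this thread.
    """
    if isinstance(node, dict):
        if "status_code" in node and "message" in node:
            yield node
        # Keep walking, since error records sit several levels deep.
        for value in node.values():
            yield from iter_postgres_errors(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_postgres_errors(item)


# Trimmed copy of the restart error from the thread (SQL statement omitted).
sample_errors = [
    {
        "extensions": {
            "internal": {
                "prepared": False,
                "error": {
                    "exec_status": "FatalError",
                    "hint": "Check free disk space.",
                    "message": 'could not extend file "base/17149/17749.2": '
                               "No space left on device",
                    "status_code": "53100",
                    "description": None,
                },
                "arguments": [],
            },
            "path": "$.selectionSet.insert_task_run_state.args.objects",
            "code": "unexpected",
        },
        "message": "database query error",
    }
]

found = list(iter_postgres_errors(sample_errors))
for err in found:
    # Print the fields that point at the root cause.
    print(err["status_code"], "|", err["message"], "|", err["hint"])
```

The top-level message ("database query error") is generic; the useful detail is nested under extensions.internal.error, which is why the sketch walks the whole structure instead of reading only the first level.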

Apoorva Desai

07/27/2022, 7:39 PM

The problems started yesterday at 1:52 PM PST when multiple flows were triggered multiple times at exactly the same time. I was working on syncing Fivetran through Prefect using the Fivetransync task. This problem resolved itself automatically a little while after. Then I was able to test and implement Fivetransync through Prefect successfully. It ran a few times successfully and then started failing, and now I am unable to restart or cancel the flows.

nicholas

07/27/2022, 7:40 PM

Right, if your db is unable to handle transactions your flows can’t really run against the API

nicholas

07/27/2022, 7:40 PM

It’s possible you freed up some disk space with whatever you were doing yesterday but that the issue reappeared because you’re writing a lot

Apoorva Desai

07/27/2022, 7:41 PM

Which database are you suggesting is out of space? The database that Prefect uses internally?

nicholas

07/27/2022, 7:43 PM

Sorry I must be misunderstanding - if you’re running Prefect Server then the database is the one that you spun up by running prefect server start

Apoorva Desai

07/27/2022, 7:44 PM

Ah okay, thanks. I am sharing this information internally with my team. I'll be back if I have more questions. I appreciate your help, thank you!

nicholas

07/27/2022, 7:45 PM

Understood, happy to help!

marvin 1

Apoorva Desai

07/27/2022, 7:49 PM

So my team wants to clarify that we're using the Prefect-provided Helm charts. Does that change anything here?

Apoorva Desai

07/27/2022, 7:51 PM

All of our databases have free disk space

Apoorva Desai

07/27/2022, 7:54 PM

Is there an internal database managed by Prefect that is NOT our RDS? Does running prefect server start launch that DB? Could it be that that DB is running out of space?

nicholas

07/27/2022, 7:55 PM

Hm either way you have a database pod in your k8s cluster that doesn’t seem to be working quite right; the error message you’re getting from your API hints that whatever database the API is configured to hit isn’t able to write

nicholas

07/27/2022, 7:56 PM

Yes @Apoorva Desai - check out the Prefect Server helm chart database section

Apoorva Desai

07/27/2022, 7:57 PM

Thanks again, brb 😅

Apoorva Desai

07/27/2022, 8:07 PM

We've identified the problem and you were right, it was a disk space issue. Thank you so much for all your help!

Apoorva Desai

07/27/2022, 8:07 PM

Prefect Community rocks!

🙌 1

marvin 3

🙏 1

🦜 2

Apoorva Desai

07/27/2022, 8:07 PM

yay

Apoorva Desai

07/27/2022, 8:32 PM

Side note, we saw that Prefect ate about 32 GB of memory in 23 hours. What does Prefect store in this backend database?

✅ 1

Apoorva Desai

07/27/2022, 8:32 PM

Is there a way I can configure it to delete history older than x?

Anna Geller

07/28/2022, 12:30 AM

some resources on that:
• https://discourse.prefect.io/t/how-can-i-free-up-postgres-database-space/290
• https://discourse.prefect.io/t/how-can-i-delete-flow-runs-older-than-30-days-using-graphql-api-to-clean-up-database-space-in-prefect-server/136
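As a rough illustration of the "delete history older than x" idea, here is a minimal sketch that builds the GraphQL documents and runs them through a caller-supplied function. Everything here is an assumption to check against your own server's GraphQL schema before use: the flow_run query with a created filter, the delete_flow_run mutation and its input shape, and the helper names (cutoff_iso, old_flow_runs_query, delete_flow_run_mutation, delete_old_flow_runs), which are hypothetical. The send argument stands in for whatever sends a query and returns a dict-like result.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, List, Optional


def cutoff_iso(days: int, now: Optional[datetime] = None) -> str:
    """Return an ISO 8601 timestamp for `days` ago (UTC by default)."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=days)).isoformat()


def old_flow_runs_query(cutoff: str) -> str:
    # Assumed schema: flow runs expose a `created` timestamp that can be filtered.
    return 'query { flow_run(where: {created: {_lt: "%s"}}) { id } }' % cutoff


def delete_flow_run_mutation(flow_run_id: str) -> str:
    # Assumed schema: a delete_flow_run mutation taking a flow_run_id input.
    return (
        'mutation { delete_flow_run(input: {flow_run_id: "%s"}) { success } }'
        % flow_run_id
    )


def delete_old_flow_runs(send: Callable[[str], Any], days: int) -> List[str]:
    """Find flow runs older than `days` and delete them one by one.

    `send` is a hypothetical callable: it takes a GraphQL document and
    returns a mapping shaped like {"data": {...}}.
    """
    result = send(old_flow_runs_query(cutoff_iso(days)))
    run_ids = [run["id"] for run in result["data"]["flow_run"]]
    for run_id in run_ids:
        send(delete_flow_run_mutation(run_id))
    return run_ids


# Dry run with a stand-in sender, so nothing is actually deleted.
def fake_send(document: str) -> dict:
    if document.startswith("query"):
        return {"data": {"flow_run": [{"id": "run-a"}, {"id": "run-b"}]}}
    return {"data": {"delete_flow_run": {"success": True}}}


print(delete_old_flow_runs(fake_send, days=30))
```

Against a real server, send would wrap the client's graphql call (the same call that appears in the tracebacks above); how that is wired up is left as an assumption. Note also that deleting rows in Postgres does not by itself return disk space to the operating system, so the first linked post on freeing up database space is worth reading alongside this.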

Apoorva Desai

07/28/2022, 3:40 PM

Thank you!
