https://prefect.io logo
Title
d

David Steiner Sand

02/14/2023, 12:44 PM
I’ve been using prefect in production for weeks (I’m not using prefect cloud) and today my prefect server seemed very slow. Looking at the logs I saw these warnings:
12:22:49.567 | WARNING | prefect.orion.services.failexpiredpauses - FailExpiredPauses took 5.385518 seconds to run, which is longer than its loop interval of 5.0 seconds.
12:22:49.881 | WARNING | prefect.orion.services.flowrunnotifications - FlowRunNotifications took 5.689214 seconds to run, which is longer than its loop interval of 4 seconds.
12:22:50.319 | WARNING | prefect.orion.services.marklateruns - MarkLateRuns took 6.143901 seconds to run, which is longer than its loop interval of 5.0 seconds.
12:22:55.451 | WARNING | prefect.orion.services.marklateruns - MarkLateRuns took 5.129949 seconds to run, which is longer than its loop interval of 5.0 seconds.
12:22:55.669 | WARNING | prefect.orion.services.failexpiredpauses - FailExpiredPauses took 6.088298 seconds to run, which is longer than its loop interval of 5.0 seconds.
Looks like there are some timeouts happening. I’m using a DNS for the prefect server, and both the prefect server and agents are pointing to this DNS. So the requests might be slow due to it. My question would be, how can I avoid these warnings (increase the timeouts)? I’ve searched the prefect settings (https://docs.prefect.io/api-ref/prefect/settings/) for envs which have default values of
4
and
5
seconds (as seen in the logs) and I found this ones:
PREFECT_ORION_API_KEEPALIVE_TIMEOUT
PREFECT_ORION_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS
PREFECT_ORION_SERVICES_LATE_RUNS_LOOP_SECONDS
PREFECT_ORION_DATABASE_CONNECTION_TIMEOUT
Increasing them to 20 seconds did not seem to make the warning dissapear. Any other suggestions? Behavior observer with prefect docker images 2.7.8 and 2.8.0. Thanks for your awesome support in advance 🙂
By the way my prefect server logs are flooded with logs, as if errors were happening all the time. E.g.:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/services/loop_service.py", line 78, in start
    await self.run_once()
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/database/dependencies.py", line 117, in async_wrapper
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/services/pause_expirations.py", line 53, in run_once
    result = await session.execute(query)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/ext/asyncio/session.py", line 214, in execute
    result = await greenlet_spawn(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 126, in greenlet_spawn
    result = context.throw(*sys.exc_info())
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1714, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2128, in _handle_dbapi_exception
    util.raise_(exc_info[1], with_traceback=exc_info[2])
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 479, in execute
    self._adapt_connection.await_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 68, in await_only
    return current.driver.switch(awaitable)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 121, in greenlet_spawn
    value = await result
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 454, in _prepare_and_execute
    self._handle_exception(error)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 389, in _handle_exception
    self._adapt_connection._handle_exception(error)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 684, in _handle_exception
    raise error
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 418, in _prepare_and_execute
    prepared_stmt, attributes = await adapt_connection._prepare(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 660, in _prepare
    prepared_stmt = await self._connection.prepare(operation)
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 565, in prepare
    return await self._prepare(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 583, in _prepare
    stmt = await self._get_statement(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 412, in _get_statement
    types, intro_stmt = await self._introspect_types(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 458, in _introspect_types
    return await self.__execute(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 1683, in __execute
    return await self._do_execute(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 1733, in _do_execute
    result = await executor(stmt, timeout)
  File "asyncpg/protocol/protocol.pyx", line 201, in bind_execute
asyncio.exceptions.TimeoutError
12:33:16.102 | WARNING | prefect.orion.services.failexpiredpauses - FailExpiredPauses took 10.913225 seconds to run, which is longer than its loop interval of 5.0 seconds.
12:33:16.108 | ERROR   | prefect.orion.services.marklateruns - Unexpected error in: TimeoutError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/services/loop_service.py", line 78, in start
    await self.run_once()
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/database/dependencies.py", line 117, in async_wrapper
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/services/late_runs.py", line 67, in run_once
    result = await session.execute(query)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/ext/asyncio/session.py", line 214, in execute
    result = await greenlet_spawn(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 126, in greenlet_spawn
    result = context.throw(*sys.exc_info())
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1714, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2128, in _handle_dbapi_exception
    util.raise_(exc_info[1], with_traceback=exc_info[2])
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 479, in execute
    self._adapt_connection.await_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 68, in await_only
    return current.driver.switch(awaitable)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 121, in greenlet_spawn
    value = await result
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 454, in _prepare_and_execute
    self._handle_exception(error)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 389, in _handle_exception
    self._adapt_connection._handle_exception(error)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 684, in _handle_exception
    raise error
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 418, in _prepare_and_execute
    prepared_stmt, attributes = await adapt_connection._prepare(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 660, in _prepare
    prepared_stmt = await self._connection.prepare(operation)
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 565, in prepare
    return await self._prepare(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 583, in _prepare
    stmt = await self._get_statement(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 412, in _get_statement
    types, intro_stmt = await self._introspect_types(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 458, in _introspect_types
    return await self.__execute(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 1683, in __execute
    return await self._do_execute(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 1733, in _do_execute
    result = await executor(stmt, timeout)
  File "asyncpg/protocol/protocol.pyx", line 201, in bind_execute
asyncio.exceptions.TimeoutError
12:33:16.113 | WARNING | prefect.orion.services.marklateruns - MarkLateRuns took 10.926965 seconds to run, which is longer than its loop interval of 5.0 seconds.
12:33:17.372 | ERROR   | prefect.orion - Encountered exception in request:
This happens even when no prefect agent is running.
z

Zanie

02/14/2023, 3:54 PM
Hm this seems weird. Is your database getting large? It’s possible these services are just taking a long time to process the data.
Are you using Postgres?
Also, increasing the LOOP_SECONDS for these services resulted in the same error messages? Or did the “which is longer than its loop interval of ” change?
d

David Steiner Sand

02/14/2023, 6:12 PM
Update: now it seems to work again, but I didn’t change anything. Yes, I’m using postgres, but size doesn’t seem to be the problem: I also tried using a new empty database and the error persisted. Increasing the
LOOP_SECONDS
envs didn’t seem to change anything, as the messages
which is longer than its loop interval of
still appeared and showed the same values (4 and 5 seconds).
z

Zanie

02/14/2023, 6:17 PM
Hm very weird. How are you running your server?
d

David Steiner Sand

02/15/2023, 12:05 AM
In the docker image
prefecthq/prefect:2.7.12-python3.9
with the command:
PREFECT_API_URL="<https://my-dns-host/api>" PREFECT_ORION_DATABASE_CONNECTION_URL="postgresql+asyncpg:my-postgres-host/db"  PREFECT_ORION_DATABASE_CONNECTION_TIMEOUT="20" PREFECT_ORION_DATABASE_TIMEOUT="10" prefect orion start --host=0.0.0.0
Today I saw this new error:
11:20:09.156 | ERROR   | prefect.orion.services.recentdeploymentsscheduler - Unexpected error in: CannotConnectNowError('the database system is in recovery mode')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/services/loop_service.py", line 78, in start
    await self.run_once()
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/database/dependencies.py", line 117, in async_wrapper
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/prefect/orion/services/scheduler.py", line 89, in run_once
    result = await session.execute(query)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/ext/asyncio/session.py", line 214, in execute
    result = await greenlet_spawn(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 126, in greenlet_spawn
    result = context.throw(*sys.exc_info())
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1713, in execute
    conn = self._connection_for_bind(bind)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
    return self._transaction._connection_for_bind(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind
    conn = bind.connect()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/future/engine.py", line 406, in connect
    return super(Engine, self).connect()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3315, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 327, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 498, in checkout
    rec._checkin_failed(err, _fairy_was_created=False)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    dbapi_connection = rec.get_connection()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
    self.__connect()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 691, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 780, in connect
    await_only(self.asyncpg.connect(*arg, **kw)),
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 68, in await_only
    return current.driver.switch(awaitable)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 121, in greenlet_spawn
    value = await result
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 2092, in connect
    return await connect_utils._connect(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connect_utils.py", line 881, in _connect
    return await _connect_addr(
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connect_utils.py", line 773, in _connect_addr
    return await __connect_addr(params, timeout, True, *args)
  File "/usr/local/lib/python3.9/site-packages/asyncpg/connect_utils.py", line 831, in __connect_addr
    await compat.wait_for(connected, timeout=timeout)
  File "/usr/local/lib/python3.9/site-packages/asyncpg/compat.py", line 56, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
    return fut.result()
asyncpg.exceptions.CannotConnectNowError: the database system is in recovery mode
Some postgres logs:
2023-02-15 11:24:42.952 UTC [1] DETAIL:  Failed process was running: WITH RECURSIVE typeinfo_tree(
	    oid, ns, name, kind, basetype, elemtype, elemdelim,
	    range_subtype, attrtypoids, attrnames, depth)
	AS (
	    SELECT
	        ti.oid, ti.ns, ti.name, ti.kind, ti.basetype,
	        ti.elemtype, ti.elemdelim, ti.range_subtype,
	        ti.attrtypoids, ti.attrnames, 0
	    FROM
	            (
	        SELECT
	            t.oid                           AS oid,
	            ns.nspname                      AS ns,
	            t.typname                       AS name,
	            t.typtype                       AS kind,
	            (CASE WHEN t.typtype = 'd' THEN
	                (WITH RECURSIVE typebases(oid, depth) AS (
	                    SELECT
	                        t2.typbasetype      AS oid,
	                        0                   AS depth
	                    FROM
	                        pg_type t2
	                    WHERE
	                        t2.oid = t.oid
	
	                    UNION ALL
	
	                    SELECT
	                        t2.typbasetype      AS oid,
	      
2023-02-15 11:24:42.952 UTC [1] LOG:  terminating any other active server processes
postgres: postgres orion 192.168.2.35(54236) SELECT: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
2023-02-15 11:24:43.166 UTC [145] FATAL:  the database system is in recovery mode
2023-02-15 11:24:43.606 UTC [146] FATAL:  the database system is in recovery mode
2023-02-15 11:24:44.947 UTC [147] FATAL:  the database system is in recovery mode
z

Zanie

02/15/2023, 7:48 PM
It looks like your database server is completely broken; are you allocating enough memory to it?
I’ve never seen issues like this.
d

David Steiner Sand

02/15/2023, 8:19 PM
@Zanie after some time I think I found the problem: Looks like one of my flow runs generates a huge amount of logs (which is inserted in the prefect database by the prefect server). Seems like the DB cannot handle well the operation to insert the logs and it fails, showing the error above. The DB also fails to retrieve the logs, so I am able to replicate the problem like this: I go to the UI (see image), everything works well and the first logs appear after a while. But when I click on “Newest to oldest”, triggering the DB to revert the order of the logs and send it to the prefect server, the DB fails and the prefect server goes down. So in the end it looks like a problem in the DB, not in prefect.
And yes, there is enough memory allocated for the DB.
z

Zanie

02/15/2023, 8:26 PM
Ah okay great thanks for sharing!
What order of magnitude of logs?
d

David Steiner Sand

02/15/2023, 8:39 PM
I’m not really sure… but I’ll let you know if I find out!
s

Scott Walsh

04/25/2023, 5:17 PM
@Zanie @David Steiner Sand Is this really a DB issue and not an issue with how Prefect interacts with the DB? I am having a similar issue with a flow that generates tens of thousands of logs. Although I can just reduce the number of logs created, it seems like there is an opportunity for Prefect Server to be more smart in how it manages this scenario.
z

Zanie

04/25/2023, 5:25 PM
I mean, it's hard to say it's not both. We use pubsub in Prect Cloud because the database isn't a great place for high magnitudes of logs but we don't want to add more infrastructure to the open source stack.
👍 2
z

Zachary Lee

05/02/2023, 2:28 PM
Fwiw I was seeing a similar issue -- would get timeouts and these loop interval warnings when making queries via the flow runs page in the UI. Initially I deleted a bunch of old flow runs which helped, but these were still happening. I'm self-hosting on kubernetes and noticed the orion pod's limits (mainly cpu) were set kind of low. I increased them and now I seem to have eliminated the warnings + timeouts I used to see when making big queries via flow runs page in UI!