# ask-marvin
@Marvin I have 12 workers set up in a deployment. 11 of them are working without error. 1 of them runs for ~1 day and then fails. The work pool goes offline (from the Prefect dashboard) and I have to restart the worker. This is what I see in the log of the failed worker. What is the problem?
Backing off due to consecutive errors, using increased interval of  80.0s.

Failed the last 3 attempts. Please check your environment and configuration.
Examples of recent errors:

Traceback (most recent call last):
  File "C:\Prefect\venv\Lib\site-packages\prefect\utilities\services.py", line 64, in critical_service_loop
    await workload()
  File "C:\Prefect\venv\Lib\site-packages\prefect\workers\base.py", line 983, in get_and_submit_flow_runs
    runs_response = await self._get_scheduled_flow_runs()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\venv\Lib\site-packages\prefect\workers\base.py", line 1142, in _get_scheduled_flow_runs
    await self.client.get_scheduled_flow_runs_for_work_pool(
  File "C:\Prefect\venv\Lib\site-packages\prefect\client\orchestration\_work_pools\client.py", line 586, in get_scheduled_flow_runs_for_work_pool
    response = await self.request(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\venv\Lib\site-packages\prefect\client\orchestration\base.py", line 53, in request
    return await self._client.send(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\venv\Lib\site-packages\prefect\client\base.py", line 361, in send
    response.raise_for_status()
  File "C:\Prefect\venv\Lib\site-packages\prefect\client\base.py", line 162, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url '<http://1.2.3.4:4200/api/work_pools/pool-12/get_scheduled_flow_runs>'
Response: {'exception_message': 'Internal Server Error'}
For more information check: <https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500>


An exception occurred.
Press any key to continue . . .
thought for 193.5 seconds
Short version: the worker isn’t the problem; your Prefect API server is. That 500 means the server errored while handling get_scheduled_flow_runs for that specific work pool. The worker then backs off, gives up after a few failures, and you see the pool/worker go “offline” in the UI until you restart it.

What to do next
- Check the API server logs at the time of the failure. That will show the real traceback behind the 500 and point to the root cause. Without that, we’re guessing.
- Make sure server and worker are on the same, current Prefect 3.x version. Several server-side 500s have been fixed across patch releases. Upgrading both sides is a surprisingly effective “fix.” See the Prefect releases page.
- Suspect bad state in that specific work pool:
  - If you’ve set pool-level concurrency limits or default infrastructure overrides, try removing/normalizing them and test again.
  - If the pool references anything that’s since been deleted (e.g., blocks in a base job template), re-save or recreate the pool.
  - Quick isolation test: create a fresh pool of the same type, point this worker at it, and see if the issue disappears after running for a while. If it does, migrate deployments and retire the old pool.
- If you’re self-hosting with SQLite, consider Postgres for production. A DB error will surface as a 500. Check your DB logs for constraint/timeout/lock errors.
- If you’re running behind a proxy, confirm it isn’t injecting 500s (less likely here, but worth ruling out).

References
- Endpoint involved: get_scheduled_flow_runs for work pools

If you can share:
- Prefect version (server and the failing worker)
- How you’re hosting the API (Cloud or self-hosted, and which DB)
- Worker type/OS for the problematic worker (your paths look Windows)
- The API server traceback around the 500
…I can help pinpoint the exact cause instead of making you play whack‑a‑worker for another day.
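In the meantime, if you’d rather trigger the failure on demand than wait another day, you can call the same endpoint the worker calls and watch the server logs while it runs. A minimal sketch, assuming Prefect 3.x, that PREFECT_API_URL in the worker’s profile points at your self-hosted server, and that the pool-name keyword is work_pool_name (the client method itself is the one in your traceback):

import asyncio

from prefect.client.orchestration import get_client


async def main() -> None:
    async with get_client() as client:
        # Same call the worker makes in prefect/workers/base.py::_get_scheduled_flow_runs.
        # If this raises a 500, the server-side traceback logged at that moment
        # is the one that matters.
        runs = await client.get_scheduled_flow_runs_for_work_pool(
            work_pool_name="pool-12",  # pool name taken from the failing URL
        )
        print(f"OK: {len(runs)} scheduled flow run(s) returned")


asyncio.run(main())

If this call fails reliably for pool-12 but succeeds against a freshly created pool of the same type, that points at state tied to the pool rather than at the worker host.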
Your message was too long, here's your message at the allowed limit: the server logs contain this
The above exception was the direct cause of the following exception:
[traceback repeated in full in the next message]
sqlalchemy.exc.DBAPIError: (sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.DeadlockDetectedError'>: deadlock detected
DETAIL:  Process 764039 waits for ShareLock on transaction 245134068; blocked by process 764041.
Process 764041 waits for ShareLock on speculative token 2816 of transaction 245134069; blocked by process 764039.
HINT:  See server log for query details.
[SQL: INSERT INTO task_run (flow_run_id, task_key, dynamic_key, flow_run_run_count, empirical_policy, task_inputs, tags, labels, name, run_count, expected_start_time, total_run_time, id, created, updated) VALUES ($1::UUID, $2::VARCHAR, $3::VARCHAR, $4::INTEGER, $5, $6, $7::JSONB, $8::JSONB, $9::VARCHAR, $10::INTEGER, $11::TIMESTAMP WITH TIME ZONE, $12::INTERVAL, $13::UUID, $14::TIMESTAMP WITH TIME ZONE, $15::TIMESTAMP WITH TIME ZONE) ON CONFLICT (id) DO UPDATE SET flow_run_id = $16::UUID, task_key = $17::VARCHAR, dynamic_key = $18::VARCHAR, flow_run_run_count = $19::INTEGER, empirical_policy = $20, task_inputs = $21, tags = $22::JSONB, labels = $23::JSONB, name = $24::VARCHAR, run_count = $25::INTEGER, expected_start_time = $26::TIMESTAMP WITH TIME ZONE, total_run_time = $27::INTERVAL, id = $28::UUID, updated = $29::TIMESTAMP WITH TIME ZONE WHERE task_run.state_timestamp < $30::TIMESTAMP WITH TIME ZONE]
[parameters: ('0198c7b5-4fab-7484-ab3e-cbb71749fb91', '_process_dataset-45a1753a', '86c81155-7fa1-4dd8-b981-131760e770cd', 0, '{"max_retries": 0, "retry_delay_seconds": 0.0, "retries": 0, "retry_delay": 0, "retry_jitter_factor": null}', '{
@Marvin the server log contains this
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Prefect\Server\venv\Lib\site-packages\prefect\server\utilities\messaging\memory.py", line 357, in _consume_loop
    await handler(message)
  File "C:\Prefect\Server\venv\Lib\site-packages\prefect\server\services\task_run_recorder.py", line 213, in message_handler
    await record_task_run_event(event)
  File "C:\Prefect\Server\venv\Lib\site-packages\prefect\server\services\task_run_recorder.py", line 172, in record_task_run_event
    await _insert_task_run(session, task_run, task_run_attributes)
  File "C:\Prefect\Server\venv\Lib\site-packages\prefect\server\services\task_run_recorder.py", line 57, in _insert_task_run
    await session.execute(
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\ext\asyncio\session.py", line 463, in execute
    result = await greenlet_spawn(
             ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\util\_concurrency_py3k.py", line 201, in greenlet_spawn
    result = context.throw(*sys.exc_info())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\orm\session.py", line 2365, in execute
    return self._execute_internal(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\orm\session.py", line 2251, in _execute_internal
    result: Result[Any] = compile_state_cls.orm_execute_statement(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\orm\bulk_persistence.py", line 1294, in orm_execute_statement
    result = conn.execute(
             ^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\engine\base.py", line 1415, in execute
    return meth(
           ^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\sql\elements.py", line 523, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\engine\base.py", line 1637, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\engine\base.py", line 1842, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\engine\base.py", line 1982, in _exec_single_context
    self._handle_dbapi_exception(
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\engine\base.py", line 2351, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\engine\base.py", line 1963, in _exec_single_context
    self.dialect.do_execute(
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\engine\default.py", line 943, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\dialects\postgresql\asyncpg.py", line 580, in execute
    self._adapt_connection.await_(
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\util\_concurrency_py3k.py", line 132, in await_only
    return current.parent.switch(awaitable)  # type: ignore[no-any-return,attr-defined] # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\util\_concurrency_py3k.py", line 196, in greenlet_spawn
    value = await result
            ^^^^^^^^^^^^
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\dialects\postgresql\asyncpg.py", line 558, in _prepare_and_execute
    self._handle_exception(error)
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\dialects\postgresql\asyncpg.py", line 508, in _handle_exception
    self._adapt_connection._handle_exception(error)
  File "C:\Prefect\Server\venv\Lib\site-packages\sqlalchemy\dialects\postgresql\asyncpg.py", line 792, in _handle_exception
    raise translated_error from error
sqlalchemy.exc.DBAPIError: (sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.DeadlockDetectedError'>: deadlock detected
DETAIL:  Process 764039 waits for ShareLock on transaction 245134068; blocked by process 764041.
Process 764041 waits for ShareLock on speculative token 2816 of transaction 245134069; blocked by process 764039.
HINT:  See server log for query details.
thought for 334.6 seconds
Ah, deadlocks. Postgres’s way of asking Prefect to be a little less enthusiastic.

What your log shows
- It’s a Postgres deadlock during an upsert into task_run from the TaskRunRecorder service: INSERT ... ON CONFLICT (id) DO UPDATE ... WHERE task_run.state_timestamp < ... (a rough sketch of this upsert shape is at the end of this message).
- Two transactions are waiting on each other (ShareLock + speculative insert), classic concurrent upsert contention.
- The “CancellationCleanup took longer than its loop interval” warnings are a symptom of server services getting slowed by DB contention.

Known reports and references
- Deadlocks reported after upgrading to 3.3.x: prefect#17767
- Another related performance/locking thread: prefect#17895
- Recent release note touching DB lock behavior (not the same code path, but still worth upgrading): 3.4.13 (see “Fix DB locks during DB clear”, PR #18684)

Quick triage questions
- Which Prefect version are you running?
- Postgres version?
- Single API/server instance or multiple replicas?
- Any custom server DB settings (pool size, timeouts) in your profile?

Mitigations to try
1) Upgrade Prefect to the latest 3.4.x
- Several DB and service loop tweaks have landed since 3.3.x. It won’t hurt, and it might help.
2) Keep a single API server instance
- Prefect Server isn’t HA; running multiple API pods/processes can amplify contention on upserts.
3) Right-size SQLAlchemy connection pool (too high can make deadlocks worse)
- Start conservative and adjust:
PREFECT_SERVER_DATABASE_SQLALCHEMY__POOL_SIZE=20
PREFECT_SERVER_DATABASE_SQLALCHEMY__MAX_OVERFLOW=10
PREFECT_SERVER_DATABASE_SQLALCHEMY__POOL_TIMEOUT=30
- If you’ve cranked pool_size up (e.g., 60+), dial it back and observe.
4) Ensure default isolation level (READ COMMITTED)
- Avoid SERIALIZABLE/REPEATABLE READ on the connection or DB. If you’ve customized isolation, undo it.
5) Postgres hygiene
- Make sure autovacuum is healthy; vacuum/analyze tables (especially task_run) to reduce bloat and lock times. (A small script for watching the deadlock counters is at the end of this message.)
- Monitor locking to identify the culprits:
SELECT bl.pid AS blocked_pid, ka.query AS blocking_query, a.query AS blocked_query,
       now() - a.query_start AS blocked_duration
FROM pg_locks bl
JOIN pg_stat_activity a ON a.pid = bl.pid
JOIN pg_locks kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
JOIN pg_stat_activity ka ON ka.pid = kl.pid
WHERE NOT bl.granted;
6) Reduce service churn while debugging
- Increase loop intervals temporarily to lower pressure:
PREFECT_SERVER_SERVICES__CANCELLATION_CLEANUP__LOOP_SECONDS=60
PREFECT_SERVER_SERVICES__RECENT_DEPLOYMENTS_SCHEDULER__LOOP_SECONDS=10
- These warnings are not fatal but indicate the server is lagging.

If the deadlocks persist after upgrading and tuning, we should capture more detail (enable SQLAlchemy echo and Prefect server DEBUG logs) and attach them to the existing issue for visibility. Want me to suggest a minimal set of environment variables for your setup based on your current profile?
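About that upsert: what the TaskRunRecorder runs is a standard Postgres INSERT ... ON CONFLICT DO UPDATE, and two of them racing on related task_run rows is enough for Postgres to detect a cycle and abort one. Here is a rough sketch of the shape using SQLAlchemy’s Postgres insert construct; the table below is a stand-in with only the columns needed for the illustration, not Prefect’s actual model:

import uuid
from datetime import datetime

import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import UUID, insert

metadata = sa.MetaData()

# Stand-in for the real task_run table (illustration only).
task_run = sa.Table(
    "task_run",
    metadata,
    sa.Column("id", UUID(as_uuid=True), primary_key=True),
    sa.Column("flow_run_id", UUID(as_uuid=True)),
    sa.Column("state_timestamp", sa.DateTime(timezone=True)),
)


def upsert_task_run(run_id: uuid.UUID, flow_run_id: uuid.UUID, state_ts: datetime):
    # INSERT ... ON CONFLICT (id) DO UPDATE ... WHERE task_run.state_timestamp < incoming,
    # i.e. only overwrite the existing row if the incoming event is newer.
    stmt = insert(task_run).values(
        id=run_id, flow_run_id=flow_run_id, state_timestamp=state_ts
    )
    return stmt.on_conflict_do_update(
        index_elements=[task_run.c.id],
        set_={
            "flow_run_id": stmt.excluded.flow_run_id,
            "state_timestamp": stmt.excluded.state_timestamp,
        },
        where=task_run.c.state_timestamp < stmt.excluded.state_timestamp,
    )

Two sessions executing this concurrently for overlapping rows can each end up waiting on the other’s in-flight transaction; the ShareLock / speculative-token pairing in your DETAIL lines is what that looks like from Postgres’s side.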
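And for the Postgres hygiene point, a quick way to watch whether deadlocks are actually accumulating and to confirm the effective isolation level. A sketch only: the DSN is a placeholder for your own connection string, and it assumes asyncpg is importable, which it will be in the server’s venv since your stack is already on the asyncpg driver:

import asyncio

import asyncpg  # already present wherever the server's asyncpg driver is installed

DSN = "postgresql://prefect:<password>@localhost:5432/prefect"  # placeholder


async def main() -> None:
    conn = await asyncpg.connect(DSN)
    try:
        # Cumulative counters for this database since the last stats reset;
        # re-run periodically and watch the deltas.
        stats = await conn.fetchrow(
            "SELECT datname, deadlocks, xact_commit, xact_rollback "
            "FROM pg_stat_database WHERE datname = current_database()"
        )
        isolation = await conn.fetchval(
            "SELECT current_setting('default_transaction_isolation')"
        )
        print(dict(stats), "| default_transaction_isolation:", isolation)
    finally:
        await conn.close()


asyncio.run(main())

If the deadlock counter keeps climbing while the isolation level is already read committed, the fix is on the contention side (pool size, a single API instance, upgrading) rather than on configuration.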