# ask-marvin
v
@Marvin I run a self-hosted Prefect 3 instance using docker-compose. I run two containers, one with `no-services` and one with services only. I have 42 flows deployed, polled by a single worker, and run 2 to 5 flows each minute. The services container is eating all available CPU, is that normal? How come it is so greedy? Any tips to reduce the resource consumption of the system?
m
thought for 287.1 seconds
Short answer: it’s not “expected” that a services-only container pegs all CPU for a small workload, but it can happen if the service loops are very aggressive or you’re processing lots of events/schedules. The usual culprits are the scheduler and the event/recording services. You can tune them down, disable what you don’t use, and optionally split or cap resources.

What the “services” are doing
- Scheduler: computes and creates upcoming flow runs from deployments’ schedules
- Task run recorder / event consumer: ingests events emitted by runs and tasks for UI/state consistency
- Foreman: enforces concurrency limits/occupancies
- Late run marker: marks late runs
- Pause expiration: fails/cleans up expired pauses
- Repossessor: reclaims “stuck” runs
- Telemetry: optional metrics sending

Why CPU can spike
- Tight loop intervals for scheduler/recorders (short sleeps mean more DB work and Python processing)
- Many frequent schedules (e.g., per-minute across many deployments)
- Lots of state change events (even 2–5 runs/min with many tasks each can generate a lot of events)
- Using a single services process to do everything (noisy neighbors within that process)
- DB contention (especially if using SQLite) causing retries/spin that show up as CPU in the Python process

Practical ways to reduce services CPU
1) Slow the loops down a bit (most impact, minimal side effects)
- Each service has a loop interval setting you can raise a little without materially impacting latency.
- Environment variables follow this pattern: `PREFECT_SERVER_SERVICES_<SERVICE>_LOOP_SECONDS`
- Example knobs that are typically safe to increase:
  - Scheduler: `PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=5` or `10`
  - Task recorder: `PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_LOOP_SECONDS=5`
  - Repossessor: `PREFECT_SERVER_SERVICES_REPOSSESSOR_LOOP_SECONDS=15`
  - Late runs: `PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=15`
  - Pause expirations: `PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS=60`
  - Foreman: `PREFECT_SERVER_SERVICES_FOREMAN_LOOP_SECONDS=5` (or disable if not using concurrency)
- Full settings reference: Prefect settings reference
2) Disable services you don’t use
- If you don’t use concurrency limits, disable the Foreman: `PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false`
- Telemetry is optional in most self-hosted setups: `PREFECT_SERVER_SERVICES_TELEMETRY_ENABLED=false`
- Pattern: `PREFECT_SERVER_SERVICES_<SERVICE>_ENABLED=false`
- See service list here: Services API reference
3) Reduce schedule density where feasible
- If you have many per-minute schedules, consider staggering or batching them.
- A single deployment with a cron schedule and parameterization can sometimes replace many similar minutely schedules.
4) Use Postgres for the API database (if not already)
- SQLite can cause contention under write-heavy workloads. Postgres typically improves throughput and reduces CPU in the Python process.
5) Cap and/or split resources at the container level
- It’s reasonable to cap the services container so it doesn’t starve the rest of your stack:
Copy code
services:
  prefect-services:
    cpus: "2.0"
    mem_limit: "2g"
- If you want, you can run multiple services containers and split the services (e.g., one for scheduler + recorder, another for the rest) to isolate hotspots. The CLI supports starting services independently — check the help for current flags and service names:
  - Docs: prefect server CLI
- If you prefer a single command to run the API/UI: `prefect server start --no-services` (as you’re doing) and run services in a separate container.
6) Turn down logging verbosity
- `PREFECT_LOGGING_LEVEL=INFO` (or `WARNING`) in the services container can reduce overhead.
7) Upgrade to the latest Prefect 3.x
- We regularly ship performance improvements to services and event processing. Release notes: Docs home → Release notes

Suggested docker-compose snippet
- Start conservative and adjust upward if you notice scheduling latency you care about:
Copy code
environment:
  - PREFECT_LOGGING_LEVEL=INFO
  - PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
  - PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_LOOP_SECONDS=5
  - PREFECT_SERVER_SERVICES_REPOSSESSOR_LOOP_SECONDS=15
  - PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=15
  - PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS=60
  # Disable services you do not use:
  - PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
  - PREFECT_SERVER_SERVICES_TELEMETRY_ENABLED=false
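To double-check which of these settings the running services container actually picked up, something like this should work (a sketch only; `prefect-background` is a placeholder for whatever your compose file calls the services container):
```
# Show the Prefect settings the running services container has resolved,
# filtered down to the background-service knobs set via the environment
docker compose exec prefect-background prefect config view | grep -i "SERVICES"
```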
How to diagnose which service is hot
- Temporarily set `PREFECT_LOGGING_LEVEL=DEBUG` on the services container and watch logs — each service logs its loop ticks and timings, which can reveal the busiest loop.
- If using Postgres, also check DB CPU; the services process can appear busy if the DB is slow and it’s doing many queries.

A few links for reference
- Self-hosting overview: Self-hosted guide
- Services API reference (list of services): prefect.server.services
- Server CLI (including services commands): prefect server CLI
- Settings reference: Settings reference

To help tailor this, could you share:
- Your Prefect version (e.g., 3.1.x)
- DB backend (SQLite vs Postgres)
- Whether you use concurrency limits or automations
- A redacted snippet of the docker-compose for the services container (env and command)
- Any logs showing which service is looping frequently

With that, I can suggest a minimal set of exact settings to change for your setup.
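In the meantime, a rough way to confirm from the host that it really is the services container burning CPU, and which loop is busiest (a sketch; `prefect-background` is a placeholder for your compose service name):
```
# Per-container CPU/memory snapshot
docker stats --no-stream

# Follow the services container's DEBUG logs and watch for one loop dominating
docker compose logs -f prefect-background 2>&1 | grep -iE "scheduler|recorder|triggers"
```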
v
@Marvin I turned on DEBUG level logging in the prefect-services container. I do not see any service being particularly chatty, but I do see some timeout errors like this one:
Copy code
12:14:32.852 | ERROR   | prefect.server.services.recentdeploymentsscheduler - Unexpected error in: TimeoutError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/asyncpg/connection.py", line 2421, in connect
    return await connect_utils._connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/asyncpg/connect_utils.py", line 1049, in _connect
    conn = await _connect_addr(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/asyncpg/connect_utils.py", line 886, in _connect_addr
    return await __connect_addr(params, True, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/asyncpg/connect_utils.py", line 931, in __connect_addr
    tr, pr = await connector
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/asyncpg/connect_utils.py", line 802, in _create_ssl_connection
    tr, pr = await loop.create_connection(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 1159, in create_connection
    transport, protocol = await self._create_connection_transport(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 1192, in _create_connection_transport
    await waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/prefect/server/services/base.py", line 242, in start
    await self.run_once()
  File "/usr/local/lib/python3.12/site-packages/prefect/server/services/scheduler.py", line 112, in run_once
    inserted_runs = await self._insert_scheduled_flow_runs(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/prefect/server/services/scheduler.py", line 300, in _insert_scheduled_flow_runs
    return await models.deployments._insert_scheduled_flow_runs(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/prefect/server/models/deployments.py", line 818, in _insert_scheduled_flow_runs
    await session.execute(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/ext/asyncio/session.py", line 463, in execute
    result = await greenlet_spawn(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 201, in greenlet_spawn
    result = context.throw(*sys.exc_info())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[...]
TimeoutError
Or this one:
Copy code
12:12:00.391 | ERROR   | prefect.server.events.triggers - Error running periodic evaluation
[...]
Can you tell me what's wrong?
m
thought for 124.9 seconds
Thanks for the logs — they’re very telling.

What they indicate
- The error is occurring while the scheduler tries to connect to your database via asyncpg: `prefect.server.services.recentdeploymentsscheduler` → asyncpg connect → TimeoutError/CancelledError
- The “periodic evaluation” error in `prefect.server.events.triggers` is likely the same root cause (DB access timing out)
- When services can’t connect promptly, they’ll keep retrying in their loops. Even if logs aren’t “chatty,” those retries can keep the event loop busy and show up as high CPU in the services container.

Most common root causes
- Postgres isn’t responding quickly enough (load, I/O, or too few connections available)
- Connection pool exhaustion (default pool is small)
- Too short DB timeouts (connect/statement)
- SSL handshake issues (trying SSL when not needed/allowed)
- DNS/network latency between the services container and Postgres
- SQLite (if used) under write pressure — but your stack trace shows asyncpg, so you’re on Postgres

What to check and change
1) Confirm DB and connectivity basics
- Ensure the URL is async and points to the right host: `PREFECT_SERVER_DATABASE_CONNECTION_URL=postgresql+asyncpg://user:pass@postgres:5432/prefect`
- If Postgres is on the same Docker network and you don’t require TLS, disable SSL to avoid slow handshakes. Append `?sslmode=disable` to the connection URL:
  - `postgresql+asyncpg://user:pass@postgres:5432/prefect?sslmode=disable`
- Verify Postgres logs around the same timestamps — look for connection timeouts, too many connections, or slow queries.
2) Increase connection/pool settings for the services container
- Prefect 3 exposes these as settings (mapped to AsyncPostgresConfiguration). In the services container, set the “server” variants:
Copy code
# services container
PREFECT_SERVER_DATABASE_CONNECTION_URL=postgresql+asyncpg://user:pass@postgres:5432/prefect?sslmode=disable
PREFECT_SERVER_DATABASE_CONNECTION_TIMEOUT=30   # connect timeout (seconds)
PREFECT_SERVER_DATABASE_TIMEOUT=60              # statement timeout (seconds)
PREFECT_SERVER_DATABASE_SQLALCHEMY_POOL_SIZE=10
PREFECT_SERVER_DATABASE_SQLALCHEMY_MAX_OVERFLOW=20
- In the API/UI container, use the same but with the `PREFECT_API_DATABASE_*` prefix.
- References:
  - Settings reference
  - Database configuration API
3) If you don’t need TLS between containers
- Disabling SSL (see `sslmode=disable`) often removes connect latency and handshake failures inside Compose networks.
4) Ensure Postgres can accept the load
- Check `max_connections` and whether you’re hitting it.
- Monitor Postgres CPU and I/O. If it’s pegged, the services will time out waiting for connections.
- If you have many Prefect components connecting, consider pgbouncer, but most small installs are fine with a larger pool and max_overflow.
5) Slow service loops slightly to reduce pressure
- Raising loop intervals reduces how aggressively services re-attempt DB work:
Copy code
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_LOOP_SECONDS=5
PREFECT_SERVER_SERVICES_REPOSSESSOR_LOOP_SECONDS=15
PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=15
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS=60
# If you do not use concurrency limits:
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
- Docs: Services reference and Settings reference
6) Cap or split the services container if needed
- It’s okay to cap CPU so retries don’t starve the node while you tune:
Copy code
cpus: "2.0"
mem_limit: "2g"
- You can also split services into separate containers to isolate hot loops if desired.

Why this should help
- Increasing `connection_timeout` and the pool size gives the scheduler/trigger evaluators more headroom to obtain connections.
- Disabling SSL (if not required) avoids slow TLS handshakes you’re seeing in `_create_ssl_connection`.
- Slightly slower loops + fewer retries reduce CPU churn when the DB is slow.
- Postgres-side checks ensure there’s not a true DB bottleneck causing these timeouts.

If you can share:
- Your services container env block (redacted URL/user)
- Postgres version and how it’s deployed in compose
- Whether you need TLS between containers

I can suggest the exact minimal set of env vars to add to your compose file.
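As a quick check on the `max_connections` headroom mentioned above, a sketch (it assumes the database service is called `postgres` and the user/database are both `prefect`, as is typical for a compose setup like this):
```
# Current connection limit vs. open connections, grouped by state
docker compose exec postgres psql -U prefect -d prefect -c "SHOW max_connections;"
docker compose exec postgres psql -U prefect -d prefect -c \
  "SELECT state, count(*) AS open_connections FROM pg_stat_activity GROUP BY state;"
```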
v
@Marvin there are indeed slow queries, commonly over 10 seconds. This is strange as it happens even on a freshly wiped database (no logs, events or flow_runs history). What could cause a fresh Prefect install to have slow queries? This is the docker-compose file I use:
Copy code
services:
  postgres:
    image: postgres:15
    container_name: prefect-db
    environment:
      POSTGRES_USER: prefect
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?error}
      POSTGRES_DB: prefect
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U prefect"]
      interval: 10s
      timeout: 5s
      retries: 5
    ports:
      - "5432:5432"
    restart: always

  redis:
    image: redis:7

  prefect-api:
    image: prefecthq/prefect:3-python3.12
    container_name: prefect-server
    depends_on:
      postgres:
        condition: service_healthy
      migrate:
        condition: service_completed_successfully
      redis:
        condition: service_started
    environment:
      # Prefect API and UI configuration
      PREFECT_API_URL: ${PREFECT_API_URL:?error}
      PREFECT_UI_URL: ${PREFECT_UI_URL:?error}
      
      # Database configuration
      PREFECT_SERVER_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect>:${POSTGRES_PASSWORD:?error}@postgres:5432/prefect
      PREFECT_API_DATABASE_MIGRATE_ON_START: "false"

      # Redis
      PREFECT_MESSAGING_BROKER: prefect_redis.messaging
      PREFECT_MESSAGING_CACHE: prefect_redis.messaging
      PREFECT_SERVER_EVENTS_CAUSAL_ORDERING: prefect_redis.ordering
      PREFECT_SERVER_CONCURRENCY_LEASE_STORAGE: prefect_redis.lease_storage
      PREFECT_REDIS_MESSAGING_HOST: redis
      PREFECT_REDIS_MESSAGING_PORT: "6379"
    ports:
      - "4200:4200"
    restart: always
    command: ["prefect", "server", "start", "--host", "0.0.0.0", "--no-services"]

  migrate:
    image: prefecthq/prefect:3-python3.12
    depends_on:
      postgres:
        condition: service_healthy
    command: prefect server database upgrade -y
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect>:${POSTGRES_PASSWORD:?error}@postgres:5432/prefect

  prefect-background:
    image: prefecthq/prefect:3-python3.12
    container_name: prefect-background-services
    depends_on:
      postgres:
        condition: service_healthy
      migrate:
        condition: service_completed_successfully
      redis:
        condition: service_started
    command: prefect server services start
    environment:
      PREFECT_LOGGING_LEVEL: "DEBUG"
      PREFECT_API_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect>:${POSTGRES_PASSWORD:?error}@postgres:5432/prefect
      PREFECT_API_DATABASE_MIGRATE_ON_START: "false"
      PREFECT_MESSAGING_BROKER: prefect_redis.messaging
      PREFECT_MESSAGING_CACHE: prefect_redis.messaging
      PREFECT_SERVER_EVENTS_CAUSAL_ORDERING: prefect_redis.ordering
      PREFECT_SERVER_CONCURRENCY_LEASE_STORAGE: prefect_redis.lease_storage
      PREFECT_REDIS_MESSAGING_HOST: redis
      PREFECT_REDIS_MESSAGING_PORT: "6379"

volumes:
  postgres_data:
m
thought for 80.7 seconds
Thanks — this helps narrow it down. From your logs:
- The scheduler is failing while trying to connect/run DB work (asyncpg TimeoutError/CancelledError)
- Events/triggers periodic evaluation is erroring too — also DB-driven
- You also observe “slow queries” (~10s+)

A few issues in your compose stand out that can cause exactly this behavior.
1) API vs SERVER DB settings are swapped
- In your API container you set `PREFECT_SERVER_DATABASE_CONNECTION_URL`
- In your services container you set `PREFECT_API_DATABASE_CONNECTION_URL`
In Prefect 3.x:
- The API/UI process reads `PREFECT_API_DATABASE_*`
- The background services process reads `PREFECT_SERVER_DATABASE_*`
When these are swapped, each process may fall back to defaults or attempt SSL handshakes unexpectedly, causing connect timeouts and slow queries.
2) SSL handshake overhead inside Docker
Your stack trace shows `_create_ssl_connection` during connect. Within a trusted Docker network you can usually disable SSL to avoid slow/aborted handshakes.
3) Default connection pool and timeout values can be too tight
Defaults are conservative (small pool and short timeouts) and can lead to statement/connection timeouts under modest load or when the DB is starved of CPU.
4) CPU starvation feedback loop
If the services container pegs CPU, Postgres can’t process queries fast enough, causing timeouts that make services retry aggressively, which keeps the CPU pegged. Give Postgres some reserved CPU and cap the services container.

What I recommend changing immediately
A) Fix env var prefixes and disable SSL on the internal network
- API/UI container should use PREFECT_API_DATABASE_*.
- Services container should use PREFECT_SERVER_DATABASE_*.
- Add `sslmode=disable` to the asyncpg URL (safe on the internal Docker network).
- Increase timeouts and pool sizes a bit.
Example adjustments:
Copy code
# prefect-api
environment:
  PREFECT_API_URL: ${PREFECT_API_URL:?error}
  PREFECT_UI_URL: ${PREFECT_UI_URL:?error}

  # Use API_* here
  PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?sslmode=disable
  PREFECT_API_DATABASE_CONNECTION_TIMEOUT: "30"      # connect timeout (s)
  PREFECT_API_DATABASE_TIMEOUT: "60"                 # statement timeout (s)
  PREFECT_API_DATABASE_SQLALCHEMY_POOL_SIZE: "10"
  PREFECT_API_DATABASE_SQLALCHEMY_MAX_OVERFLOW: "20"

  # Redis config unchanged...
Copy code
# prefect-background (services)
environment:
  PREFECT_LOGGING_LEVEL: "DEBUG"

  # Use SERVER_* here
  PREFECT_SERVER_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?sslmode=disable
  PREFECT_SERVER_DATABASE_CONNECTION_TIMEOUT: "30"
  PREFECT_SERVER_DATABASE_TIMEOUT: "60"
  PREFECT_SERVER_DATABASE_SQLALCHEMY_POOL_SIZE: "10"
  PREFECT_SERVER_DATABASE_SQLALCHEMY_MAX_OVERFLOW: "20"

  # Redis config unchanged...
B) Give Postgres CPU and cap services CPU
This prevents the starvation loop:
Copy code
services:
  postgres:
    # ...
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 2g
        reservations:
          cpus: "1.0"

  prefect-background:
    # ...
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1g
If you’re not using Swarm/Compose v3 deploy, you can use:
Copy code
cpus: "1.0"
mem_limit: "1g"
C) Slightly slow down service loops (reduces “retry churn”)
These won’t materially affect latency for your scale, but reduce pressure while you tune DB:
Copy code
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_LOOP_SECONDS=5
PREFECT_SERVER_SERVICES_REPOSSESSOR_LOOP_SECONDS=15
PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=15
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS=60
# If you don't use concurrency limits:
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
D) Enable Postgres slow query logging while you test
This will confirm what’s slow and whether you’re hitting statement timeouts vs genuine slowness:
Copy code
postgres:
  image: postgres:15
  command:
    - "postgres"
    - "-c"
    - "log_min_duration_statement=1000"
    - "-c"
    - "max_connections=200"
    - "-c"
    - "shared_buffers=256MB"
Why a fresh DB can still be “slow”
- If the services container hogs CPU, Postgres can’t respond promptly even for simple queries (connects/metadata) — you’ll see >10s waits and timeouts.
- SSL negotiation attempts on every connect can add latency or fail if misconfigured.
- Connection pool defaults can force more connects under load; with short timeouts that manifests as timeouts rather than true long-running queries.
- If API/SERVER DB settings are mixed up, one of the processes may be defaulting to unexpected behavior (including SSL attempts, tiny timeouts).

References
- Settings reference: Prefect settings reference
- Database configuration API (shows the timeout/pool knobs): Database configuration API

If you can:
- Apply the prefix fix + sslmode=disable + pool/timeouts
- Cap services CPU and give Postgres a bit more
- Share a snippet of Postgres slow query logs (just one example) and whether the timeouts go away

I can help confirm if anything else needs tuning (e.g., specific indexes or loop intervals).
v
@Marvin With `?sslmode=disable` appended to the connection string I get `TypeError: connect() got an unexpected keyword argument 'sslmode'`
m
thought for 59.1 seconds
Good catch — that error points to the driver.

What’s wrong
- Your connection URL uses asyncpg (`postgresql+asyncpg://...`).
- `sslmode=` is a psycopg/psql parameter. asyncpg does not accept `sslmode` and raises: “connect() got an unexpected keyword argument 'sslmode'”.
- For asyncpg, use `ssl=false` (or omit SSL entirely) instead of `sslmode=disable`.

What to use instead
- Disable SSL on the internal Docker network with `?ssl=false`.
Also, make sure the API/UI and services containers use the correct prefixes (they’re swapped in your compose):
- In prefect-api (API/UI):
Copy code
PREFECT_API_DATABASE_CONNECTION_URL=postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?ssl=false
PREFECT_API_DATABASE_CONNECTION_TIMEOUT=30
PREFECT_API_DATABASE_TIMEOUT=60
PREFECT_API_DATABASE_SQLALCHEMY_POOL_SIZE=10
PREFECT_API_DATABASE_SQLALCHEMY_MAX_OVERFLOW=20
- In prefect-background (services):
Copy code
PREFECT_SERVER_DATABASE_CONNECTION_URL=postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?ssl=false
PREFECT_SERVER_DATABASE_CONNECTION_TIMEOUT=30
PREFECT_SERVER_DATABASE_TIMEOUT=60
PREFECT_SERVER_DATABASE_SQLALCHEMY_POOL_SIZE=10
PREFECT_SERVER_DATABASE_SQLALCHEMY_MAX_OVERFLOW=20
Notes
- Remove the angle brackets in your redacted example — they’ll break parsing (e.g., use `postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?ssl=false`).
- Keeping the pool/timeouts from above + `ssl=false` should eliminate the connect/handshake delays and reduce those 10s “slow queries” that were really connection waits.

References
- Prefect settings reference: Settings reference
- Prefect DB configuration API (shows asyncpg options): Database configuration API

If you update the URLs as above and still see timeouts, send one example from Postgres logs with duration and query text — we can dig into any remaining slow statements.
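As a quick sanity check that each container received the URL you intended, something like this should work (service names are the ones from your compose; note this prints credentials to your terminal):
```
# Confirm the env vars the running containers actually see
docker compose exec prefect-api env | grep DATABASE_CONNECTION_URL
docker compose exec prefect-background env | grep DATABASE_CONNECTION_URL
```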
v
@Marvin I see a lot of queries that are `idle in transaction` in Postgres. See:
Copy code
> SELECT datname, state, (now() - pg_stat_activity.query_start) as time_running, query FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '1 second' ORDER BY 3;
 datname |        state        |  time_running   |                                                                                                                                                                                                                                                                                                                                                               query                                                                                                                                                                                                                                                                                                                                                                
---------+---------------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 prefect | idle in transaction | 00:00:17.271294 | DELETE FROM automation_event_follower WHERE automation_event_follower.scope = $1::VARCHAR AND automation_event_follower.follower_event_id = $2::UUID
 prefect | idle in transaction | 00:00:17.276217 | SELECT work_pool.id AS work_pool_id, work_pool.name AS work_pool_name, work_pool.description AS work_pool_description, work_pool.type AS work_pool_type, work_pool.base_job_template AS work_pool_base_job_template, work_pool.is_paused AS work_pool_is_paused, work_pool.default_queue_id AS work_pool_default_queue_id, work_pool.concurrency_limit AS work_pool_concurrency_limit, work_pool.status AS work_pool_status, work_pool.last_transitioned_status_at AS work_pool_last_transitioned_status_at, work_pool.last_status_event_id AS work_pool_last_status_event_id, work_pool.storage_configuration AS work_pool_storage_configuration, work_pool.created AS work_pool_created, work_pool.updated AS work_pool_updated +
         |                     |                 | FROM work_pool                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +
         |                     |                 | WHERE work_pool.id IN ($1::UUID)
 prefect | idle in transaction | 00:00:17.27632  | ;
 prefect | idle in transaction | 00:00:17.276367 | ;
 prefect | idle in transaction | 00:00:17.276432 | BEGIN;
 prefect | active              | 00:00:17.276509 | ;
 prefect | idle in transaction | 00:00:17.289364 | ;
 prefect | idle in transaction | 00:00:17.289681 | BEGIN;
 prefect | idle in transaction | 00:00:17.300685 | SELECT deployment_schedule.deployment_id, deployment_schedule.schedule, deployment_schedule.active, deployment_schedule.max_scheduled_runs, deployment_schedule.parameters, deployment_schedule.slug, deployment_schedule.id, deployment_schedule.created, deployment_schedule.updated                                                                                                                                                                                                                                                                                                                                                                                                                                            +
         |                     |                 | FROM deployment_schedule                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
         |                     |                 | WHERE deployment_schedule.deployment_id = $1::UUID AND deployment_schedule.active IS true ORDER BY deployment_schedule.updated DESC
 prefect | idle                | 00:02:26.987803 | ROLLBACK;
 prefect | idle                | 00:24:24.62834  | LISTEN "prefect_automation_changes"
 prefect | idle                | 00:24:30.749047 | LISTEN "prefect_automation_changes"
m
thought for 326.5 seconds
Thanks for the details — this is helpful. What you’re seeing breaks down into two buckets:

1) LISTEN "prefect_automation_changes" (state: idle)
- This is expected. Prefect’s automations subsystem uses Postgres LISTEN/NOTIFY to react to automation changes without polling.
- The listener is implemented here: postgres_listener.py and wired into automations here: events/triggers.py (see `listen_for_automation_changes`).
- These connections will typically sit “idle” for a long time. They are not in a transaction and are harmless.

2) idle in transaction sessions with queries like DELETE FROM automation_event_follower…, SELECT … FROM work_pool …
- “idle in transaction” means a transaction was opened (`BEGIN`) and not yet committed/rolled back. Short-lived idles (a few seconds) can occur during normal request handling, but long-lived ones (minutes+) are not expected and can cause autovacuum bloat or lock contention.
- Prefect’s API uses async SQLAlchemy sessions and transactions under the hood (see DB config: database/configurations.py and ORM models including `AutomationEventFollower`: database/orm_models.py). Those DELETEs you see are part of the automations/event follower cleanup path.

A few targeted recommendations
- First, determine if these are transient or persistent. Run:
Copy code
SELECT pid, application_name, state, xact_start, (now() - xact_start) AS xact_age, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE datname = 'prefect' AND state = 'idle in transaction'
    ORDER BY xact_start;
- If you see “idle in transaction” persisting for minutes/hours, that’s worth action. If it’s 10–30s and clears, that’s typically fine.
- Protect the database with Postgres’ idle-in-transaction timeout. Configure `idle_in_transaction_session_timeout` at the DB or role level to automatically terminate sessions that idle in a transaction too long. Many teams use 60s–5m:
Copy code
ALTER DATABASE prefect SET idle_in_transaction_session_timeout = '60s';
    -- or at the role level:
    ALTER ROLE <prefect_db_user> SET idle_in_transaction_session_timeout = '60s';
- This prevents bloat from orphaned transactions if anything downstream misbehaves.
- Keep Prefect up to date. If you’re not on the latest 3.x, upgrade. There are ongoing improvements in the automations/events path and DB handling. Latest notes: Prefect releases (3.4.15 notes: link).
- Tune connection pooling if you see many concurrent sessions. Prefect’s async Postgres configuration supports pool and overflow limits: `sqlalchemy_pool_size`, `sqlalchemy_max_overflow`, etc. See the config class (parameters listed here): AsyncPostgresConfiguration. If you’re seeing a lot of concurrent API traffic and spikes of short-lived transactions, reducing pool size/overflow or increasing it (depending on your DB capacity) can stabilize session churn.
- Sanity-check Postgres version. Prefect 3.x docs recommend Postgres 14.9+ (documented recently: release notes).

If you’re still seeing a buildup of “idle in transaction” that lasts for minutes+, please share:
- Prefect version
- Self-hosted vs. Cloud
- Postgres version
- How long these idle-in-transaction sessions persist and whether they correlate with specific API calls or background services (e.g., automations changes, scheduler)
- Any logs indicating connection pool exhaustion or slow queries

With that info, I can help pinpoint whether a particular API route or background service is leaving transactions open longer than expected and suggest a precise fix.
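If you set the timeout above, a minimal way to verify it applies to new sessions (assuming the `postgres`/`prefect` names from your compose):
```
# New connections to the prefect database should report the override
docker compose exec postgres psql -U prefect -d prefect -c "SHOW idle_in_transaction_session_timeout;"
```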
v
@Marvin I also see active transactions that are really long. An example:
Copy code
datname | state  |  time_running   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              query                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
---------+--------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 prefect | active | 00:00:28.720579 | INSERT INTO task_run (flow_run_id, task_key, dynamic_key, flow_run_run_count, empirical_policy, task_inputs, tags, labels, state_id, name, state_type, state_name, state_timestamp, run_count, expected_start_time, total_run_time, id, created, updated) VALUES ($1::UUID, $2::VARCHAR, $3::VARCHAR, $4::INTEGER, $5, $6, $7::JSONB, $8::JSONB, $9::UUID, $10::VARCHAR, $11::state_type, $12::VARCHAR, $13::TIMESTAMP WITH TIME ZONE, $14::INTEGER, $15::TIMESTAMP WITH TIME ZONE, $16::INTERVAL, $17::UUID, $18::TIMESTAMP WITH TIME ZONE, $19::TIMESTAMP WITH TIME ZONE) ON CONFLICT (id) DO UPDATE SET flow_run_id = $20::UUID, task_key = $21::VARCHAR, dynamic_key = $22::VARCHAR, flow_run_run_count = $23::INTEGER, empirical_policy = $24, task_inputs = $25, tags = $26::JSONB, labels = $27::JSONB, state_id = $28::UUID, name = $29::VARCHAR, state_type = $30::state_type, state_name = $31::VARCHAR, state_timestamp = $32::TIMESTAMP WITH TIME ZONE, run_count = $33::INTEGER, expected_start_time = $34::TIMESTAMP WITH TIME ZONE, total_run_ti
m
Your message was too long, here's your message at the allowed limit: I also see active transactions that are really long An example ``` datname | state | time_running | query ---------+--------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- prefect | active | 00 00 09 089731 | INSERT INTO task_run (flow_run_id task_key dynamic_key flow_run_run_count empirical_policy task_inputs tags labels state_id name state_type state_name state_timestamp run_count expected_start_time total_run_time id created updated) VALUES ($1 UUID $2 VARCHAR $3 VARCHAR $4 INTEGER $5 $6 $7 JSONB $8 JSONB $9 UUID $10 VARCHAR $11 state_type $12 VARCHAR $13 TIMESTAMP WITH TIME ZONE $14 INTEGER $15 TIMESTAMP WITH TIME ZONE $16 INTERVAL $17 UUID $18 TIMESTAMP WITH TIME ZONE $19 TIMESTAMP WITH TIME ZONE) ON CONFLICT (id) DO UPDATE SET flow_run_id = $20 UUID task_key = $21 VARCHAR dynamic_key = $22 VARCHAR flow_run_run_count = $23 INTEGER empirical_policy = $24 task_inputs = $25 tags = $26 JSONB labels = $27 JSONB state_id = $28 UUID name = $29 VARCHAR state_type = $30 state_type state_name = $31 VARCHAR state_timestamp = $32 TIMESTAMP WITH TIME ZONE run_count = $33 INTEGER expected_start_time = $34 TIMESTAMP WITH TIME ZONE total_run_ti prefect | active | 00 00 10 07376 | UPDATE concurrency_limit_v2 SET active_slots=CASE WHEN (greatest($1 INTEGER concurrency_limit_v2 active_slots - floor(concurrency_limit_v2 slot_decay_per_second * EXTRACT(epoch FROM timezone($2 VARCHAR now()) - timezone($3 VARCHAR concurrency_limit_v2 updated)))) - $4 INTEGER < $5 INTEGER) THEN $6 INTEGER ELSE greatest($7 INTEGER concurrency_limit_v2 active_slots - floor(concurrency_limit_v2 slot_decay_per_second * EXTRACT(epoch FROM timezone($8 VARCHAR now()) - timezone($9 VARCHAR concurrency_limit_v2 updated)))) - $10 INTEGER END denied_slots=greatest($11 INTEGER concurrency_limit_v2 denied_slots - floor(CASE WHEN (concurrency_limit_v2 slot_decay_per_second > $12 FLOAT) THEN concurrency_limit_v2 slot_decay_per_second ELSE $13 FLOAT / CAST(CAST(concurrency_limit_v2 avg_slot_occupancy_seconds AS FLOAT) AS FLOAT) END * EXTRACT(epoch FROM timezone($14 VARCHAR now()) - timezone($15 VARCHAR concurrency_limit_v2 updated)))) updated=now() WHERE concurrency_limit_v2 id IN ($16 UUID) AND c prefect | active | 00 00 53 235656 | SELECT automation_event_follower follower + | | | FROM automation_event_follower + | | | WHERE automation_event_follower scope = $1 VARCHAR AND automation_event_follower received < $2 TIMESTAMP WITH TIME ZONE prefect | active | 00 00 53 235954 | INSERT INTO events 
(occurred event resource_id resource related_resource_ids related payload received recorded follows id created updated) VALUES ($1 TIMESTAMP WITH TIME ZONE $2 VARCHAR $3 VARCHAR $4 JSONB $5 JSONB $6 JSONB $7 JSONB $8 TIMESTAMP WITH TIME ZONE $9 TIMESTAMP WITH TIME ZONE $10 UUID $11 UUID $12 TIMESTAMP WITH TIME ZONE $13 TIMESTAMP WITH TIME ZONE) ON CONFLICT DO NOTHING RETURNING events id prefect | active | 00 00 53 241906 | SELECT flow
_run flow_id flow_run deployment_id flow_run work_queue_name flow_run flow_version flow_run deployment_version flow_run parameters flow_run idempotency_key flow_run context flow_run empirical_policy flow_run tags flow_run labels flow_run created_by flow_run infrastructure_pid flow_run job_variables flow_run infrastructure_document_id flow_run parent_task_run_id flow_run auto_scheduled flow_run state_id flow_run work_queue_id flow_run name flow_run state_type flow_run state_name flow_run state_timestamp flow_run run_count flow_run expected_start_time flow_run next_scheduled_start_time flow_run start_time flow_run end_time flow_run total_run_time flow_run id flow_run created flow_run updated + | | | FROM flow_run + | | | WHERE flow_run state_type = $1 state_type AND flow_run end_time IS NOT NULL AND flow_run end_time >= $2 TIMESTAMP WITH TIME ZONE AND flow_run id
thought for 83.4 seconds
Got it — thanks for the extra detail. That INSERT/UPSERT into `task_run` is a very common write path in Prefect (created and updated on task state transitions), but if it’s “active” for a long time it’s usually waiting on a lock or I/O rather than actually doing work.

What to check right now
- Is it actually waiting on a lock?
Copy code
SELECT pid, state, wait_event_type, wait_event,
         now() - query_start AS age, query
  FROM pg_stat_activity
  WHERE datname = 'prefect' AND pid = <that_pid>;
If `wait_event_type = 'Lock'`, it’s blocked by another session.
- Who is blocking it?
Copy code
SELECT
    a.pid AS blocked_pid,
    a.wait_event_type, a.wait_event,
    now() - a.query_start AS blocked_for,
    a.query AS blocked_query,
    b.pid AS blocking_pid,
    now() - b.query_start AS blocking_for,
    b.state AS blocking_state,
    b.query AS blocking_query,
    l.relation::regclass AS blocked_relation,
    l.locktype, l.mode
  FROM pg_stat_activity a
  JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted
  JOIN pg_locks bl ON bl.locktype = l.locktype
    AND bl.database IS NOT DISTINCT FROM l.database
    AND bl.relation IS NOT DISTINCT FROM l.relation
    AND bl.page IS NOT DISTINCT FROM l.page
    AND bl.tuple IS NOT DISTINCT FROM l.tuple
    AND bl.virtualxid IS NOT DISTINCT FROM l.virtualxid
    AND bl.transactionid IS NOT DISTINCT FROM l.transactionid
    AND bl.classid IS NOT DISTINCT FROM l.classid
    AND bl.objid IS NOT DISTINCT FROM l.objid
    AND bl.objsubid IS NOT DISTINCT FROM l.objsubid
  JOIN pg_stat_activity b ON b.pid = bl.pid
  WHERE a.datname = 'prefect' AND a.pid = <that_pid>;
This will tell you exactly which session/table is holding the lock.
- See all blockers at once:
Copy code
SELECT pid,
         pg_blocking_pids(pid) AS blocking_pids,
         state, wait_event_type, wait_event,
         now() - query_start AS age,
         query
  FROM pg_stat_activity
  WHERE datname = 'prefect' AND (wait_event_type IS NOT NULL OR state='idle in transaction')
  ORDER BY age DESC;
Common root causes we see
- An “idle in transaction” session is the blocker. If a request opened a transaction and never committed/rolled back, it can hold row/table locks that block the `task_run` UPSERT. This matches the earlier “idle in transaction” you’re seeing.
- FK contention: `task_run.flow_run_id` has an FK to `flow_run`. If another transaction is updating/deleting the referenced `flow_run` row, your insert can wait on that row lock.
- Schema changes/migrations or maintenance running concurrently.
- Autovacuum not keeping up on hot tables (e.g., `task_run`, `task_run_state`, `flow_run`, `flow_run_state`) leading to bloat and slower constraint/index checks.

Immediate mitigations
- Set a safety net for idle transactions:
Copy code
ALTER DATABASE prefect SET idle_in_transaction_session_timeout = '60s';
  -- or at the role level:
  ALTER ROLE <prefect_user> SET idle_in_transaction_session_timeout = '60s';
This prevents idle transactions from lingering and blocking writers.
- Optionally set a statement timeout to prevent runaway writes:
Copy code
ALTER DATABASE prefect SET statement_timeout = '60s';
Use with care — it will surface as errors to the client if statements legitimately take longer.
- If you find a clear blocker, you can terminate it:
Copy code
SELECT pg_terminate_backend(<blocking_pid>);
Only do this if you know it’s safe for your workload.

Longer-term checks
- Make sure you’re on a current Prefect 3.x release; we regularly improve DB interactions. Releases: Prefect releases. Task/flow ORM and DB config are here:
  - ORM models (includes TaskRun/FlowRun definitions): orm_models.py
  - Async Postgres configuration (pooling settings available): database/configurations.py
- Ensure Postgres 14.9+ (documented recently: 3.4.15 notes).
- Tune autovacuum on hot Prefect tables. Consider lowering scale factors and/or increasing cost limits so VACUUM runs earlier/more often on: `task_run`, `task_run_state`, `flow_run`, `flow_run_state`, `log`
Example inspection:
Copy code
SELECT relname, n_live_tup, n_dead_tup, vacuum_count, autovacuum_count
  FROM pg_stat_user_tables
  WHERE relname IN ('task_run','task_run_state','flow_run','flow_run_state','log')
  ORDER BY n_dead_tup DESC;
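If you also want the current autovacuum knobs next to those numbers, a sketch (same container/user/database assumptions as before):
```
# Global autovacuum thresholds currently in effect
docker compose exec postgres psql -U prefect -d prefect -c \
  "SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';"
```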
- If you have high API/worker concurrency, right-size the SQLAlchemy pool to match DB capacity to reduce connection/lock thrash. See `sqlalchemy_pool_size` and `sqlalchemy_max_overflow` in AsyncPostgresConfiguration.

If you can share:
- Prefect version, self-hosted or Cloud
- Postgres version
- The output of the blocker queries above (wait_event_type, relation, blocking_pid/blocking_query)
- Typical duration of these “active” sessions and whether they correlate with spikes in flow/task activity

…I can help pinpoint exactly which Prefect path (e.g., a specific API route, automations cleanup, scheduler) is holding the lock and suggest a targeted fix.
v
@Marvin I caught an instance of an `INSERT INTO events` active for 50 seconds and the `wait_event` is ClientRead
m
thought for 50.3 seconds
Thanks — that’s a great clue. In Postgres, wait_event = ClientRead means the backend is waiting to read from the client socket. In other words, the database isn’t blocked on locks or I/O; it’s your client (the Prefect API/worker process or any emitter calling the events endpoint) that hasn’t completed the request/transaction yet.

What this usually implies
- The client started a transaction (e.g., INSERT INTO events …) and then stalled before sending the next message or finishing the commit. This keeps the transaction “active” on the DB side.
- Common causes: API handler blocked on CPU or event loop, network hiccups between API and DB, slow proxy (e.g., PgBouncer/LB), or the client sending/receiving large payloads slowly.

Immediate next checks
- Identify the client process and where it runs:
Copy code
SELECT pid, application_name, client_addr, usename, backend_start, query_start,
         state, wait_event_type, wait_event, query
  FROM pg_stat_activity
  WHERE pid = <pid>;
- application_name is settable by Prefect (see `connection_app_name` in the DB config: AsyncPostgresConfiguration). This helps attribute sessions to API pods/workers.
- Correlate with API logs/metrics around that time:
  - Check the Prefect API container logs for slow handler warnings or exceptions around that timestamp.
  - Look for CPU saturation, event loop blocking, or GC pauses on the API pod/host.
  - If you’re using PgBouncer or a reverse proxy, check its logs too.

Network/infra considerations
- If there’s a proxy between API and Postgres (PgBouncer, LB, sidecar), ClientRead can be the database waiting on the proxy which is waiting on the API. Check for idle connection timeouts and buffering limits there.
- Enable TCP keepalives on Postgres to detect dead or stuck clients faster:
Copy code
ALTER SYSTEM SET tcp_keepalives_idle = 30;
  ALTER SYSTEM SET tcp_keepalives_interval = 10;
  ALTER SYSTEM SET tcp_keepalives_count = 3;
  SELECT pg_reload_conf();
- Consider a conservative statement timeout to surface these stalls as errors instead of long-lived transactions:
Copy code
ALTER DATABASE prefect SET statement_timeout = '60s';
Be aware this will error out long-running legitimate statements, so tune by environment.

Prefect specifics you can review
- The events write path is handled by the API endpoint here: server/api/events.py (function `create_events`), and events clients are here: server/events/clients.py. It should open a session, insert, and commit promptly. If the API process is starved or blocked mid-request, the transaction can linger.
- DB connection/pooling settings (pool size, overflow, statement cache, application_name) are here: database/configurations.py. Mismatch between pool size and DB capacity can amplify backpressure.
- Make sure you’re on a current 3.x release: Prefect releases. We continue to improve DB interaction and events handling.

If you can share:
- Prefect version and how the API is deployed (k8s/gunicorn/uvicorn workers, PgBouncer yes/no)
- Postgres version
- Whether the ClientRead cases align with load spikes (many event emissions) or specific pods
- application_name/client_addr for those PIDs and concurrent API logs

…I can help pinpoint whether this is event-loop blocking in the API, proxy behavior, or something else. In the meantime, adding a reasonable `statement_timeout` and enabling TCP keepalives will keep these from accumulating and aid diagnosis.
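To attribute the lingering sessions to a specific client, a sketch that groups them by application name and source address (again assuming the `postgres`/`prefect` names from your compose):
```
# Which clients own the sessions, and in what state?
docker compose exec postgres psql -U prefect -d prefect -c \
  "SELECT application_name, client_addr, state, count(*)
     FROM pg_stat_activity
    WHERE datname = 'prefect'
    GROUP BY 1, 2, 3
    ORDER BY 4 DESC;"
```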
v
@Marvin How can I split services between two containers, as you suggested, using one container for scheduler and recorder and one for the rest, using docker compose?
m
thought for 284.6 seconds
Great question — yes, you can split Prefect Server services across multiple containers with Docker Compose. In Prefect 3.x, you can:
- Run the API/UI (webserver) alone with `prefect server start --no-services`
- Run only the background services with `prefect server services start`
- Enable/disable specific services via environment variables

Below is a working Compose example that:
- Starts Postgres
- Runs the API/UI in one container
- Splits services across two containers:
  - services-a: Scheduler + TaskRunRecorder
  - services-b: all other services
Copy code
version: "3.9"

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: prefect
      POSTGRES_PASSWORD: prefect
      POSTGRES_DB: prefect
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U prefect -d prefect"]
      interval: 5s
      timeout: 5s
      retries: 20

  # API/UI only (no background services)
  prefect-api:
    image: prefecthq/prefect:3-latest
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:prefect@postgres:5432/prefect
      PREFECT_UI_ENABLED: "true"
      PREFECT_SERVER_LOGGING_LEVEL: INFO
      # Keep API/UI in this container; services run elsewhere
    command: >
      bash -lc "prefect server start --no-services --host 0.0.0.0 --port 4200"
    ports:
      - "4200:4200"
    restart: unless-stopped

  # Services container A: Scheduler + TaskRunRecorder only
  prefect-services-a:
    image: prefecthq/prefect:3-latest
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:prefect@postgres:5432/prefect

      # Enable only the services you want in this container
      PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED: "true"

      # Disable everything else here
      PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED: "false"
      PREFECT_SERVER_SERVICES_FOREMAN_ENABLED: "false"
      PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED: "false"
      PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED: "false"
      PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED: "false"
      PREFECT_API_EVENTS_STREAM_OUT_ENABLED: "false"  # Distributor
      PREFECT_SERVER_ANALYTICS_ENABLED: "false"       # Telemetry
    command: >
      bash -lc "prefect server services start"
    restart: unless-stopped

  # Services container B: everything else (no Scheduler/Recorder)
  prefect-services-b:
    image: prefecthq/prefect:3-latest
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:prefect@postgres:5432/prefect

      # Disable scheduler + recorder in this container
      PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED: "false"

      # Enable the rest here (tune to your needs)
      PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED: "true"
      PREFECT_SERVER_SERVICES_FOREMAN_ENABLED: "true"
      PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED: "true"
      PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED: "true"
      PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED: "true"
      PREFECT_API_EVENTS_STREAM_OUT_ENABLED: "true"  # Distributor
      PREFECT_SERVER_ANALYTICS_ENABLED: "true"       # Telemetry
    command: >
      bash -lc "prefect server services start"
    restart: unless-stopped

volumes:
  pgdata:
How this works:
- prefect-api runs only the webserver (API/UI) via --no-services
- prefect-services-a and prefect-services-b both run prefect server services start, but you control which services are active via env vars
- All containers share the same Postgres database via PREFECT_API_DATABASE_CONNECTION_URL
Useful commands and tuning:
- See what services exist and the exact env var names: prefect server services list-services (see the verification sketch at the end of this message for running it per container)
- Start services in the foreground/background (in a container we run foreground): prefect server services start
- You can toggle services via env vars (true/false). Common ones:
  - Scheduler: PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED
  - TaskRunRecorder: PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED
  - Late runs: PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED
  - Triggers: PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED
  - Event logger/persister: PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED, PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED
  - Distributor: PREFECT_API_EVENTS_STREAM_OUT_ENABLED
  - Telemetry: PREFECT_SERVER_ANALYTICS_ENABLED
Links:
- Self-hosted with Docker Compose
- prefect server CLI reference
- Settings reference (env vars)
Questions to confirm:
- Do you want the API/UI to live in one of the service containers instead of a separate prefect-api container? If so, we can collapse them, but keeping them separate is typically cleaner.
- Any other services you specifically want grouped with scheduler/recorder? I can tailor the env toggles to match.
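To double-check which services each container is actually running, and how much CPU each one uses, you can exec into the running containers. A small sketch, assuming the container names from the example above and the prefect CLI being available in the prefecthq/prefect image (it is in the official images):
# List the background services each container knows about
docker compose exec prefect-services-a prefect server services list-services
docker compose exec prefect-services-b prefect server services list-services

# Compare per-container CPU/memory at a glance
docker stats $(docker compose ps -q)
docker stats gives a quick per-container comparison, which is how you can confirm which service group is actually burning CPU.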
v
@Marvin What would be your recommended split of services between containers? How many containers and which services in each?
m
thought for 51.0 seconds
Short answer: I recommend 3 containers in production (plus Postgres):
- 1 for the API/UI
- 1 for "core scheduling" services (scheduler + recorder and a few orchestration loops)
- 1 for "events and triggers" services
This split isolates the most latency-sensitive loops (scheduler/recorder) from bursty work (events, triggers), while keeping things simple to operate. For smaller installs, you can merge the two service containers into one. For larger or spikier workloads, this split tends to be the sweet spot.
Details and exact service grouping
Container A: API/UI (webserver only)
- Runs: prefect server start --no-services --host 0.0.0.0 --port 4200
- Purpose: serves the REST API and UI, no background loops here
- Notes: you can toggle the UI with --ui/--no-ui if needed
Container B: Core scheduling services (time-sensitive orchestration)
Enable:
- PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=true (Scheduler)
- PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=true (TaskRunRecorder)
- PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=true (MarkLateRuns)
- PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=true (FailExpiredPauses)
- PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=true (CancellationCleanup)
- PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=true (Foreman)
Disable here:
- PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=false
- PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false
- PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
- PREFECT_API_EVENTS_STREAM_OUT_ENABLED=false (Distributor)
- PREFECT_SERVER_ANALYTICS_ENABLED=false (Telemetry; optional preference)
Run: prefect server services start
Container C: Events and triggers (bursty/background)
Enable:
- PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=true (Proactive/Reactive/Actions)
- PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=true
- PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=true
- PREFECT_API_EVENTS_STREAM_OUT_ENABLED=true (Distributor)
- PREFECT_SERVER_ANALYTICS_ENABLED=true (Telemetry; optional)
Disable here:
- PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=false
- PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=false
- PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=false
- PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=false
- PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
- PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
Run: prefect server services start
(A minimal Compose sketch of these environment blocks follows at the end of this message.)
Important notes
- Single instance per service: ensure each service is enabled in exactly one container to avoid duplicate processing. The env toggles above enforce that.
- Database: all containers point to the same Postgres via PREFECT_API_DATABASE_CONNECTION_URL.
- Health and resilience: use restart policies and healthchecks; if a services container fails, the other containers keep running.
- Scaling: these loops are designed to run as singletons. Scale by allocating more CPU/RAM to the appropriate container(s), not by running duplicates of the same service.
When to choose fewer containers
- Dev/small: 2 containers is fine
  - API/UI
  - All services together (enable everything in one)
- Medium/most prod: 3 containers (as above)
- Specialized/high-throughput events: keep 3, but give the "events and triggers" container extra CPU/memory
Handy commands and references
- List services and confirm env var names: prefect server services list-services
- CLI reference: Prefect server CLI
- Settings/env vars: Settings reference
- Docker Compose self-hosting guide: Self-hosted with Docker Compose
If you share your workload characteristics (number of runs/day, use of triggers/events, expected spikes), I can tailor the exact split and sizing further.
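As a concrete starting point, here is a minimal sketch of the environment blocks for containers B and C under this split. It reuses the env var names listed above; the service names are just illustrative, and the image, command, depends_on and database URL are omitted because they are identical to the earlier Compose example:
services:
  # Container B: core scheduling (merge into your existing compose file)
  prefect-services-core:
    environment:
      PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED: "true"
      PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED: "true"
      PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED: "true"
      PREFECT_SERVER_SERVICES_FOREMAN_ENABLED: "true"
      PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED: "false"
      PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED: "false"
      PREFECT_API_EVENTS_STREAM_OUT_ENABLED: "false"
      PREFECT_SERVER_ANALYTICS_ENABLED: "false"

  # Container C: events and triggers
  prefect-services-events:
    environment:
      PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED: "true"
      PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED: "true"
      PREFECT_API_EVENTS_STREAM_OUT_ENABLED: "true"
      PREFECT_SERVER_ANALYTICS_ENABLED: "true"
      PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED: "false"
      PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED: "false"
      PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED: "false"
      PREFECT_SERVER_SERVICES_FOREMAN_ENABLED: "false"
Both containers still run prefect server services start; only the toggles differ.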
v
@Marvin what about the repossessor service ?
m
thought for 112.1 seconds
Great callout. In Prefect 3.x, "Repossessor" is not a user-managed service anymore:
- It does not appear in prefect server services list-services
- There's no toggle like PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED
- Its responsibilities are effectively covered by other orchestration loops (e.g., Foreman, CancellationCleanup, LateRuns, TaskRunRecorder)
So you don't need to account for a separate "Repossessor" in your container split. If you're following the 3-container recommendation, just place the nearest analogs in the "Core scheduling" container, which should include:
- Scheduler (and RecentDeploymentsScheduler via the same flag)
- TaskRunRecorder
- LateRuns
- PauseExpirations
- CancellationCleanup
- Foreman
- If you're on a custom/older build where Repossessor shows up, put it here too
To confirm what's actually available in your build, run:
prefect server services list-services
References:
- Prefect server CLI
- Settings reference
v
@Marvin By running services in separate containers I was able to pinpoint the service that eats all my CPU. It's the Scheduler service. Given I have 42 flows, with a maximum of 2 to 3 runs per minute, is that expected ? How can I reduce the CPU load from this service ?
m
thought for 226.2 seconds
Short answer: with ~42 deployments creating 2–3 runs/min, a scheduler pegging a CPU core isn't typical. It usually means the scheduler is doing extra work each cycle (e.g., catching up/backfilling missed runs, scanning too many deployments per loop, or computing lots of schedule occurrences).
Ways to reduce CPU (Prefect 3.x)
Tuning the scheduler via environment variables is the most effective lever. The scheduler service is controlled by env vars prefixed with PREFECT_API_SERVICES_SCHEDULER_. The most relevant knobs:
- LOOP_SECONDS: how often the scheduler wakes up. Increase to reduce wakeups.
- DEPLOYMENT_BATCH_SIZE: how many deployments it evaluates per cycle. Decrease to reduce per-loop work.
- INSERT_BATCH_SIZE: how many flow runs it inserts per transaction. Decrease to reduce bursty DB work.
- MAX_RUNS: cap on how many runs the scheduler will create each cycle. Decrease to throttle work.
- MIN_SCHEDULED_TIME: how far into the past the scheduler will create runs (affects "catch-up"). Reduce to avoid backfilling.
- MAX_SCHEDULED_TIME: how far into the future to schedule runs. Reduce to shrink the planning window.
- RECENT_DEPLOYMENTS_LOOP_SECONDS: how often the "recent deployments" sub-loop runs; increase to reduce churn shortly after creating/updating deployments.
Example settings you can try (docker-compose or k8s env):
PREFECT_API_SERVICES_SCHEDULER_LOOP_SECONDS=60
PREFECT_API_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=50
PREFECT_API_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=100
PREFECT_API_SERVICES_SCHEDULER_MAX_RUNS=50
PREFECT_API_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME=PT5M
PREFECT_API_SERVICES_SCHEDULER_MAX_SCHEDULED_TIME=P14D
PREFECT_API_SERVICES_SCHEDULER_RECENT_DEPLOYMENTS_LOOP_SECONDS=30
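If you set these in docker-compose, they belong in the environment block of whichever container runs the Scheduler. A minimal sketch, with the service name purely illustrative and the rest of the definition (image, command, depends_on, DB URL) unchanged from your existing file:
services:
  prefect-services-scheduler:
    environment:
      PREFECT_API_SERVICES_SCHEDULER_LOOP_SECONDS: "60"
      PREFECT_API_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE: "50"
      PREFECT_API_SERVICES_SCHEDULER_INSERT_BATCH_SIZE: "100"
      PREFECT_API_SERVICES_SCHEDULER_MAX_RUNS: "50"
      PREFECT_API_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME: "PT5M"
      PREFECT_API_SERVICES_SCHEDULER_MAX_SCHEDULED_TIME: "P14D"
      PREFECT_API_SERVICES_SCHEDULER_RECENT_DEPLOYMENTS_LOOP_SECONDS: "30"
The values are passed through as plain environment strings; Prefect parses the integers and ISO-8601 durations itself.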
Notes and gotchas
- Backfill/catch-up: Prefect does not have a global "catchup=False" flag. By default, the scheduler will create runs for a time window that includes some recent past. If your server was down or deployments were just created with a start time far in the past, the scheduler can work hard to catch up. To minimize backfill:
  - Set PREFECT_API_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME small (e.g., PT0S or PT5M).
  - Start schedules "now" or slightly in the future when creating deployments so there aren't many historical occurrences to compute.
  - If you don't need a schedule for some deployments, remove it (use NoSchedule) and trigger runs ad hoc.
- Schedule complexity: many per-minute cron schedules or complex RRules across many deployments can increase CPU. If possible, prefer a slightly lower frequency or simpler schedules (a staggered-cron sketch follows at the end of this message).
- Database: using Postgres is recommended; SQLite will exacerbate scheduler work. Also ensure your DB has sufficient CPU/IO, since some "scheduler CPU" can actually be time spent in DB calls.
- Single-core saturation: the scheduler is a single Python process; 100% of one core is "full" for it. On small machines, that can look like high overall CPU even if it's just one core.
Docs
- Settings reference (lists the scheduler env vars and defaults): Prefect 3 settings reference
- Scheduling concepts: Prefect 3 schedules
If you can share a bit more, I can tailor the guidance:
- Exact Prefect version
- Your DB backend (Postgres vs SQLite)
- A sample of your schedules (cron/interval and frequencies)
- Whether the high CPU is steady-state or only after restarts/downtime
With that, I can suggest concrete values for the env vars above and spot any schedule patterns that are causing extra work.
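To make the "lower the frequency / stagger the schedules" idea concrete, here is a rough prefect.yaml sketch. The deployment names, entrypoint and work pool are placeholders, and the schedule fields are written from memory of the Prefect 3 deployment schema, so double-check them against the prefect.yaml reference for your version:
deployments:
  - name: etl-group-a                   # placeholder deployment
    entrypoint: flows/etl.py:etl_flow   # placeholder entrypoint
    work_pool:
      name: default                     # placeholder work pool
    schedules:
      # Every 5 minutes (minutes 0, 5, 10, ...) instead of every minute
      - cron: "*/5 * * * *"
        timezone: "UTC"

  - name: etl-group-b
    entrypoint: flows/etl.py:etl_flow
    work_pool:
      name: default
    schedules:
      # Same frequency, offset by one minute so the two groups never fire together
      - cron: "1-59/5 * * * *"
        timezone: "UTC"
Splitting 42 minutely deployments into a handful of staggered groups like this reduces both the number of occurrences the scheduler has to compute and the size of its per-minute insert bursts.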