Vincent Chéry
09/01/2025, 3:59 PM
I run two Prefect server containers: one started with --no-services and one with services only. I have 42 flows deployed, polled by a single worker, and run 2 to 5 flows each minute. The services container is eating all available CPU. Is that normal? How come it is so greedy? Any tips to reduce the resource consumption of the system?
Marvin
09/01/2025, 4:04 PM
1) Tune how often each service loop runs via PREFECT_SERVER_SERVICES_<SERVICE>_LOOP_SECONDS
- Example knobs that are typically safe to increase:
- Scheduler: PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=5 or 10
- Task recorder: PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_LOOP_SECONDS=5
- Repossessor: PREFECT_SERVER_SERVICES_REPOSSESSOR_LOOP_SECONDS=15
- Late runs: PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=15
- Pause expirations: PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS=60
- Foreman: PREFECT_SERVER_SERVICES_FOREMAN_LOOP_SECONDS=5 (or disable if not using concurrency)
- Full settings reference: Prefect settings reference
2) Disable services you don’t use
- If you don’t use concurrency limits, disable the Foreman:
- PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
- Telemetry is optional in most self-hosted setups:
- PREFECT_SERVER_SERVICES_TELEMETRY_ENABLED=false
- Pattern: PREFECT_SERVER_SERVICES_<SERVICE>_ENABLED=false
- See service list here: Services API reference
3) Reduce schedule density where feasible
- If you have many per-minute schedules, consider staggering or batching them.
- A single deployment with a cron schedule and parameterization can sometimes replace many similar minutely schedules (see the prefect.yaml sketch after this list).
4) Use Postgres for the API database (if not already)
- SQLite can cause contention under write-heavy workloads. Postgres typically improves throughput and reduces CPU in the Python process.
5) Cap and/or split resources at the container level
- It’s reasonable to cap the services container so it doesn’t starve the rest of your stack:
services:
  prefect-services:
    cpus: "2.0"
    mem_limit: "2g"
- If you want, you can run multiple services containers and split the services (e.g., one for scheduler + recorder, another for the rest) to isolate hotspots. The CLI supports starting services independently — check the help for current flags and service names:
- Docs: prefect server CLI
- If you prefer a single command to run the API/UI: prefect server start --no-services (as you're doing) and run services in a separate container.
6) Turn down logging verbosity
- PREFECT_LOGGING_LEVEL=INFO (or WARNING) in the services container can reduce overhead.
7) Upgrade to the latest Prefect 3.x
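For point 3, here's a rough sketch of what consolidation can look like in prefect.yaml — one deployment carrying several schedules, each with its own parameters, instead of several near-identical minutely deployments. The entrypoint, slugs and the source parameter are made-up placeholders, and per-schedule parameters/slug need a reasonably recent 3.x, so check the schedules docs for your version:
# prefect.yaml (sketch) — one deployment, several schedules with their own parameters
deployments:
  - name: sync
    entrypoint: flows/sync.py:sync          # hypothetical flow entrypoint
    work_pool:
      name: default
    schedules:
      - cron: "*/5 * * * *"                 # every 5 minutes
        slug: customers
        parameters:
          source: customers                 # hypothetical parameter
      - cron: "2-59/5 * * * *"              # same cadence, staggered by 2 minutes
        slug: orders
        parameters:
          source: orders
Staggering the cron expressions like this also spreads scheduler and worker load across the minute instead of firing everything at the same instant.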
Putting the settings above together for the services container:
environment:
- PREFECT_LOGGING_LEVEL=INFO
- PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
- PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_LOOP_SECONDS=5
- PREFECT_SERVER_SERVICES_REPOSSESSOR_LOOP_SECONDS=15
- PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=15
- PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS=60
# Disable services you do not use:
- PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
- PREFECT_SERVER_SERVICES_TELEMETRY_ENABLED=false
How to diagnose which service is hot
- Temporarily set PREFECT_LOGGING_LEVEL=DEBUG on the services container and watch the logs — each service logs its loop ticks and timings, which can reveal the busiest loop (see the override sketch below).
- If using Postgres, also check DB CPU; the services process can appear busy if the DB is slow and it’s doing many queries.
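If it helps, a minimal docker-compose.override.yml sketch for flipping only the services container to DEBUG without touching your main compose file — the service name prefect-services is an assumption, match whatever yours is called:
# docker-compose.override.yml — picked up automatically by `docker compose up`
services:
  prefect-services:                   # assumed name of your services container
    environment:
      PREFECT_LOGGING_LEVEL: "DEBUG"  # revert to INFO/WARNING once you've found the hot loop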
A few links for reference
- Self-hosting overview: Self-hosted guide
- Services API reference (list of services): prefect.server.services
- Server CLI (including services commands): prefect server CLI
- Settings reference: Settings reference
To help tailor this, could you share:
- Your Prefect version (e.g., 3.1.x)
- DB backend (SQLite vs Postgres)
- Whether you use concurrency limits or automations
- A redacted snippet of the docker-compose for the services container (env and command)
- Any logs showing which service is looping frequently
With that, I can suggest a minimal set of exact settings to change for your setup.
Vincent Chéry
09/02/2025, 12:19 PM
I'm seeing errors like this in the services container:
12:14:32.852 | ERROR | prefect.server.services.recentdeploymentsscheduler - Unexpected error in: TimeoutError()
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/asyncpg/connection.py", line 2421, in connect
return await connect_utils._connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/asyncpg/connect_utils.py", line 1049, in _connect
conn = await _connect_addr(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/asyncpg/connect_utils.py", line 886, in _connect_addr
return await __connect_addr(params, True, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/asyncpg/connect_utils.py", line 931, in __connect_addr
tr, pr = await connector
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/asyncpg/connect_utils.py", line 802, in _create_ssl_connection
tr, pr = await loop.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/asyncio/base_events.py", line 1159, in create_connection
transport, protocol = await self._create_connection_transport(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/asyncio/base_events.py", line 1192, in _create_connection_transport
await waiter
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/prefect/server/services/base.py", line 242, in start
await self.run_once()
File "/usr/local/lib/python3.12/site-packages/prefect/server/services/scheduler.py", line 112, in run_once
inserted_runs = await self._insert_scheduled_flow_runs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/prefect/server/services/scheduler.py", line 300, in _insert_scheduled_flow_runs
return await models.deployments._insert_scheduled_flow_runs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/prefect/server/models/deployments.py", line 818, in _insert_scheduled_flow_runs
await session.execute(
File "/usr/local/lib/python3.12/site-packages/sqlalchemy/ext/asyncio/session.py", line 463, in execute
result = await greenlet_spawn(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 201, in greenlet_spawn
result = context.throw(*sys.exc_info())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[...]
TimeoutError
Or this one:
12:12:00.391 | ERROR | prefect.server.events.triggers - Error running periodic evaluation
[...]
Can you tell me what's wrong?
Marvin
09/02/2025, 12:22 PM
Both errors point at the database:
- The first traceback is the scheduler service (prefect.server.services.recentdeploymentsscheduler) timing out while opening a connection: asyncpg connect → TimeoutError/CancelledError
- The “periodic evaluation” error in prefect.server.events.triggers is likely the same root cause (DB access timing out)
- When services can’t connect promptly, they’ll keep retrying in their loops. Even if logs aren’t “chatty,” those retries can keep the event loop busy and show up as high CPU in the services container.
Most common root causes
- Postgres isn’t responding quickly enough (load, I/O, or too few connections available)
- Connection pool exhaustion (default pool is small)
- Too short DB timeouts (connect/statement)
- SSL handshake issues (trying SSL when not needed/allowed)
- DNS/network latency between the services container and Postgres
- SQLite (if used) under write pressure — but your stack trace shows asyncpg, so you’re on Postgres
What to check and change
1) Confirm DB and connectivity basics
- Ensure the URL is async and points to the right host:
- PREFECT_SERVER_DATABASE_CONNECTION_URL=<postgresql+asyncpg://user:pass@postgres:5432/prefect>
- If Postgres is on the same Docker network and you don’t require TLS, disable SSL to avoid slow handshakes:
- Append ?sslmode=disable to the connection URL:
- <postgresql+asyncpg://user:pass@postgres:5432/prefect?sslmode=disable>
- Verify Postgres logs around the same timestamps — look for connection timeouts, too many connections, or slow queries.
2) Increase connection/pool settings for the services container
- Prefect 3 exposes these as settings (mapped to AsyncPostgresConfiguration). In the services container, set the “server” variants:
# services container
PREFECT_SERVER_DATABASE_CONNECTION_URL=<postgresql+asyncpg://user:pass@postgres:5432/prefect?sslmode=disable>
PREFECT_SERVER_DATABASE_CONNECTION_TIMEOUT=30 # connect timeout (seconds)
PREFECT_SERVER_DATABASE_TIMEOUT=60 # statement timeout (seconds)
PREFECT_SERVER_DATABASE_SQLALCHEMY_POOL_SIZE=10
PREFECT_SERVER_DATABASE_SQLALCHEMY_MAX_OVERFLOW=20
- In the API/UI container, use the same but with the PREFECT_API_DATABASE_* prefix.
References:
- Settings reference
- Database configuration API
3) If you don’t need TLS between containers
- Disabling SSL (see sslmode=disable) often removes connect latency and handshake failures inside Compose networks.
4) Ensure Postgres can accept the load
- Check max_connections and whether you're hitting it.
- Monitor Postgres CPU and I/O. If it’s pegged, the services will time out waiting for connections.
- If you have many Prefect components connecting, consider pgbouncer, but most small installs are fine with a larger pool and max_overflow.
5) Slow service loops slightly to reduce pressure
- Raising loop intervals reduces how aggressively services re-attempt DB work:
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_LOOP_SECONDS=5
PREFECT_SERVER_SERVICES_REPOSSESSOR_LOOP_SECONDS=15
PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=15
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS=60
# If you do not use concurrency limits:
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
- Docs: Services reference and Settings reference
6) Cap or split the services container if needed
- It’s okay to cap CPU so retries don’t starve the node while you tune:
cpus: "2.0"
mem_limit: "2g"
- You can also split services into separate containers to isolate hot loops if desired.
Why this should help
- Increasing connection_timeout and the pool size gives the scheduler/trigger evaluators more headroom to obtain connections.
- Disabling SSL (if not required) avoids slow TLS handshakes you're seeing in _create_ssl_connection.
- Slightly slower loops + fewer retries reduce CPU churn when the DB is slow.
- Postgres-side checks ensure there’s not a true DB bottleneck causing these timeouts.
If you can share:
- Your services container env block (redacted URL/user)
- Postgres version and how it’s deployed in compose
- Whether you need TLS between containers
I can suggest the exact minimal set of env vars to add to your compose file.
Vincent Chéry
09/02/2025, 12:56 PM
Here is my docker compose file:
services:
  postgres:
    image: postgres:15
    container_name: prefect-db
    environment:
      POSTGRES_USER: prefect
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?error}
      POSTGRES_DB: prefect
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U prefect"]
      interval: 10s
      timeout: 5s
      retries: 5
    ports:
      - "5432:5432"
    restart: always

  redis:
    image: redis:7

  prefect-api:
    image: prefecthq/prefect:3-python3.12
    container_name: prefect-server
    depends_on:
      postgres:
        condition: service_healthy
      migrate:
        condition: service_completed_successfully
      redis:
        condition: service_started
    environment:
      # Prefect API and UI configuration
      PREFECT_API_URL: ${PREFECT_API_URL:?error}
      PREFECT_UI_URL: ${PREFECT_UI_URL:?error}
      # Database configuration
      PREFECT_SERVER_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect>:${POSTGRES_PASSWORD:?error}@postgres:5432/prefect
      PREFECT_API_DATABASE_MIGRATE_ON_START: "false"
      # Redis
      PREFECT_MESSAGING_BROKER: prefect_redis.messaging
      PREFECT_MESSAGING_CACHE: prefect_redis.messaging
      PREFECT_SERVER_EVENTS_CAUSAL_ORDERING: prefect_redis.ordering
      PREFECT_SERVER_CONCURRENCY_LEASE_STORAGE: prefect_redis.lease_storage
      PREFECT_REDIS_MESSAGING_HOST: redis
      PREFECT_REDIS_MESSAGING_PORT: "6379"
    ports:
      - "4200:4200"
    restart: always
    command: ["prefect", "server", "start", "--host", "0.0.0.0", "--no-services"]

  migrate:
    image: prefecthq/prefect:3-python3.12
    depends_on:
      postgres:
        condition: service_healthy
    command: prefect server database upgrade -y
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect>:${POSTGRES_PASSWORD:?error}@postgres:5432/prefect

  prefect-background:
    image: prefecthq/prefect:3-python3.12
    container_name: prefect-background-services
    depends_on:
      postgres:
        condition: service_healthy
      migrate:
        condition: service_completed_successfully
      redis:
        condition: service_started
    command: prefect server services start
    environment:
      PREFECT_LOGGING_LEVEL: "DEBUG"
      PREFECT_API_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect>:${POSTGRES_PASSWORD:?error}@postgres:5432/prefect
      PREFECT_API_DATABASE_MIGRATE_ON_START: "false"
      PREFECT_MESSAGING_BROKER: prefect_redis.messaging
      PREFECT_MESSAGING_CACHE: prefect_redis.messaging
      PREFECT_SERVER_EVENTS_CAUSAL_ORDERING: prefect_redis.ordering
      PREFECT_SERVER_CONCURRENCY_LEASE_STORAGE: prefect_redis.lease_storage
      PREFECT_REDIS_MESSAGING_HOST: redis
      PREFECT_REDIS_MESSAGING_PORT: "6379"

volumes:
  postgres_data:
Marvin
09/02/2025, 12:58 PM
A few things stand out:
1) Database env var prefixes are mixed up
- In your API container you set PREFECT_SERVER_DATABASE_CONNECTION_URL
- In your services container you set PREFECT_API_DATABASE_CONNECTION_URL
In Prefect 3.x:
- The API/UI process reads PREFECT_API_DATABASE_*
- The background services process reads PREFECT_SERVER_DATABASE_*
When these are swapped, each process may fall back to defaults or attempt SSL handshakes unexpectedly, causing connect timeouts and slow queries.
2) SSL handshake overhead inside Docker
Your stack trace shows _create_ssl_connection during connect. Within a trusted Docker network you can usually disable SSL to avoid slow/aborted handshakes.
3) Default connection pool and timeout values can be too tight
Defaults are conservative (small pool and short timeouts) and can lead to statement/connection timeouts under modest load or when the DB is starved of CPU.
4) CPU starvation feedback loop
If the services container pegs CPU, Postgres can’t process queries fast enough, causing timeouts that make services retry aggressively, which keeps the CPU pegged. Give Postgres some reserved CPU and cap the services container.
What I recommend changing immediately
A) Fix env var prefixes and disable SSL on the internal network
- API/UI container should use PREFECT_API_DATABASE_*.
- Services container should use PREFECT_SERVER_DATABASE_*.
- Add sslmode=disable to the asyncpg URL (safe on the internal Docker network).
- Increase timeouts and pool sizes a bit.
Example adjustments:
# prefect-api
environment:
  PREFECT_API_URL: ${PREFECT_API_URL:?error}
  PREFECT_UI_URL: ${PREFECT_UI_URL:?error}
  # Use API_* here
  PREFECT_API_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?sslmode=disable>
  PREFECT_API_DATABASE_CONNECTION_TIMEOUT: "30"   # connect timeout (s)
  PREFECT_API_DATABASE_TIMEOUT: "60"              # statement timeout (s)
  PREFECT_API_DATABASE_SQLALCHEMY_POOL_SIZE: "10"
  PREFECT_API_DATABASE_SQLALCHEMY_MAX_OVERFLOW: "20"
  # Redis config unchanged...

# prefect-background (services)
environment:
  PREFECT_LOGGING_LEVEL: "DEBUG"
  # Use SERVER_* here
  PREFECT_SERVER_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?sslmode=disable>
  PREFECT_SERVER_DATABASE_CONNECTION_TIMEOUT: "30"
  PREFECT_SERVER_DATABASE_TIMEOUT: "60"
  PREFECT_SERVER_DATABASE_SQLALCHEMY_POOL_SIZE: "10"
  PREFECT_SERVER_DATABASE_SQLALCHEMY_MAX_OVERFLOW: "20"
  # Redis config unchanged...
B) Give Postgres CPU and cap services CPU
This prevents the starvation loop:
services:
  postgres:
    # ...
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 2g
        reservations:
          cpus: "1.0"
  prefect-background:
    # ...
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1g
If you’re not using Swarm/Compose v3 deploy, you can use:
cpus: "1.0"
mem_limit: "1g"
C) Slightly slow down service loops (reduces “retry churn”)
These won’t materially affect latency for your scale, but reduce pressure while you tune DB:
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_LOOP_SECONDS=5
PREFECT_SERVER_SERVICES_REPOSSESSOR_LOOP_SECONDS=15
PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=15
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_LOOP_SECONDS=60
# If you don't use concurrency limits:
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
D) Enable Postgres slow query logging while you test
postgres:
  image: postgres:15
  command:
    - "postgres"
    - "-c"
    - "log_min_duration_statement=1000"
    - "-c"
    - "max_connections=200"
    - "-c"
    - "shared_buffers=256MB"
Why a fresh DB can still be “slow”
- If the services container hogs CPU, Postgres can’t respond promptly even for simple queries (connects/metadata) — you’ll see >10s waits and timeouts.
- SSL negotiation attempts on every connect can add latency or fail if misconfigured.
- Connection pool defaults can force more connects under load; with short timeouts that manifests as timeouts rather than true long-running queries.
- If API/SERVER DB settings are mixed up, one of the processes may be defaulting to unexpected behavior (including SSL attempts, tiny timeouts).
References
- Settings reference: Prefect settings reference
- Database configuration API (shows the timeout/pool knobs): Database configuration API
If you can:
- Apply the prefix fix + sslmode=disable + pool/timeouts
- Cap services CPU and give Postgres a bit more
- Share a snippet of Postgres slow query logs (just one example) and whether the timeouts go away
I can help confirm if anything else needs tuning (e.g., specific indexes or loop intervals).
Vincent Chéry
09/02/2025, 1:10 PM
With ?sslmode=disable appended to the connection string I get:
TypeError: connect() got an unexpected keyword argument 'sslmode'
Marvin
09/02/2025, 1:11 PM
That error is expected:
- Prefect 3 talks to Postgres through SQLAlchemy's asyncpg driver (postgresql+asyncpg://...).
- sslmode= is a psycopg/psql parameter. asyncpg does not accept sslmode and raises: "connect() got an unexpected keyword argument 'sslmode'".
- For asyncpg, use ssl=false (or omit SSL entirely) instead of sslmode=disable.
What to use instead
- Disable SSL on the internal Docker network with ?ssl=false.
Also, make sure the API/UI and services containers use the correct prefixes (they’re swapped in your compose):
- In prefect-api (API/UI):
PREFECT_API_DATABASE_CONNECTION_URL=<postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?ssl=false>
PREFECT_API_DATABASE_CONNECTION_TIMEOUT=30
PREFECT_API_DATABASE_TIMEOUT=60
PREFECT_API_DATABASE_SQLALCHEMY_POOL_SIZE=10
PREFECT_API_DATABASE_SQLALCHEMY_MAX_OVERFLOW=20
- In prefect-background (services):
PREFECT_SERVER_DATABASE_CONNECTION_URL=<postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?ssl=false>
PREFECT_SERVER_DATABASE_CONNECTION_TIMEOUT=30
PREFECT_SERVER_DATABASE_TIMEOUT=60
PREFECT_SERVER_DATABASE_SQLALCHEMY_POOL_SIZE=10
PREFECT_SERVER_DATABASE_SQLALCHEMY_MAX_OVERFLOW=20
Notes
- Remove the angle brackets in your redacted example — they’ll break parsing (e.g., use `postgresql+asyncpg://prefect:${POSTGRES_PASSWORD}@postgres:5432/prefect?ssl=false`).
- Keeping the pool/timeouts from above + ssl=false should eliminate the connect/handshake delays and reduce those 10s "slow queries" that were really connection waits.
References
- Prefect settings reference: Settings reference
- Prefect DB configuration API (shows asyncpg options): Database configuration API
If you update the URLs as above and still see timeouts, send one example from Postgres logs with duration and query text — we can dig into any remaining slow statements.
Vincent Chéry
09/02/2025, 4:06 PM
I see a lot of sessions idle in transaction in Postgres. See:
> SELECT datname, state, (now() - pg_stat_activity.query_start) as time_running, query FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '1 second' ORDER BY 3;
datname | state | time_running | query
---------+---------------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
prefect | idle in transaction | 00:00:17.271294 | DELETE FROM automation_event_follower WHERE automation_event_follower.scope = $1::VARCHAR AND automation_event_follower.follower_event_id = $2::UUID
prefect | idle in transaction | 00:00:17.276217 | SELECT work_pool.id AS work_pool_id, work_pool.name AS work_pool_name, work_pool.description AS work_pool_description, work_pool.type AS work_pool_type, work_pool.base_job_template AS work_pool_base_job_template, work_pool.is_paused AS work_pool_is_paused, work_pool.default_queue_id AS work_pool_default_queue_id, work_pool.concurrency_limit AS work_pool_concurrency_limit, work_pool.status AS work_pool_status, work_pool.last_transitioned_status_at AS work_pool_last_transitioned_status_at, work_pool.last_status_event_id AS work_pool_last_status_event_id, work_pool.storage_configuration AS work_pool_storage_configuration, work_pool.created AS work_pool_created, work_pool.updated AS work_pool_updated +
| | | FROM work_pool +
| | | WHERE work_pool.id IN ($1::UUID)
prefect | idle in transaction | 00:00:17.27632 | ;
prefect | idle in transaction | 00:00:17.276367 | ;
prefect | idle in transaction | 00:00:17.276432 | BEGIN;
prefect | active | 00:00:17.276509 | ;
prefect | idle in transaction | 00:00:17.289364 | ;
prefect | idle in transaction | 00:00:17.289681 | BEGIN;
prefect | idle in transaction | 00:00:17.300685 | SELECT deployment_schedule.deployment_id, deployment_schedule.schedule, deployment_schedule.active, deployment_schedule.max_scheduled_runs, deployment_schedule.parameters, deployment_schedule.slug, deployment_schedule.id, deployment_schedule.created, deployment_schedule.updated +
| | | FROM deployment_schedule +
| | | WHERE deployment_schedule.deployment_id = $1::UUID AND deployment_schedule.active IS true ORDER BY deployment_schedule.updated DESC
prefect | idle | 00:02:26.987803 | ROLLBACK;
prefect | idle | 00:24:24.62834 | LISTEN "prefect_automation_changes"
prefect | idle | 00:24:30.749047 | LISTEN "prefect_automation_changes"
Marvin
09/02/2025, 4:12 PM
Two different things show up in that output:
1) The idle LISTEN "prefect_automation_changes" sessions are Prefect's automations change listener (listen_for_automation_changes).
- These connections will typically sit "idle" for a long time. They are not in a transaction and are harmless.
2) idle in transaction sessions with queries like DELETE FROM automation_event_follower…, SELECT … FROM work_pool …
- "idle in transaction" means a transaction was opened (BEGIN) and not yet committed/rolled back. Short-lived idles (a few seconds) can occur during normal request handling, but long-lived ones (minutes+) are not expected and can cause autovacuum bloat or lock contention.
- Prefect’s API uses async SQLAlchemy sessions and transactions under the hood (see DB config: database/configurations.py and ORM models including `AutomationEventFollower`: database/orm_models.py). Those DELETEs you see are part of the automations/event follower cleanup path.
A few targeted recommendations
- First, determine if these are transient or persistent:
- Run:
SELECT pid, application_name, state, xact_start, (now() - xact_start) AS xact_age, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE datname = 'prefect' AND state = 'idle in transaction'
ORDER BY xact_start;
- If you see “idle in transaction” persisting for minutes/hours, that’s worth action. If it’s 10–30s and clears, that’s typically fine.
- Protect the database with Postgres’ idle-in-transaction timeout
- Configure idle_in_transaction_session_timeout
at the DB or role level to automatically terminate sessions that idle in a transaction too long. Many teams use 60s–5m:
ALTER DATABASE prefect SET idle_in_transaction_session_timeout = '60s';
-- or at the role level:
ALTER ROLE <prefect_db_user> SET idle_in_transaction_session_timeout = '60s';
- This prevents bloat from orphaned transactions if anything downstream misbehaves.
- Keep Prefect up to date
- If you’re not on the latest 3.x, upgrade. There are ongoing improvements in the automations/events path and DB handling. Latest notes: Prefect releases (3.4.15 notes: link).
- Tune connection pooling if you see many concurrent sessions
- Prefect's async Postgres configuration supports pool and overflow limits: sqlalchemy_pool_size, sqlalchemy_max_overflow, etc. See the config class (parameters listed here): AsyncPostgresConfiguration.
- If you’re seeing a lot of concurrent API traffic and spikes of short-lived transactions, reducing pool size/overflow or increasing it (depending on your DB capacity) can stabilize session churn.
- Sanity-check Postgres version
- Prefect 3.x docs recommend Postgres 14.9+ (documented recently: release notes).
If you’re still seeing a buildup of “idle in transaction” that lasts for minutes+:
- Please share:
- Prefect version
- Self-hosted vs. Cloud
- Postgres version
- How long these idle-in-transaction sessions persist and whether they correlate with specific API calls or background services (e.g., automations changes, scheduler)
Vincent Chéry
09/02/2025, 4:15 PM
datname | state | time_running | query
---------+--------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
prefect | active | 00:00:28.720579 | INSERT INTO task_run (flow_run_id, task_key, dynamic_key, flow_run_run_count, empirical_policy, task_inputs, tags, labels, state_id, name, state_type, state_name, state_timestamp, run_count, expected_start_time, total_run_time, id, created, updated) VALUES ($1::UUID, $2::VARCHAR, $3::VARCHAR, $4::INTEGER, $5, $6, $7::JSONB, $8::JSONB, $9::UUID, $10::VARCHAR, $11::state_type, $12::VARCHAR, $13::TIMESTAMP WITH TIME ZONE, $14::INTEGER, $15::TIMESTAMP WITH TIME ZONE, $16::INTERVAL, $17::UUID, $18::TIMESTAMP WITH TIME ZONE, $19::TIMESTAMP WITH TIME ZONE) ON CONFLICT (id) DO UPDATE SET flow_run_id = $20::UUID, task_key = $21::VARCHAR, dynamic_key = $22::VARCHAR, flow_run_run_count = $23::INTEGER, empirical_policy = $24, task_inputs = $25, tags = $26::JSONB, labels = $27::JSONB, state_id = $28::UUID, name = $29::VARCHAR, state_type = $30::state_type, state_name = $31::VARCHAR, state_timestamp = $32::TIMESTAMP WITH TIME ZONE, run_count = $33::INTEGER, expected_start_time = $34::TIMESTAMP WITH TIME ZONE, total_run_ti
Marvin
09/02/2025, 4:19 PM
An upsert into task_run is a very common write path in Prefect (created and updated on task state transitions), but if it's "active" for a long time it's usually waiting on a lock or I/O rather than actually doing work.
What to check right now
- Is it actually waiting on a lock?
SELECT pid, state, wait_event_type, wait_event,
now() - query_start AS age, query
FROM pg_stat_activity
WHERE datname = 'prefect' AND pid = <that_pid>;
If wait_event_type = 'Lock', it's blocked by another session.
- Who is blocking it?
SELECT
a.pid AS blocked_pid,
a.wait_event_type, a.wait_event,
now() - a.query_start AS blocked_for,
a.query AS blocked_query,
b.pid AS blocking_pid,
now() - b.query_start AS blocking_for,
b.state AS blocking_state,
b.query AS blocking_query,
l.relation::regclass AS blocked_relation,
l.locktype, l.mode
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted
JOIN pg_locks bl ON bl.locktype = l.locktype
AND bl.database IS NOT DISTINCT FROM l.database
AND bl.relation IS NOT DISTINCT FROM l.relation
AND bl.page IS NOT DISTINCT FROM l.page
AND bl.tuple IS NOT DISTINCT FROM l.tuple
AND bl.virtualxid IS NOT DISTINCT FROM l.virtualxid
AND bl.transactionid IS NOT DISTINCT FROM l.transactionid
AND bl.classid IS NOT DISTINCT FROM l.classid
AND bl.objid IS NOT DISTINCT FROM l.objid
AND bl.objsubid IS NOT DISTINCT FROM l.objsubid
JOIN pg_stat_activity b ON b.pid = bl.pid
WHERE a.datname = 'prefect' AND a.pid = <that_pid>;
This will tell you exactly which session/table is holding the lock.
- See all blockers at once:
SELECT pid,
pg_blocking_pids(pid) AS blocking_pids,
state, wait_event_type, wait_event,
now() - query_start AS age,
query
FROM pg_stat_activity
WHERE datname = 'prefect' AND (wait_event_type IS NOT NULL OR state='idle in transaction')
ORDER BY age DESC;
Common root causes we see
- An "idle in transaction" session is the blocker. If a request opened a transaction and never committed/rolled back, it can hold row/table locks that block the task_run UPSERT. This matches the earlier "idle in transaction" you're seeing.
- FK contention: task_run.flow_run_id has an FK to flow_run. If another transaction is updating/deleting the referenced flow_run row, your insert can wait on that row lock.
- Schema changes/migrations or maintenance running concurrently.
- Autovacuum not keeping up on hot tables (e.g., task_run, task_run_state, flow_run, flow_run_state) leading to bloat and slower constraint/index checks.
Immediate mitigations
- Set a safety net for idle transactions:
ALTER DATABASE prefect SET idle_in_transaction_session_timeout = '60s';
-- or at the role level:
ALTER ROLE <prefect_user> SET idle_in_transaction_session_timeout = '60s';
This prevents idle transactions from lingering and blocking writers.
- Optionally set a statement timeout to prevent runaway writes:
ALTER DATABASE prefect SET statement_timeout = '60s';
Use with care — it will surface as errors to the client if statements legitimately take longer.
- If you find a clear blocker, you can terminate it:
SELECT pg_terminate_backend(<blocking_pid>);
Only do this if you know it’s safe for your workload.
Longer-term checks
- Make sure you’re on a current Prefect 3.x release; we regularly improve DB interactions. Releases: Prefect releases. Task/flow ORM and DB config are here:
- ORM models (includes TaskRun/FlowRun definitions): orm_models.py
- Keep autovacuum healthy on the hot tables: task_run, task_run_state, flow_run, flow_run_state, log
Example inspection:
SELECT relname, n_live_tup, n_dead_tup, vacuum_count, autovacuum_count
FROM pg_stat_user_tables
WHERE relname IN ('task_run','task_run_state','flow_run','flow_run_state','log')
ORDER BY n_dead_tup DESC;
- If you have high API/worker concurrency, right-size the SQLAlchemy pool to match DB capacity to reduce connection/lock thrash. See sqlalchemy_pool_size and sqlalchemy_max_overflow in AsyncPostgresConfiguration.
If you can share:
- Prefect version, self-hosted or Cloud
- Postgres version
- The output of the blocker queries above (wait_event_type, relation, blocking_pid/blocking_query)
- Typical duration of these “active” sessions and whether they correlate with spikes in flow/task activity
…I can help pinpoint exactly which Prefect path (e.g., a specific API route, automations cleanup, scheduler) is holding the lock and suggest a targeted fix.
Vincent Chéry
09/02/2025, 4:38 PM
Now I have an INSERT INTO events active for 50 seconds and the wait_event is ClientRead
Marvin
09/02/2025, 4:39 PM
ClientRead means Postgres is idle on that connection, waiting for the client (the Prefect API process) to send more data — the server isn't actually crunching the insert. First, identify which client/session it is:
SELECT pid, application_name, client_addr, usename, backend_start, query_start,
state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE pid = <pid>;
- application_name is settable by Prefect (see connection_app_name in the DB config: AsyncPostgresConfiguration). This helps attribute sessions to API pods/workers.
- Correlate with API logs/metrics around that time
- Check the Prefect API container logs for slow handler warnings or exceptions around that timestamp.
- Look for CPU saturation, event loop blocking, or GC pauses on the API pod/host.
- If you’re using PgBouncer or a reverse proxy, check its logs too.
Network/infra considerations
- If there’s a proxy between API and Postgres (PgBouncer, LB, sidecar), ClientRead can be the database waiting on the proxy which is waiting on the API. Check for idle connection timeouts and buffering limits there.
- Enable TCP keepalives on Postgres to detect dead or stuck clients faster:
ALTER SYSTEM SET tcp_keepalives_idle = 30;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 3;
SELECT pg_reload_conf();
- Consider a conservative statement timeout to surface these stalls as errors instead of long-lived transactions:
ALTER DATABASE prefect SET statement_timeout = '60s';
Be aware this will error out long-running legitimate statements, so tune by environment.
Prefect specifics you can review
- The events write path is handled by the API endpoint here: server/api/events.py (function create_events), and events clients are here: server/events/clients.py. It should open a session, insert, and commit promptly. If the API process is starved or blocked mid-request, the transaction can linger.
- DB connection/pooling settings (pool size, overflow, statement cache, application_name) are here: database/configurations.py. Mismatch between pool size and DB capacity can amplify backpressure.
- Make sure you’re on a current 3.x release: Prefect releases. We continue to improve DB interaction and events handling.
If you can share:
- Prefect version and how the API is deployed (k8s/gunicorn/uvicorn workers, PgBouncer yes/no)
- Postgres version
- Whether the ClientRead cases align with load spikes (many event emissions) or specific pods
- application_name/client_addr for those PIDs and concurrent API logs
…I can help pinpoint whether this is event-loop blocking in the API, proxy behavior, or something else. In the meantime, adding a reasonable statement_timeout and enabling TCP keepalives will keep these from accumulating and aid diagnosis.
Vincent Chéry
09/04/2025, 8:25 AM
Is it possible to run the background services in several containers, each running only a subset of the services?
Marvin
09/04/2025, 8:30 AM
Yes — with Prefect 3.x you can:
- Run the API/UI alone with prefect server start --no-services --host 0.0.0.0 --port 4200
- Run only the background services with prefect server services start
- Enable/disable specific services via environment variables
Below is a working Compose example that:
- Starts Postgres
- Runs the API/UI in one container
- Splits services across two containers:
- services-a: Scheduler + TaskRunRecorder
- services-b: all other services
version: "3.9"
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: prefect
      POSTGRES_PASSWORD: prefect
      POSTGRES_DB: prefect
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U prefect -d prefect"]
      interval: 5s
      timeout: 5s
      retries: 20

  # API/UI only (no background services)
  prefect-api:
    image: prefecthq/prefect:3-latest
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect:prefect@postgres:5432/prefect>
      PREFECT_UI_ENABLED: "true"
      PREFECT_SERVER_LOGGING_LEVEL: INFO
    # Keep API/UI in this container; services run elsewhere
    command: >
      bash -lc "prefect server start --no-services --host 0.0.0.0 --port 4200"
    ports:
      - "4200:4200"
    restart: unless-stopped

  # Services container A: Scheduler + TaskRunRecorder only
  prefect-services-a:
    image: prefecthq/prefect:3-latest
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect:prefect@postgres:5432/prefect>
      # Enable only the services you want in this container
      PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED: "true"
      # Disable everything else here
      PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED: "false"
      PREFECT_SERVER_SERVICES_FOREMAN_ENABLED: "false"
      PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED: "false"
      PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED: "false"
      PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED: "false"
      PREFECT_API_EVENTS_STREAM_OUT_ENABLED: "false"  # Distributor
      PREFECT_SERVER_ANALYTICS_ENABLED: "false"       # Telemetry
    command: >
      bash -lc "prefect server services start"
    restart: unless-stopped

  # Services container B: everything else (no Scheduler/Recorder)
  prefect-services-b:
    image: prefecthq/prefect:3-latest
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: <postgresql+asyncpg://prefect:prefect@postgres:5432/prefect>
      # Disable scheduler + recorder in this container
      PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED: "false"
      # Enable the rest here (tune to your needs)
      PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED: "true"
      PREFECT_SERVER_SERVICES_FOREMAN_ENABLED: "true"
      PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED: "true"
      PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED: "true"
      PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED: "true"
      PREFECT_API_EVENTS_STREAM_OUT_ENABLED: "true"  # Distributor
      PREFECT_SERVER_ANALYTICS_ENABLED: "true"       # Telemetry
    command: >
      bash -lc "prefect server services start"
    restart: unless-stopped

volumes:
  pgdata:
How this works:
- prefect-api runs only the webserver (API/UI) via --no-services
- prefect-services-a and prefect-services-b both run prefect server services start, but you control which services are active via env vars
- All containers share the same Postgres database via PREFECT_API_DATABASE_CONNECTION_URL
Useful commands and tuning:
- See what services exist and the exact env var names:
- prefect server services list-services
- Start services in the foreground/background (in a container we run foreground):
- prefect server services start
- You can toggle services via env vars (true/false). Common ones:
- Scheduler: PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED
- TaskRunRecorder: PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED
- Late runs: PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED
- Triggers: PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED
- Event logger/persister: PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED, PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED
- Distributor: PREFECT_API_EVENTS_STREAM_OUT_ENABLED
- Telemetry: PREFECT_SERVER_ANALYTICS_ENABLED
Links:
- Self-hosted with Docker Compose
- prefect server CLI reference
- Settings reference (env vars)
Questions to confirm:
- Do you want the API/UI to live in one of the service containers instead of a separate prefect-api container? If so, we can collapse them, but keeping them separate is typically cleaner.
- Any other services you specifically want grouped with scheduler/recorder? I can tailor the env toggles to match.
Vincent Chéry
09/04/2025, 4:04 PM
What split of the services across containers would you recommend?
Marvin
09/04/2025, 4:05 PM
A practical split for most self-hosted setups is three containers:
Container A: API/UI only
- Command: prefect server start --no-services --host 0.0.0.0 --port 4200
- Purpose: Serves the REST API and UI, no background loops here
- Notes: You can toggle the UI with --ui/--no-ui if needed
Container B: Core scheduling services (time‑sensitive orchestration)
Enable:
- PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=true (Scheduler)
- PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=true (TaskRunRecorder)
- PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=true (MarkLateRuns)
- PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=true (FailExpiredPauses)
- PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=true (CancellationCleanup)
- PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=true (Foreman)
Disable here:
- PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=false
- PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false
- PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
- PREFECT_API_EVENTS_STREAM_OUT_ENABLED=false (Distributor)
- PREFECT_SERVER_ANALYTICS_ENABLED=false (Telemetry; optional preference)
Run: prefect server services start
Container C: Events and triggers (bursty/background)
Enable:
- PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=true (Proactive/Reactive/Actions)
- PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=true
- PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=true
- PREFECT_API_EVENTS_STREAM_OUT_ENABLED=true (Distributor)
- PREFECT_SERVER_ANALYTICS_ENABLED=true (Telemetry; optional)
Disable here:
- PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=false
- PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=false
- PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=false
- PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=false
- PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
- PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
Run: prefect server services start
Important notes
- Single instance per service: Ensure each service is enabled in exactly one container to avoid duplicate processing. The env toggles above enforce that.
- Database: All containers point to the same Postgres via PREFECT_API_DATABASE_CONNECTION_URL.
- Health and resilience: Use restart policies and healthchecks; if a services container fails, the other containers keep running.
- Scaling: These loops are designed to run as singletons. Scale by allocating more CPU/RAM to the appropriate container(s), not by running duplicates of the same service — see the compose sketch below.
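For the scaling note above, a rough compose sketch of this three-container layout with per-container caps — the service names, limits and the abbreviated env blocks are placeholders to adapt, not a drop-in file:
services:
  prefect-api:                        # Container A: API/UI only
    command: prefect server start --no-services --host 0.0.0.0 --port 4200
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1g
  prefect-services-core:              # Container B: scheduler, recorder, late runs, ...
    command: prefect server services start
    environment:
      PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED: "true"
      PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED: "false"
      # ...remaining toggles as listed for Container B above
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1g
  prefect-services-events:            # Container C: triggers, event logger/persister, ...
    command: prefect server services start
    environment:
      PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED: "false"
      PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED: "true"
      # ...remaining toggles as listed for Container C above
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1g
If the events container turns out to be the hotspot, raise its limit first; and as noted earlier, plain cpus:/mem_limit: keys work too if you're not using the deploy block.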
When to choose fewer containers
- Dev/small: 2 containers is fine
- API/UI
- All services together (enable everything in one)
- Medium/most prod: 3 containers (as above)
- Specialized/high‑throughput events: Keep 3, but give the “events and triggers” container extra CPU/memory
Handy commands and references
- List services and confirm env var names:
- prefect server services list-services
- CLI reference: Prefect server CLI
- Settings/env vars: Settings reference
- Docker Compose self‑hosting guide: Self-hosted with Docker Compose
If you share your workload characteristics (number of runs/day, use of triggers/events, expected spikes), I can tailor the exact split and sizing further.
Vincent Chéry
09/04/2025, 4:07 PM
Where does the Repossessor service fit in this split? It was mentioned earlier but I don't see it in the list.
Marvin
09/04/2025, 4:09 PM
In current Prefect 3.x builds the Repossessor is not exposed as a standalone, toggleable service:
- It is not listed by prefect server services list-services
- There's no toggle like PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED
- Its responsibilities are effectively covered by other orchestration loops (e.g., Foreman, CancellationCleanup, LateRuns, TaskRunRecorder)
So you don’t need to account for a separate “Repossessor” in your container split. If you’re following the 3-container recommendation, just place the nearest analogs in the “Core scheduling” container:
Core scheduling container should include:
- Scheduler (and RecentDeploymentsScheduler via the same flag)
- TaskRunRecorder
- LateRuns
- PauseExpirations
- CancellationCleanup
- Foreman
- If you’re on a custom/older build where Repossessor shows up, put it here too
To confirm what’s actually available in your build, run:
prefect server services list-services
References:
- Prefect server CLI
- Settings reference
Vincent Chéry
09/05/2025, 7:24 AM
The scheduler seems to be the service doing the most work. Can I reduce how much it does?
Marvin
09/05/2025, 7:28 AM
Yes — the scheduler is configured via settings prefixed with PREFECT_API_SERVICES_SCHEDULER_. The most relevant knobs:
- LOOP_SECONDS — how often the scheduler wakes up. Increase to reduce wakeups.
- DEPLOYMENT_BATCH_SIZE — how many deployments it evaluates per cycle. Decrease to reduce per-loop work.
- INSERT_BATCH_SIZE — how many flow runs it inserts per transaction. Decrease to reduce bursty DB work.
- MAX_RUNS — cap on how many runs the scheduler will create each cycle. Decrease to throttle work.
- MIN_SCHEDULED_TIME — how far into the past the scheduler will create runs (affects “catch-up”). Reduce to avoid backfilling.
- MAX_SCHEDULED_TIME — how far into the future to schedule runs. Reduce to shrink planning window.
- RECENT_DEPLOYMENTS_LOOP_SECONDS — how often the “recent deployments” sub-loop runs; increase to reduce churn shortly after creating/updating deployments.
Example settings you can try (docker-compose or k8s env):
PREFECT_API_SERVICES_SCHEDULER_LOOP_SECONDS=60
PREFECT_API_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=50
PREFECT_API_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=100
PREFECT_API_SERVICES_SCHEDULER_MAX_RUNS=50
PREFECT_API_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME=PT5M
PREFECT_API_SERVICES_SCHEDULER_MAX_SCHEDULED_TIME=P14D
PREFECT_API_SERVICES_SCHEDULER_RECENT_DEPLOYMENTS_LOOP_SECONDS=30
Notes and gotchas
- Backfill/catch-up: Prefect does not have a global “catchup=False” flag. By default, the scheduler will create runs for a time window that includes some recent past. If your server was down or deployments were just created with a start time far in the past, the scheduler can work hard to catch up. To minimize backfill:
- Set PREFECT_API_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME small (e.g., PT0S or PT5M).
- Start schedules "now" or slightly in the future when creating deployments so there aren't many historical occurrences to compute (see the sketch after these notes).
- If you don't need a schedule for some deployments, remove it (use NoSchedule) and trigger runs ad hoc.
- Schedule complexity: Many per-minute cron schedules or complex RRules across many deployments can increase CPU. If possible, prefer a slightly lower frequency or simpler schedules.
- Database: Using Postgres is recommended. SQLite will exacerbate scheduler work. Also ensure your DB has sufficient CPU/IO—some “scheduler CPU” can actually be time spent in DB calls.
- Single-core saturation: The scheduler is a single Python process; 100% of one core is “full” for it. On small machines, that can look like high overall CPU even if it’s just one core.
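To make the "start schedules now or slightly in the future" point concrete, a small prefect.yaml sketch with an interval schedule anchored just ahead of deploy time — the entrypoint is a placeholder and the field names follow IntervalSchedule, so verify them against the schedules docs for your version:
# prefect.yaml (sketch)
deployments:
  - name: frequent-sync
    entrypoint: flows/sync.py:sync            # hypothetical entrypoint
    schedules:
      - interval: 300                         # seconds between runs
        anchor_date: "2025-09-05T09:00:00Z"   # set to shortly after you deploy
With the anchor in the near future, there are no historical occurrences for the scheduler to reconcile when the deployment is created.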
Docs
- Settings reference (lists the scheduler env vars and defaults): Prefect 3 settings reference
- Scheduling concepts: Prefect 3 schedules
If you can share a bit more, I can tailor the guidance:
- Exact Prefect version
- Your DB backend (Postgres vs SQLite)
- A sample of your schedules (cron/interval and frequencies)
- Whether the high CPU is steady-state or only after restarts/downtime
With that, I can suggest concrete values for the env vars above and spot any schedule patterns that are causing extra work.