# ask-marvin
a
@Marvin is there a performance difference between A) process workers vs docker workers and B) many instances of deployments each with their own schedule vs one deployment with many schedules? I am testing ~80 schedules (ranging between 2-30min each) of a deployment of a flow and it seems to slow prefect's performance to a crawl
m
thought for 297.8 seconds
Great questions — and sorry you're seeing things "crawl." Short answer:

- A) Process vs Docker workers: Process workers have lower per-run overhead and will generally feel "snappier" for high-frequency/short-duration flows. Docker workers add container startup/image checks, which increases latency per run, but offer isolation and reproducibility. This does not affect the scheduler directly, but it can impact perceived throughput.
- B) Many deployments each with one schedule vs one deployment with many schedules: The scheduler iterates over deployments in batches, so one deployment with many schedules is typically more efficient than many deployments with one schedule each. However, the total number of future runs generated still matters — that's usually the real source of "crawl."

What's likely happening
With ~80 schedules at 2–30 minute intervals, Prefect's scheduler will try to keep a horizon of future runs available. By default, it maintains at least 1 hour of upcoming runs per deployment (and at least 3 runs), which can easily balloon to thousands of scheduled flow runs and stress your API/DB/UI.

Key knobs to tune in Prefect 3.x
You can significantly reduce load by tightening the scheduler horizon and batch sizes. These settings apply to Prefect Server; for Prefect Cloud use the same names but with the `PREFECT_API_*` prefix instead of `PREFECT_SERVER_*`.
- Loop cadence and batching
  - `PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS` (default 60)
  - `PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE` (default 100)
  - `PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE` (default 500)
- Horizon and run count guardrails
  - `PREFECT_SERVER_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME` (default 1h)
  - `PREFECT_SERVER_SERVICES_SCHEDULER_MAX_SCHEDULED_TIME` (default 100d)
  - `PREFECT_SERVER_SERVICES_SCHEDULER_MIN_RUNS` (default 3)
  - `PREFECT_SERVER_SERVICES_SCHEDULER_MAX_RUNS` (default 100)
- Recent deployments loop
  - `PREFECT_SERVER_SERVICES_SCHEDULER_RECENT_DEPLOYMENTS_LOOP_SECONDS` (default 5)

Docs:
- Schedules (concepts)
- Settings reference (see Server services -> Scheduler)

Recommended changes for your case
- Reduce the scheduler horizon and run counts:
  - Set a smaller horizon for high-frequency schedules (e.g., 5–10 minutes).
  - Lower min runs if your schedules are frequent (e.g., 1).
  - Cap max runs to a small number (e.g., 10–20).
- Disable catchup on these schedules so the scheduler doesn't try to create a backlog of past runs when it falls behind or on redeploy.
- If you have many deployments, consider consolidating schedules into fewer deployments (preferably one deployment with many schedules) to reduce per-loop work; see the sketch below.
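If it helps to see the shape of that last point, here is a minimal sketch of one deployment carrying many interval schedules, assuming a Prefect 3.x version where `deploy` accepts a `schedules` list. The repo URL, entrypoint, work pool name, and intervals are all placeholders for your own values.

```python
from datetime import timedelta

from prefect import flow
from prefect.client.schemas.schedules import IntervalSchedule

if __name__ == "__main__":
    # Placeholder source/entrypoint/work pool names; substitute your own.
    flow.from_source(
        source="https://github.com/your-org/your-repo",
        entrypoint="flows/my_flow.py:my_flow",
    ).deploy(
        name="consolidated",
        work_pool_name="my-process-pool",
        # One deployment carrying many interval schedules, rather than
        # many single-schedule deployments.
        schedules=[
            IntervalSchedule(interval=timedelta(minutes=m))
            for m in (2, 5, 10, 15, 30)
        ],
    )
```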
Example (Server) — put these in your profile or environment:
```
PREFECT_SERVER_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME=PT10M
PREFECT_SERVER_SERVICES_SCHEDULER_MIN_RUNS=1
PREFECT_SERVER_SERVICES_SCHEDULER_MAX_RUNS=10
# Optional: if you have many deployments and want the scheduler to keep up
PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200
# Optional: if you’re inserting many runs in bursts and your DB can handle it
PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000
```
For Prefect Cloud, use the same variables with `PREFECT_API_SERVICES_SCHEDULER_*` instead.

Worker choice (process vs Docker)
- Process worker
  - Pros: minimal startup overhead; best for lots of short/fast runs; higher perceived throughput on the same machine.
  - Cons: less isolation; you must manage Python/env deps on the machine.
- Docker worker
  - Pros: strong isolation, reproducible runtime, easy dependency management.
  - Cons: extra overhead to start containers/pull images; can be noticeable at 2–5 minute intervals with short flows.

Other considerations
- Database: If you're self-hosting and using SQLite, move to Postgres for this scale; the volume of scheduled runs and state changes can overwhelm SQLite quickly.
- Concurrency: Use global or work-pool concurrency to prevent a flood of simultaneous runs if that's contributing to API/DB load. See Global concurrency limits.
- Monitor scheduler logs/metrics: look for slow DB writes or large batches of created runs on each loop.

A few clarifying questions to help pinpoint the fix:
- Are you on Prefect Cloud or self-hosted Server? If Server, which DB backend?
- Approximately how long do individual runs take, and how "chatty" are they (logs, state updates)?
- Is the "crawl" showing up in UI responsiveness, API latency, worker start times, or all of the above?
- Do your schedules have `catchup` enabled?

If you share those details, I can suggest specific values for the scheduler settings and any DB tuning that'll get you back to quick operations.
a
@Marvin Self hosted, Postgres. Runs are 10s to 60s generally (but really should be 2-10sec). The UI is slow and the API is slow. Also I think fetching blocks etc is slow, and perhaps this cascades. I think catchup is enabled, as I see a lot of late runs
m
thought for 523.3 seconds
Thanks — that context helps a lot.

What's going on
- With ~80 frequent schedules, the scheduler is likely generating a large number of future flow runs. That balloons your `flow_runs` table and makes UI/API queries slow.
- "Late runs" indicate the scheduler created runs whose scheduled time has already passed (e.g., due to backlog). There isn't a `catchup=False` switch; you control this behavior by tightening the scheduler's horizon and by how you create schedules (anchor date), plus pausing/resuming schedules during changes.
- For very short runs (2–10s), Docker startup overhead is significant. Process workers will feel much faster. This doesn't fix API/UI slowness, but will improve end-to-end latency.

Immediate steps to stabilize

1) Pause schedules and stop creating more runs
- List schedules and pause them (repeat per schedule):
  - `prefect deployment schedule ls`
  - `prefect deployment schedule pause <deployment_name> <schedule_id>`
- Optionally set a temporary work pool concurrency limit to 0 to stop execution while you clean up:
  - `prefect work-pool set-concurrency-limit <pool_name> 0`

2) Reduce the scheduler's horizon and run counts (Server env)
Set these and restart the API:
```
PREFECT_SERVER_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME=PT5M
PREFECT_SERVER_SERVICES_SCHEDULER_MIN_RUNS=1
PREFECT_SERVER_SERVICES_SCHEDULER_MAX_RUNS=10
# Optional: if you have lots of deployments/schedules
PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200
# Optional: if inserts are efficient on your DB
PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000
```
Docs: Schedules, Settings reference

3) Recreate schedules "from now" so they don't backfill
- Clear schedules on the deployment:
  - `prefect deployment schedule clear <deployment_name>`
- Recreate schedules with an anchor date at or after now (prevents backfilling older intervals):
  - `prefect deployment schedule create <deployment_name> --interval 600 --anchor-date "2025-09-23T12:00:00Z"`
- Repeat per schedule with your desired interval and a recent anchor-date.
- Resume schedules or restore your work pool concurrency.

4) Switch to process workers for short jobs
- Process workers have minimal startup overhead and will help 2–10s flows hit your target runtimes.
- Docker workers add container/image overhead that's noticeable at 2–5 minute intervals and short runs.

5) Database pool sizing (Postgres)
Prefect's API defaults are conservative. Increase the SQLAlchemy connection pool for the server:
```
PREFECT_SERVER_DATABASE_POOL_SIZE=20
PREFECT_SERVER_DATABASE_MAX_OVERFLOW=40
PREFECT_SERVER_DATABASE_POOL_TIMEOUT=30
PREFECT_SERVER_DATABASE_POOL_RECYCLE=1800
```
After setting, restart the API. If your Postgres is under-provisioned for concurrent connections, increase its max_connections and ensure autovacuum is keeping up.

6) Reduce block/API chatter in short runs
- Load blocks once per run and reuse the object (avoid repeated `Block.load(...)` calls inside many short tasks); a sketch follows below.
- For the hottest secrets/creds, consider passing them via environment variables in your work pool's base job template instead of pulling blocks on every run.
- If possible, use provider-native auth (e.g., IAM/Workload Identity) to avoid any block fetch.
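As a rough illustration of that first bullet (the block type and block name here are hypothetical), load the block once at the top of the flow and pass the value into tasks, rather than calling `load` inside each task:

```python
from prefect import flow, task
from prefect.blocks.system import Secret


@task
def use_credential(value: str) -> None:
    # Do the real work with the already-loaded value here.
    print(len(value))


@flow
def short_flow() -> None:
    # Load the block once per flow run and reuse the value,
    # instead of calling Secret.load(...) inside every task.
    api_key = Secret.load("my-api-key").get()  # hypothetical block name
    for _ in range(10):
        use_credential(api_key)


if __name__ == "__main__":
    short_flow()
```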
Why "one deployment with many schedules" can still be heavy
- It is generally better than many deployments, but the main cost is the total number of future runs created. Tightening the scheduler horizon and min/max run counts is what actually reduces load.

Optional: Late runs service
If you see excessive chatter from "late runs" handling, you can tune it:
```
PREFECT_SERVER_SERVICES_LATE_RUNS_LOOP_SECONDS=5
PREFECT_SERVER_SERVICES_LATE_RUNS_AFTER_SECONDS=PT30S
```
See settings: Settings reference

A couple of quick questions so I can give you concrete values:
- Roughly how many scheduled (future) flow runs do you see at peak? If you can, check the count of SCHEDULED runs in Postgres.
- How many concurrent workers are polling the work pool?
- Are blocks being fetched multiple times per run, or just once at startup?

If you prefer, I can suggest specific final values for the scheduler and DB settings based on your answers.
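If it's easier to get that count from the API than from Postgres directly, here is a rough sketch using the Prefect client (paged so each request stays under the server's default API limit); a plain SQL count against the `flow_runs` table works just as well.

```python
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType

# Filter for flow runs currently in the SCHEDULED state.
SCHEDULED_FILTER = FlowRunFilter(
    state=FlowRunFilterState(
        type=FlowRunFilterStateType(any_=[StateType.SCHEDULED])
    )
)


async def count_scheduled_runs(page_size: int = 200) -> int:
    # Page through SCHEDULED runs; keep page_size at or below the API default limit.
    total, offset = 0, 0
    async with get_client() as client:
        while True:
            runs = await client.read_flow_runs(
                flow_run_filter=SCHEDULED_FILTER,
                limit=page_size,
                offset=offset,
            )
            total += len(runs)
            if len(runs) < page_size:
                return total
            offset += page_size


if __name__ == "__main__":
    print(asyncio.run(count_scheduled_runs()))
```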
a
@Marvin I made those env var changes. With 60-80 upcoming flows things seem ok, but I have hit 140 and I am seeing late runs and the UI getting sluggish
m
thought for 65.4 seconds
Thanks for the update — that's a useful signal. If things are smooth around ~60–80 upcoming runs but get sluggish at ~140, you're likely hitting a throughput threshold across a few layers (scheduler loop, API queries/DB, workers). Here's how I'd push the threshold higher and reduce "late runs":

Scheduler: tighten further and speed the loop
- Reduce horizon and run counts again (aim for just-in-time scheduling):
  - `PREFECT_SERVER_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME=PT2M`
  - `PREFECT_SERVER_SERVICES_SCHEDULER_MIN_RUNS=1`
  - `PREFECT_SERVER_SERVICES_SCHEDULER_MAX_RUNS=5`
- Make the scheduler loop more responsive:
  - `PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10`
- Keep batch sizes tuned up if your DB can handle it:
  - `PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200`
  - `PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000`
- If you see a lot of "late runs" churn, dampen late-run handling noise:
  - `PREFECT_SERVER_SERVICES_LATE_RUNS_AFTER_SECONDS=PT60S`
Docs: Schedules, Settings reference

Make sure schedules don't backfill
- Clear and recreate schedules with a recent `--anchor-date` so only future runs get created:
  - `prefect deployment schedule clear <deployment_name>`
  - `prefect deployment schedule create <deployment_name> --interval 120 --anchor-date "<recent-UTC-ISO8601>"`
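If you (re)deploy from Python instead of the CLI, the same "start from now" idea can be expressed by passing an explicit `anchor_date` when building the schedules. This is a sketch with placeholder source, entrypoint, and work pool names, assuming a Prefect 3.x version where `deploy` accepts a `schedules` list.

```python
from datetime import datetime, timedelta, timezone

from prefect import flow
from prefect.client.schemas.schedules import IntervalSchedule

# Anchor every interval schedule at "now" so no past intervals are generated.
now = datetime.now(timezone.utc)

if __name__ == "__main__":
    flow.from_source(
        source="https://github.com/your-org/your-repo",  # placeholder
        entrypoint="flows/my_flow.py:my_flow",           # placeholder
    ).deploy(
        name="my-deployment",
        work_pool_name="my-process-pool",                # placeholder
        schedules=[
            IntervalSchedule(interval=timedelta(seconds=120), anchor_date=now),
            IntervalSchedule(interval=timedelta(seconds=600), anchor_date=now),
        ],
    )
```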
Increase execution throughput (reduce late runs)
- Use process workers for these short runs (2–10s) — they're much snappier than Docker workers.
- Scale out workers so total parallel capacity exceeds your per-minute arrival rate of runs. If you're seeing late runs when "upcoming" hits ~140, your workers are probably not keeping up burst-wise.
- Reduce per-run overhead:
  - Load blocks once per run and reuse; avoid repeated `Block.load(...)` calls in tasks.
  - For the hottest secrets/creds, consider env vars via your work pool's base job template instead of block fetches on every run.
- Reduce app logs for these very short runs (less API chatter).

API/UI and DB tuning
- Postgres pool (server-side):
  - `PREFECT_SERVER_DATABASE_POOL_SIZE=20`
  - `PREFECT_SERVER_DATABASE_MAX_OVERFLOW=40`
  - `PREFECT_SERVER_DATABASE_POOL_TIMEOUT=30`
  - `PREFECT_SERVER_DATABASE_POOL_RECYCLE=1800`
- Ensure Postgres has enough CPU/IO, autovacuum is keeping up, and table/index bloat is under control, especially on `flow_runs`. If the table has grown large from earlier backfill, consider purging old SCHEDULED/late runs you don't need.
- Optionally reduce the default API list size (smaller responses can help UI snappiness):
  - `PREFECT_SERVER_API_DEFAULT_LIMIT=100` (default is 200)
Quick cleanup of old scheduled runs (optional)
If you want to quickly shrink "noise" and DB pressure while tuning:
- Pause schedules:
  - `prefect deployment schedule ls`
  - `prefect deployment schedule pause <deployment_name> <schedule_id>`
- Delete stale SCHEDULED runs older than a window (e.g., >30 minutes ago) with a small Python script using the Prefect client. I can share a snippet if you want.
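For reference, a rough sketch of that cleanup script using the Prefect client (it permanently deletes the matching runs; adjust the cutoff and filters to your needs):

```python
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterExpectedStartTime,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType


async def delete_stale_scheduled_runs(older_than_minutes: int = 30) -> int:
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=older_than_minutes)
    stale_filter = FlowRunFilter(
        state=FlowRunFilterState(
            type=FlowRunFilterStateType(any_=[StateType.SCHEDULED])
        ),
        # Only runs whose scheduled start time is already well in the past.
        expected_start_time=FlowRunFilterExpectedStartTime(before_=cutoff),
    )
    deleted = 0
    async with get_client() as client:
        while True:
            runs = await client.read_flow_runs(flow_run_filter=stale_filter, limit=200)
            if not runs:
                return deleted
            for run in runs:
                await client.delete_flow_run(run.id)
                deleted += 1


if __name__ == "__main__":
    print(asyncio.run(delete_stale_scheduled_runs()))
```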
Structural option (biggest impact if feasible)
- If many of those 80 schedules are just parameter variants of the same job, consider:
  - A single high-frequency "launcher" schedule that computes the 80 variants and runs them in-process (tasks) or spawns subflows programmatically — this reduces the number of scheduler-managed runs dramatically. Only do this if you don't need totally independent history/concurrency per variant.

A couple of checks/questions to fine-tune next steps
- Are you running process workers now? How many concurrent flow runs can your workers handle at peak?
- Rough size of the `flow_runs` table and indexes? Any recent growth from earlier backfill?
- How chatty are runs (log volume)? If high, we can adjust logging:
  - `PREFECT_LOGGING_TO_API_BATCH_INTERVAL=5`
  - Reduce the log level for these flows.
If you share your current scheduler env values, worker count/type, and a rough p95 run time and DB CPU during peak, I’ll propose a concrete config (including whether to push loop_seconds further down or adjust batch sizes).
n
@Anthony Goddard didn't read the whole thread in detail, but are you running redis for messaging?
a
@Nate sorry for slow (timezone) reply 🥴 thanks for checking on this, I'm a new user and I've been digging into everything I can w/ Prefect recently. The 👀 on user questions are massively appreciated! 🙏 I am not running redis, but can - I am just tackling this again now. I dropped my scheduled runs down to about 70 (from 140) and things seem stable but a little sluggish, the odd thing is that neither the prefect server host, nor the worker, appear to be breaking a sweat when things start getting slow. I'll switch to redis and revisit some of the other steps in the doc, and I'm going to try to instrument the API calls so I can see where the delays come from.
n
> I'm a new user and I've been digging into everything I can w/ Prefect recently
welcome! i mention redis for messaging as it's one of the things you'd eventually have to do to scale a prefect server installation! let me know if you develop any questions on that!
a
@Nate thanks! I just switched to redis and keeping an eye on it now. fyi I ran into this https://github.com/PrefectHQ/prefect/issues/18001 after changing one of the config settings Marvin suggested above (`PREFECT_SERVER_API_DEFAULT_LIMIT`)
@Nate 24+hrs and ~10k runs later and it's working great with redis. going to slowly ramp up and also split off API/server nodes
@Marvin what's the downside to just-in-time scheduling, like `PREFECT_SERVER_SERVICES_SCHEDULER_MIN_SCHEDULED_TIME=PT2M` and `PREFECT_SERVER_SERVICES_SCHEDULER_MIN_RUNS=1`?
m
thought for 112.8 seconds
Great question. Those settings effectively push you toward "just-in-time" scheduling: only one future run is created, and not until it's within ~2 minutes of its scheduled time. That can reduce DB clutter, but there are some trade-offs.

Downsides of JIT scheduling
- Less resilience to scheduler/API hiccups: With only a 2-minute buffer and a single future run, any scheduler outage or delay near run time can result in delayed starts or missed SLAs. A larger look-ahead and more pre-created runs give you runway to ride out brief outages.
- Reduced UI visibility: You'll see very few "Upcoming" runs in the UI, which makes it harder to plan, review, or bulk-manage future runs.
- More start-time jitter: Workers poll on an interval and claim runs when they become due. If runs are created very close to their due time, small polling delays, clock skew, or network/DB latency can lead to runs starting a bit late more often.
- Less lead time for autoscaling: If you scale workers or infrastructure based on pending/scheduled workload, the system has very little time to spin up capacity before runs are due, increasing cold-start latency.
- Catch-up/backfill sensitivity: If the scheduler is down past the small window — or your schedules don't allow catch-up — missed intervals may not be created later. A larger window helps ensure you have some already-created runs when the scheduler returns.
- Fewer data points for operational controls: Any processes or policies that benefit from seeing more future runs (e.g., planning maintenance windows, forecasting load) will have less to work with.

If you need strict timeliness and better resilience, consider a slightly larger look-ahead window and a handful of future runs for critical deployments, while keeping JIT for dev or low-priority workloads.

Docs
- Schedules concepts
- Settings reference

If you can share whether you're self-hosting vs. Cloud, your worker poll intervals, and whether you autoscale, I can suggest concrete values that balance timeliness and resilience.