Jashaul
10/13/2025, 5:15 PM
Marvin
10/13/2025, 5:15 PM
Marvin
10/13/2025, 5:21 PM
- Disable background services on the “api-only” pods (they stay enabled on the “api-services” pod):
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=false
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=false
PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=false
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=false
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=false
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=false
PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
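If it helps to see the shape, here is a minimal sketch of an “api-only” Deployment carrying those env vars. The name, image tag, command, replica count, and secret reference are assumptions to adapt to your Helm values or manifests, not something from this thread:
# Hypothetical “api-only” Deployment: serves the REST API with background services disabled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-api-only
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prefect-api-only
  template:
    metadata:
      labels:
        app: prefect-api-only
    spec:
      containers:
        - name: api
          image: prefecthq/prefect:3-latest        # pin a specific tag in practice
          command: ["prefect", "server", "start", "--host", "0.0.0.0"]
          env:
            - name: PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED
              value: "false"
            - name: PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED
              value: "false"
            # ...repeat for the other PREFECT_SERVER_SERVICES_*_ENABLED settings above
            - name: PREFECT_API_DATABASE_CONNECTION_URL
              valueFrom:
                secretKeyRef:
                  name: prefect-db                  # assumed secret name
                  key: connection-url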
- Tune scheduler throughput if needed (on the “api-services” pod). For lower latency and higher throughput, consider:
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200
PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000
Note: Reducing loop seconds and increasing batch sizes increases DB load. Monitor your DB.
- Settings reference: Prefect settings reference
- Resources: start with requests/limits around 500m–1 CPU and 1–2 GiB RAM per API pod; scale up if API p95 latency or DB CPU climbs. (A requests/limits sketch follows this list.)
- Database: your Postgres is the critical bottleneck for high volume. Use a managed Postgres with sufficient CPU/RAM and a connection pooler (e.g., PgBouncer). Scale DB before scaling API if you see DB saturation.
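As a concrete starting point for the resource guidance above, a small requests/limits sketch for the API container; the numbers are just the ranges mentioned, not a tested recommendation:
# Example requests/limits for one API pod (500m–1 CPU, 1–2 GiB RAM).
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"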
Workers scaling
- Run multiple worker pods per work pool for redundancy and throughput. Start with 2–3 and scale up via HPA.
- Use HPA on workers. CPU-based is simple and works well if your workers are busy polling/launching jobs; custom metrics (e.g., scheduled/backlog runs) are even better if you can add them (a custom-metric sketch follows the example below).
- Example HPA for workers:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefect-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefect-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
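If you do export a backlog metric, the HPA can target it instead of CPU. A hedged sketch: the metric name below is an assumption, and you would need an adapter (e.g., prometheus-adapter or KEDA) to expose it through the Kubernetes external metrics API:
# Replace the CPU metric above with an external metric (illustrative values only).
metrics:
  - type: External
    external:
      metric:
        name: prefect_scheduled_flow_runs   # assumed metric name exported by your monitoring stack
      target:
        type: AverageValue
        averageValue: "10"                  # aim for ~10 scheduled runs per worker replica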
- Concurrency control: Cap how many flow runs can run at once per work pool so your cluster and downstream systems aren’t overwhelmed.
- CLI (verified) to set pool-level concurrency:
prefect work-pool set-concurrency-limit <pool-name> <limit>
- For task/flow-level protection across deployments, use global concurrency limits: Global concurrency limits
- Job template hygiene:
  - Ensure Kubernetes Jobs have ttlSecondsAfterFinished to clean up Pods:
    spec:
      ttlSecondsAfterFinished: 300
  - Size job-level CPU/memory to the workload; use node autoscaling and separate node pools if needed. (A fuller Job sketch follows this list.)
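Putting those two hygiene points together, a sketch of the relevant Job fields; the image and resource numbers are placeholders, not values from this thread:
# Illustrative Job fragment: automatic cleanup plus right-sized resources.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: flow-run-
spec:
  ttlSecondsAfterFinished: 300          # delete the Job and its Pod 5 minutes after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flow
          image: your-flow-image:tag    # placeholder
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"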
End-to-end scaling strategy
Marvin
10/13/2025, 5:21 PM
- Lower LOOP_SECONDS and raise batch sizes if you have backlog and the DB can take it
Helpful docs
- Server Helm and configuration: Server Helm deployment guide
- Settings reference (service env vars, scheduler tuning): Settings reference
- Global concurrency limits: Global concurrency limits
A couple of questions to tailor this:
- Roughly how many flow runs per hour/day are you targeting and what’s the average/95th flow runtime?
- Are flows CPU- or memory-heavy, and do they interact with rate-limited services?
- Are you running one or multiple work pools/queues?
If you can share these, I can suggest concrete replica counts, pool limits, and initial CPU/memory sizing.
Jashaul
10/13/2025, 5:42 PM
Marvin
10/13/2025, 5:42 PM
Marvin
10/13/2025, 5:46 PM
- Disable background services on the “api-only” pods (they stay enabled on the “api-services” pod):
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=false
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=false
PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=false
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=false
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=false
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=false
PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
- If you see scheduling backlog, you can carefully increase scheduler aggressiveness on the “api-services” pod (watch DB load):
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200
PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000
- DB
- Use managed Postgres with PgBouncer. Plan for high write rate (states, events, optional logs). Start around 8 vCPU / 32 GB RAM, then scale by observed CPU/IOPS.
- Keep DB close to the API (low latency) and monitor connections, locks, and write IOPS.
- Logging to API
- For this volume, sending logs to API can dominate DB writes. Consider turning it off or raising thresholds:
PREFECT_LOGGING_TO_API_ENABLED=false
- Or at least set your flow logger to WARNING/ERROR only (a sketch follows this list).
- Docs: Server Helm deployment and Settings reference
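A short sketch of those two logging knobs as container env for the flow-run pods; this assumes PREFECT_LOGGING_LEVEL is the threshold you want to raise, and the values are examples to adjust:
# Example env: stop shipping logs to the API and raise the local log threshold.
env:
  - name: PREFECT_LOGGING_TO_API_ENABLED
    value: "false"
  - name: PREFECT_LOGGING_LEVEL
    value: "WARNING"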
Work pool sizing and concurrency caps
- Target concurrency = ~210; add headroom → set pool concurrency to ~250–300:
prefect work-pool set-concurrency-limit <your-pool-name> 300
- Because you call an external API, add a tag-based concurrency limit to protect that dependency. Tag the API-calling tasks/flows and cap the limit to the provider’s limits:
prefect concurrency-limit create external-api <N>
Then tag your flows/tasks with tags=["external-api"].
Docs: Global concurrency limits
If using a Kubernetes work pool (Job per flow run)
- Reality check: 30s jobs at 7/sec means ~25k Jobs/hour. This is heavy on the Kubernetes control plane (API server, scheduler, etcd). You can make it work, but watch for control-plane throttling.
- Worker replicas: start 10–20 workers and scale up until your “Scheduled” backlog stays near zero and run start latency is acceptable. Use HPA on workers:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefect-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefect-worker
  minReplicas: 10
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
- Job spec hygiene to reduce churn:
  - ttlSecondsAfterFinished: 300 for cleanup
  - Pre-pull images on nodes (DaemonSet or warm image cache), imagePullPolicy: IfNotPresent
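One common pre-pull pattern is a DaemonSet that pulls the flow image onto every node and then idles. A hedged sketch; the image names are placeholders and it assumes the flow image has a shell:
# Hypothetical image pre-warm DaemonSet: the init container forces the image into
# each node's cache, the main container just keeps the pod alive.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prewarm-flow-image
spec:
  selector:
    matchLabels:
      app: prewarm-flow-image
  template:
    metadata:
      labels:
        app: prewarm-flow-image
    spec:
      initContainers:
        - name: pull-flow-image
          image: your-flow-image:tag            # placeholder: the image your flow runs use
          command: ["/bin/sh", "-c", "true"]    # no-op; assumes the image has a shell
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9      # tiny container that keeps the DaemonSet pod running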
Marvin
10/13/2025, 5:46 PM
Jashaul
10/13/2025, 5:55 PM
Marvin
10/13/2025, 5:55 PM
Marvin
10/13/2025, 5:56 PM
- Disable background services on the “api-only” pods (they stay enabled on the “api-services” pod):
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=false
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=false
PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=false
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=false
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=false
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=false
PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
- Optional scheduler tuning on “api-services” if you see a scheduling backlog (watch DB load):
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200
PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000
- DB
- Managed Postgres + PgBouncer. Start ~8 vCPU/32GB RAM; scale by CPU/IOPS and locks.
- Latency to API should be low. Logs can dominate writes at this volume, so consider turning off logs to API globally:
PREFECT_LOGGING_TO_API_ENABLED=false
- Docs: Server Helm and Settings reference
Work pool and worker pods (Kubernetes type)
- Pool concurrency limit: start at 300 (target is ~210 concurrent, with headroom)
prefect work-pool set-concurrency-limit <your-pool> 300
- Worker replicas: start with 20 workers; use HPA to scale to 60–100 based on CPU. Scale until your “Scheduled” backlog stays near zero and start latency is acceptable.
- Example HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefect-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefect-worker
  minReplicas: 20
  maxReplicas: 80
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
- Worker polling knobs (only if pick-up latency/backlog persists; increases API/DB load):
PREFECT_WORKER_QUERY_SECONDS=5
PREFECT_WORKER_PREFETCH_SECONDS=30
- Base job template hygiene to reduce churn (a pod-spec sketch follows this list):
  - Ensure cleanup: spec.ttlSecondsAfterFinished: 300
  - Use small resource requests for flow pods (e.g., 50–100m CPU, 128–256Mi)
  - `imagePullPolicy: IfNotPresent`; pre-pull images (DaemonSet) or use an always-warm, small image
  - Set terminationGracePeriodSeconds: 5–10 to speed teardown
- Flow/worker logging
- Keep flow logs minimal (WARNING/ERROR) or rely on external logging. The API/DB will be your bottleneck if every 30s run emits lots of logs.
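For the pod-level items in that list, a sketch of how the fields sit in the flow-run pod template; the numbers mirror the ranges above and the image is a placeholder:
# Illustrative pod template fragment for short-lived flow-run pods.
spec:
  terminationGracePeriodSeconds: 10       # fast teardown for ~30s flows
  containers:
    - name: flow
      image: your-flow-image:tag          # placeholder; keep it small and pre-pulled
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"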
Kubernetes control plane considerations (critical at 25k Jobs/hour)
- Ensure your cluster’s API server, scheduler, and etcd are sized to handle ~7 Job creates/sec and ~7 Pod terminations/sec.
- Keep per-node pod density reasonable; short-lived pods can stress kubelet. For 210 concurrent pods, use several nodes and avoid pushing node pod limits.
- Use Cluster Autoscaler to add nodes quickly; use a node pool with pre-pulled images if possible.
Optional safety valves
Marvin
10/13/2025, 5:56 PM
- Cap calls to the external API with a tag-based concurrency limit:
prefect concurrency-limit create external-api <cap>
Then tag the relevant tasks/flows external-api.
- If you continue to hit control-plane bottlenecks, consider:
- Aggregating micro-flows into slightly larger batches
- Switching to a Process work pool for these short runs (no per-run pod churn)
Quick summary to try now
- Server: 1 “api-services” + 3 “api-only”, logs to API disabled, scheduler loop 10s if needed
- DB: Managed Postgres + PgBouncer, start ~8 vCPU/32GB
- Work pool: concurrency 300
- Workers: 20 replicas, HPA to 80
- Job template: ttlSecondsAfterFinished=300, small resources, pre-pulled images
If you’d like, I can share example Deployment/Service YAMLs for:
- “api-services” and “api-only” deployments (with the env vars above)
- Worker Deployment + HPA
- A base job template snippet for your Kubernetes work pool
Also, let me know your cluster size and Kubernetes version—control-plane capacity can change the recommended worker count significantly.