# marvin-ai
@Marvin I have prefect deployment on kubernetes. What should my server pod and worker pod scaling be for high volume flow execution?
thought for 372.0 seconds
Great question. Short answer: there isn't a one-size-fits-all number. You'll want to scale your API (server) and workers independently based on load, with Kubernetes HPA and Prefect concurrency controls providing the guardrails. Here's a proven pattern for high-volume setups in Prefect 3.x on Kubernetes.

**Assumptions**
- You're self-hosting Prefect Server 3.x on Kubernetes and using Kubernetes workers.

**Server (API) scaling**
- Replicas: run 2–3 API replicas for HA. The API is stateless; HA concerns are mostly the database and background services.
- Background services: for safety and predictable load, run only one replica with services enabled (scheduler, foreman, event persister, etc.), and disable services on the other API replicas. This avoids duplicate scheduling and reduces DB contention.
- Create two deployments that use the same image:
  - "api-only" (2–3 replicas): services disabled
  - "api-services" (1 replica): services enabled
- Example env vars to disable services on the "api-only" deployment (a Deployment sketch follows the list of vars):
```
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=false
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=false
PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=false
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=false
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=false
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=false
PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
```
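For reference, here's a minimal sketch of what an "api-only" Deployment could look like. The names, image tag, Secret, and probe are placeholders/assumptions, not the official manifest; the server Helm chart can generate an equivalent for you:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-api-only                      # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prefect-api-only
  template:
    metadata:
      labels:
        app: prefect-api-only
    spec:
      containers:
      - name: api
        image: prefecthq/prefect:3-latest     # pin a specific tag in practice
        command: ["prefect", "server", "start", "--host", "0.0.0.0"]
        ports:
        - containerPort: 4200
        envFrom:
        - secretRef:
            name: prefect-server-config       # placeholder: DB connection URL and other settings
        env:
        # services disabled on these replicas (same vars as listed above)
        - name: PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED
          value: "false"
        - name: PREFECT_SERVER_SERVICES_FOREMAN_ENABLED
          value: "false"
        # ...repeat for the remaining PREFECT_SERVER_SERVICES_*_ENABLED vars above...
        readinessProbe:
          httpGet:
            path: /api/health
            port: 4200
        resources:
          requests: {cpu: 500m, memory: 1Gi}
          limits: {cpu: "1", memory: 2Gi}
```

The "api-services" Deployment is the same minus these env overrides and with `replicas: 1`; both sets of pods can sit behind the same Service so clients see one API endpoint.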
- Tune scheduler throughput if needed (on the “api-services” pod). For lower latency and higher throughput, consider:
```
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200
PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000
```
Note: reducing loop seconds and increasing batch sizes increases DB load. Monitor your DB.
- Settings reference: Prefect settings reference
- Resources: start with requests/limits around 500m–1 CPU and 1–2 GiB RAM per API pod; scale up if API p95 latency or DB CPU climbs.
- Database: your Postgres is the critical bottleneck for high volume. Use a managed Postgres with sufficient CPU/RAM and a connection pooler (e.g., PgBouncer). Scale the DB before scaling the API if you see DB saturation.

**Workers scaling**
- Run multiple worker pods per work pool for redundancy and throughput. Start with 2–3 and scale up via HPA.
- Use HPA on workers. CPU-based is simple and works well if your workers are busy polling/launching jobs; custom metrics (e.g., scheduled/backlog runs) are even better if you can add them.
- Example HPA for workers (a sketch of the worker Deployment it targets follows the manifest):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefect-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefect-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```
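And a minimal sketch of the worker Deployment the HPA above targets. The pool name, image, service account, and API Service name are placeholders (the worker Helm chart can template an equivalent):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-worker                        # must match the HPA's scaleTargetRef
spec:
  replicas: 2                                 # the HPA takes over once it is active
  selector:
    matchLabels:
      app: prefect-worker
  template:
    metadata:
      labels:
        app: prefect-worker
    spec:
      serviceAccountName: prefect-worker      # placeholder; needs RBAC to create/watch Jobs
      containers:
      - name: worker
        image: prefecthq/prefect:3-latest     # placeholder; image must have prefect-kubernetes installed
        command: ["prefect", "worker", "start", "--pool", "my-k8s-pool"]   # placeholder pool name
        env:
        - name: PREFECT_API_URL
          value: http://prefect-api:4200/api  # placeholder Service name
        resources:
          requests: {cpu: 250m, memory: 512Mi}
          limits: {cpu: "1", memory: 1Gi}
```

A single worker can launch many jobs; extra replicas mainly add submission throughput and redundancy, which is what the HPA scales.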
- Concurrency control: cap how many flow runs can run at once per work pool so your cluster and downstream systems aren't overwhelmed.
- CLI (verified) to set pool-level concurrency:
```
prefect work-pool set-concurrency-limit <pool-name> <limit>
```
- For task/flow-level protection across deployments, use global concurrency limits: Global concurrency limits
- Job template hygiene:
  - Ensure Kubernetes Jobs have `ttlSecondsAfterFinished` to clean up Pods (a fuller Job manifest sketch follows this snippet):
```yaml
spec:
  ttlSecondsAfterFinished: 300
```
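For orientation, a minimal sketch of a flow-run Job showing where that field sits (at the Job spec level, not the pod spec). The names, image, and resource numbers are illustrative; in practice you'd set these through the work pool's base job template rather than writing Jobs by hand:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-flow-run                      # illustrative; the worker names real Jobs itself
spec:
  ttlSecondsAfterFinished: 300                # Job and its Pods deleted 5 minutes after completion
  backoffLimit: 0                             # let Prefect handle retries instead of Kubernetes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: prefect-job                     # the worker injects the actual flow-run command
        image: my-flow-image:latest           # placeholder
        imagePullPolicy: IfNotPresent         # avoid re-pulling the image for every short run
        resources:
          requests: {cpu: 100m, memory: 256Mi}
          limits: {cpu: 500m, memory: 512Mi}
```

Small requests matter at high concurrency: for example, 200 concurrent pods at 100m/256Mi request roughly 20 vCPU and 50 GiB across the node pool.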
  - Size job-level CPU/memory to the workload; use node autoscaling and separate node pools if needed.

**End-to-end scaling strategy**
1) Start small, measure, then scale:
   - API: 2–3 "api-only" + 1 "api-services"
   - Workers: 2–3 per work pool with HPA up to 10–20
   - DB: provision for expected write/read throughput; add PgBouncer
2) Cap concurrency:
   - Work pool concurrency limits to match cluster capacity
   - Global concurrency limits for shared resources
3) Autoscale:
   - Worker HPA drives job launch throughput
   - Cluster autoscaler provides nodes as jobs scale
4) Tune scheduler only if needed:
   - Lower `LOOP_SECONDS` and raise batch sizes if you have backlog and the DB can take it

**Helpful docs**
- Server Helm and configuration: Server Helm deployment guide
- Settings reference (service env vars, scheduler tuning): Settings reference
- Global concurrency limits: Global concurrency limits

A couple of questions to tailor this:
- Roughly how many flow runs per hour/day are you targeting, and what's the average/95th-percentile flow runtime?
- Are flows CPU- or memory-heavy, and do they interact with rate-limited services?
- Are you running one or multiple work pools/queues?

If you can share these, I can suggest concrete replica counts, pool limits, and initial CPU/memory sizing.
@Marvin I am targeting 25000 flow runs per hour, with average flow runtime of 30sec. Flow is not CPU or Memory heavy but interacts with external service via API. All these flow runs are handled by one work pool with default queue
thought for 229.9 seconds
Ambitious target! 25,000 flow runs/hour at ~30s each is about 7 runs/sec and ~210 concurrent runs on average. Here's how I'd size and tune for that in Prefect 3.x on Kubernetes, plus where to add safety valves for your external API.

First, a key decision: what work pool type are you using?
- Kubernetes work pool (each flow run is a new Kubernetes Job/Pod)
- Process work pool (flows execute inside long‑running worker pods)

For 30-second flows, Process workers are generally more efficient because they avoid per-run pod churn. If you must use Kubernetes Jobs, you'll need to keep the control plane happy and limit logs.

**Server (API) pods**
- Replicas
  - 1 "api-services" pod (scheduler/foreman/etc. enabled)
  - 2–4 "api-only" pods (services disabled) behind a Service for HA
- Disable services on "api-only" pods to prevent duplicate scheduling and reduce DB load:
```
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=false
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=false
PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=false
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=false
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=false
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=false
PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
```
- If you see scheduling backlog, you can carefully increase scheduler aggressiveness on the “api-services” pod (watch DB load):
```
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200
PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000
```
- DB
  - Use managed Postgres with PgBouncer. Plan for a high write rate (states, events, optional logs). Start around 8 vCPU / 32 GB RAM, then scale by observed CPU/IOPS.
  - Keep the DB close to the API (low latency) and monitor connections, locks, and write IOPS.
- Logging to API
  - For this volume, sending logs to the API can dominate DB writes. Consider turning it off or raising thresholds (a sketch of where to set this follows the snippet):
```
PREFECT_LOGGING_TO_API_ENABLED=false
```
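Note that this setting has to reach the flow-run pods, since the flow-run process is what ships logs to the API. One place to set it is per deployment via `job_variables`; a minimal `prefect.yaml` sketch, with the deployment, entrypoint, and pool names as placeholders:

```yaml
deployments:
- name: high-volume-flow                      # placeholder
  entrypoint: flows/my_flow.py:my_flow        # placeholder
  work_pool:
    name: my-k8s-pool                         # placeholder
    job_variables:
      env:
        PREFECT_LOGGING_TO_API_ENABLED: "false"
```

Setting the same env var in the work pool's base job template instead applies it to every deployment on that pool.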
  - Or at least set your flow logger to WARNING/ERROR only.
- Docs: Server Helm deployment and Settings reference

**Work pool sizing and concurrency caps**
- Target concurrency ≈ 210; add headroom → set pool concurrency to ~250–300:
```
prefect work-pool set-concurrency-limit <your-pool-name> 300
```
- Because you call an external API, add a Global Concurrency Limit (GCL) to protect that dependency. Tag the API-calling tasks/flows and cap to the provider’s limits:
```
prefect concurrency-limit create external-api <N>
```
  Then tag your flow/tasks with `tags=["external-api"]`. Docs: Global concurrency limits

**If using a Kubernetes work pool (Job per flow run)**
- Reality check: 30s jobs at 7/sec means ~25k Jobs/hour. This is heavy on the Kubernetes control plane (API server, scheduler, etcd). You can make it work, but watch for control-plane throttling.
- Worker replicas: start with 10–20 workers and scale up until your "Scheduled" backlog stays near zero and run start latency is acceptable. Use HPA on workers:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefect-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefect-worker
  minReplicas: 10
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```
- Job spec hygiene to reduce churn:
  - `ttlSecondsAfterFinished: 300` for cleanup
  - Pre-pull images on nodes (DaemonSet or warm image cache), `imagePullPolicy: IfNotPresent`
  - Reasonable requests/limits to avoid scheduling thrash
- Expect to tune up to dozens of workers if individual worker launch throughput is your bottleneck; measure and scale.

**If using a Process work pool (recommended for short runs)**
- Run long‑lived worker pods that execute flows in‑process; no per-run pod creation.
- To sustain ~250 concurrent flows, run ~250 worker pods (each handles one flow at a time). This is many pods, but they're stable and far easier on the control plane than 25k Jobs/hour.
- Use a Deployment with HPA or manual scaling; the pool concurrency limit (300) still caps total concurrency.
- This pattern typically yields lower latency and higher reliability for short flows.

**External API protection**
- Add backoff/retries on API calls.
- Set a tag-based GCL to the provider's concurrent/requests-per-second limits as above.
- Consider sharding runs by multiple tags/limits if the API has different rate-limit buckets.

**Monitoring and scale-up signals**
- Prefect metrics to watch: number of Scheduled runs, average time to start, API p95 latency, DB CPU/IOPS, and worker error rates.
- Kubernetes: API server throttling, scheduler queue depth, etcd latency; Pod start latency if using Jobs.
- Scale API pods and the DB when API latency grows; scale workers when the scheduled backlog grows; raise pool concurrency only when the cluster and external API can handle more.

**Quick start recommendation for your target**
- Server
  - 1 api-services pod + 3 api-only pods; HPA on api-only to 5–6 if needed
  - Postgres ~8 vCPU/32 GB + PgBouncer; disable logs to API or set WARNING+
- Concurrency
  - Work pool concurrency limit: 300
  - GCL tag for external API: set to a provider-safe level
- Workers
  - If Kubernetes pool: 20 worker pods to start (HPA to 50); measure and increase until backlog ≈ 0
  - If Process pool: ~250 worker pods (HPA or manual), no Job churn

If you can share:
- Your work pool type (Kubernetes vs Process)
- Any external API rate limits (RPS/concurrency)

…I can translate this into concrete manifests (Deployments/HPAs), Prefect settings, and a right-sized starting point for your cluster.
@Marvin I'm using kubernetes work pool type and no external API rate limit
thought for 57.7 seconds
Got it. With a Kubernetes work pool and 25k runs/hour (≈7/sec, ~210 concurrent), here's a concrete starting point that's worked well for high-volume, short-lived flows. The biggest constraint at this scale is your Kubernetes control plane (Job/Pod churn), so the plan focuses on minimizing control-plane pressure and keeping Prefect/DB steady.

**Server (API) pods**
- Replicas
  - 1 "api-services" pod (scheduler/foreman/etc. enabled)
  - 3 "api-only" pods (services disabled) behind the same Service
  - Add HPA on "api-only" up to 6 if API latency grows; keep "api-services" at 1
- Disable services on the "api-only" pods to prevent duplicate scheduling:
```
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=false
PREFECT_SERVER_SERVICES_FOREMAN_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
PREFECT_SERVER_SERVICES_TRIGGERS_ENABLED=false
PREFECT_SERVER_SERVICES_LATE_RUNS_ENABLED=false
PREFECT_SERVER_SERVICES_PAUSE_EXPIRATIONS_ENABLED=false
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=false
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=false
PREFECT_SERVER_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
```
- Optional scheduler tuning on “api-services” if you see a scheduling backlog (watch DB load):
```
PREFECT_SERVER_SERVICES_SCHEDULER_LOOP_SECONDS=10
PREFECT_SERVER_SERVICES_SCHEDULER_DEPLOYMENT_BATCH_SIZE=200
PREFECT_SERVER_SERVICES_SCHEDULER_INSERT_BATCH_SIZE=1000
```
- DB
  - Managed Postgres + PgBouncer. Start ~8 vCPU/32 GB RAM; scale by CPU/IOPS and locks.
  - Latency to the API should be low.
- Logs can dominate writes at this volume, so consider turning off logs to the API globally:
```
PREFECT_LOGGING_TO_API_ENABLED=false
```
- Docs: Server Helm and Settings reference

**Work pool and worker pods (Kubernetes type)**
- Pool concurrency limit: start at 300 (target is ~210 concurrent, with headroom)
```
prefect work-pool set-concurrency-limit <your-pool> 300
```
- Worker replicas: start with 20 workers; use HPA to scale to 60–100 based on CPU. Scale until your "Scheduled" backlog stays near zero and start latency is acceptable.
- Example HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefect-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefect-worker
  minReplicas: 20
  maxReplicas: 80
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```
- Worker polling knobs (only if pick-up latency/backlog persists; increases API/DB load):
```
PREFECT_WORKER_QUERY_SECONDS=5
PREFECT_WORKER_PREFETCH_SECONDS=30
```
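These are settings for the worker process itself, so they belong on the worker pods rather than the flow-run Jobs. A fragment of the worker Deployment's container spec (continuing the hypothetical `prefect-worker` Deployment sketched in the earlier answer):

```yaml
# excerpt: container env in the worker Deployment
env:
- name: PREFECT_API_URL
  value: http://prefect-api:4200/api          # placeholder Service name
- name: PREFECT_WORKER_QUERY_SECONDS
  value: "5"
- name: PREFECT_WORKER_PREFETCH_SECONDS
  value: "30"
```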
- Base job template hygiene to reduce churn:
  - Ensure cleanup: `spec.ttlSecondsAfterFinished: 300`
  - Use small resource requests for flow pods (e.g., 50–100m CPU, 128–256Mi)
  - `imagePullPolicy: IfNotPresent`; pre-pull images (DaemonSet) or use an always-warm, small image
  - Set `terminationGracePeriodSeconds: 5–10` to speed teardown
- Flow/worker logging
  - Keep flow logs minimal (WARNING/ERROR) or rely on external logging. The API/DB will be your bottleneck if every 30s run emits lots of logs.

**Kubernetes control plane considerations (critical at 25k Jobs/hour)**
- Ensure your cluster's API server, scheduler, and etcd are sized to handle ~7 Job creates/sec and ~7 Pod terminations/sec.
- Keep per-node pod density reasonable; short-lived pods can stress the kubelet. For 210 concurrent pods, use several nodes and avoid pushing node pod limits.
- Use Cluster Autoscaler to add nodes quickly; use a node pool with pre-pulled images if possible.

**Optional safety valves**
- Even without vendor rate limits, your network or upstream may saturate. If you see errors/timeouts, add a Global Concurrency Limit tag to the API-calling steps to cap concurrency in one place:
```
prefect concurrency-limit create external-api <cap>
```
  Then tag the relevant tasks/flows with `external-api`.
- If you continue to hit control-plane bottlenecks, consider:
  - Aggregating micro-flows into slightly larger batches
  - Switching to a Process work pool for these short runs (no per-run pod churn)

**Quick summary to try now**
- Server: 1 "api-services" + 3 "api-only", logs to API disabled, scheduler loop 10s if needed
- DB: managed Postgres + PgBouncer, start ~8 vCPU/32 GB
- Work pool: concurrency 300
- Workers: 20 replicas, HPA to 80
- Job template: ttlSecondsAfterFinished=300, small resources, pre-pulled images

If you'd like, I can share example Deployment/Service YAMLs for:
- "api-services" and "api-only" deployments (with the env vars above)
- Worker Deployment + HPA
- A base job template snippet for your Kubernetes work pool

Also, let me know your cluster size and Kubernetes version; control-plane capacity can change the recommended worker count significantly.