# ask-marvin
@Marvin I’m running a self-hosted Prefect deployment on Kubernetes, using a process-type worker. The setup looks like this:
• Worker deployment named `prefect-process-worker-rs`
• 1 replica initially, autoscaled by an HPA based on CPU usage
• Each worker runs with `--limit 3` to restrict concurrency
• Connected to a single work pool (`process-worker`) and a single work queue (`default`)

The issue I’m facing is with horizontal scaling and run distribution:
• When I launch, say, 100 flow runs, the first worker pod picks up 3 runs (as expected) and starts working.
• CPU rises → the HPA spawns new worker pods (worker-2, worker-3, etc.).
• However, these new workers stay idle and don’t pick up any of the remaining queued runs.
• The runs sitting in the work queue remain assigned to the first worker or stay pending, even though the new pods are healthy and polling.
• Only when I trigger new flow runs do those new pods start taking work.

So it seems Prefect doesn’t immediately recognize new worker pods joining the pool, or already-queued runs aren’t redistributed when new workers appear. Is there a recommended way to make new workers pick up queued runs faster?
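For context, a worker Deployment matching this description might look roughly like the sketch below; the image, labels, and container command are assumptions based on the question, not the actual manifest.

```yaml
# Hypothetical Deployment reconstructed from the description above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-process-worker-rs
spec:
  replicas: 1                                 # scaled up by the HPA
  selector:
    matchLabels:
      app: prefect-process-worker
  template:
    metadata:
      labels:
        app: prefect-process-worker
    spec:
      containers:
        - name: worker
          image: prefecthq/prefect:3-latest   # assumed image tag
          # Process-type worker limited to 3 concurrent runs, polling the
          # `process-worker` pool's `default` queue.
          command:
            - prefect
            - worker
            - start
            - --pool
            - process-worker
            - --work-queue
            - default
            - --type
            - process
            - --limit
            - "3"
          env:
            - name: PREFECT_API_URL
              value: http://<your-prefect-api>   # self-hosted API endpoint
```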
Marvin
thought for 463.7 seconds
Thanks for the detailed setup; that really helps. What you’re seeing is consistent with how Prefect 3 workers claim work:
- When a worker polls a work queue, it “leases” deliveries from the server. Those deliveries are assigned to that worker and are not redistributed to other workers until the lease expires or the worker releases them.
- By default, workers both poll frequently and prefetch upcoming runs. That means your first pod can lease multiple runs ahead of time; when the HPA adds pods, those runs are still leased to the original worker, so the new pods will idle until new runs arrive or leases expire.

You can make new pods pick up queued runs faster by reducing (or disabling) prefetch and tightening the polling interval.

Practical steps

1) Turn down prefetch
- Set `PREFECT_WORKER_PREFETCH_SECONDS=0` on your worker pods so they only claim runs that are ready to start immediately, instead of reserving upcoming runs. This prevents the first worker from “owning” a backlog that newly spawned pods can’t see.

2) Increase polling frequency (with care)
- Lower `PREFECT_WORKER_QUERY_SECONDS` (e.g., 2–5 seconds) so new pods notice available work quickly. This increases API traffic, so find a balance that works for your cluster.

3) Pre-warm some replicas
- The HPA reacts only after CPU rises on the first pod. If you typically burst to high concurrency, consider a minimum replica count > 1 so work is distributed immediately across multiple pods rather than waiting for scale-out (see the HPA sketch after the example below).

4) Double-check pool/queue limits
- Ensure there isn’t a global work pool or tag-based concurrency limit that would cap total concurrent runs at 3; that would also keep new pods idle even if they’re healthy.

Kubernetes example (env on your worker Deployment)
```yaml
env:
  - name: PREFECT_API_URL
    value: http://<your-prefect-api>
  - name: PREFECT_API_KEY
    valueFrom:
      secretKeyRef:
        name: prefect-api
        key: key
  - name: PREFECT_WORKER_PREFETCH_SECONDS
    value: "0"
  - name: PREFECT_WORKER_QUERY_SECONDS
    value: "3"
```
Why this helps
- With prefetch disabled (or shortened), the first worker will not lease a large backlog; as soon as new pods come online, they’ll see those unclaimed runs and start them.
- Faster polling reduces the time-to-pickup for newly available runs.

Docs for reference
- Workers and how they poll/claim runs: Workers (Prefect 3.x)
- Work queues API shape (useful to understand deliveries/leases): Work queue API

A couple of clarifying questions to make sure there aren’t other blockers:
- Do you have any work pool/queue-level or tag-based concurrency limits configured?
- Roughly how long do the individual flow runs take, and are they scheduled “for now” or slightly in the future?

If you’d like, I can suggest HPA/KEDA patterns to scale on queue backlog (instead of CPU) so scale-out happens before the first worker gets saturated.
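As a sketch of that queue-backlog idea: a KEDA ScaledObject can target the worker Deployment and scale on a queue-depth metric instead of CPU. Everything below is illustrative; Prefect does not expose such a Prometheus metric out of the box, so the metric name and server address are assumptions that would need an exporter (or similar) behind them.

```yaml
# Illustrative KEDA ScaledObject: scales the worker Deployment on an assumed
# queue-depth gauge rather than CPU. The metric `prefect_queue_scheduled_runs`
# is hypothetical and would have to be provided by an exporter you run.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prefect-process-worker-rs
spec:
  scaleTargetRef:
    name: prefect-process-worker-rs          # the worker Deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        query: 'prefect_queue_scheduled_runs{queue="default"}' # assumed metric
        threshold: "3"   # roughly one worker per 3 queued runs, matching --limit 3
```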