# ask-marvin
g
@Marvin I'm trying to improve latency to start flow runs. Ideally aiming to use Prefect for real-time inference orchestration. We're using the flow.serve() method to serve the flow and orchestrating with prefect-cloud. Our tests show 3-13s latency when running 10 flows back to back, never overlapping - keeping parallelism out of the question for now. Is there any way to improve this?
m
thought for 365.1 seconds
Great question — and your numbers make sense given the defaults. What you're seeing (3–13s to start a run) is largely explained by how flow.serve works in Prefect 3.x: it starts a lightweight local "runner server" that polls Prefect Cloud for new runs. By default, that poll happens every 10 seconds — so your pickup time will land somewhere between "immediate" and ~10s, plus network jitter and small orchestration overhead. Good news: you can tune this.

What to change
- Lower the runner poll interval
  - Set `PREFECT_RUNNER_POLL_FREQUENCY` to 1s (or lower if you can tolerate the extra API calls).
  - Start your served process with:
  ```bash
  PREFECT_RUNNER_POLL_FREQUENCY=1 python serve.py
  ```
  - Expect p50 near ~0.5–1.5s, p95 still subject to network/variance.
- If you use workers instead of serve (not required for serve), the equivalent knobs are:
  - `PREFECT_WORKER_QUERY_SECONDS` (default 10) → lower to 1s
  - `PREFECT_WORKER_PREFETCH_SECONDS` (default 10) → set to 0–1 if you don't want prefetch behavior to affect pickup
  - Example:
  ```bash
  PREFECT_WORKER_QUERY_SECONDS=1 PREFECT_WORKER_PREFETCH_SECONDS=0 prefect worker start -p <your-pool>
  ```

A few practical notes
- flow.serve keeps your code "warm", which is good — the main lever for latency here is the poll interval.
- If you need consistent sub-second starts, Prefect (as an orchestrator) isn't ideal as the first hop in the request path. Common pattern for "real-time inference":
  - Put a thin web service in front to accept requests and run the hot inference directly.
  - Use Prefect to orchestrate background/async work (fanout, enrichment, post-processing, retries) via events or API-triggered runs.
- Network proximity can matter a bit: keep your served process close to Prefect Cloud (US-based today) to shave off some variability.

Helpful references
- Runner and worker settings reference (poll, heartbeat, prefetch): Prefect 3 Settings reference
- flow.serve overview and patterns: Schedule your code quickly with flow.serve
- Worker concepts and health: Workers in Prefect 3

A couple of questions to fine-tune recommendations
- How are you triggering the runs (UI "Run", API, webhook, another flow)?
- Where is the `flow.serve()` process running (same region as your users / close to Prefect Cloud)?
- What p95/p99 start-time target do you need?

If you share that, I can suggest an exact setup and the lowest safe poll frequency for your traffic profile.
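For concreteness, here is a minimal sketch of what a serve.py for this setup could look like (the flow body, file name, and deployment name are placeholders, not your actual code):
```python
# serve.py: minimal sketch of a long-lived served flow (names are placeholders)
from prefect import flow


@flow
def inference_flow(payload: dict | None = None):
    # Stand-in for your real-time inference logic
    print(f"received payload: {payload}")


if __name__ == "__main__":
    # Starts the runner loop that polls Prefect Cloud for new runs of this
    # deployment; launch with PREFECT_RUNNER_POLL_FREQUENCY=1 for ~1s pickup.
    inference_flow.serve(name="realtime-inference")
```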
g
@Marvin the p95 should be 2s
m
thought for 85.6 seconds
Thanks — hitting p95 ≤ 2s is doable with flow.serve, but it requires tightening the polling window and keeping everything hot.

Concrete steps to reach p95 ~2s
- Lower the serve-runner poll frequency to 1s
  - flow.serve uses the "runner" which polls Prefect Cloud. Set:
  ```bash
  PREFECT_RUNNER_POLL_FREQUENCY=1  # default is 10
  ```
  - Start your serve process like:
  ```bash
  PREFECT_RUNNER_POLL_FREQUENCY=1 python your_script.py
  ```
  - With a 1s poll, the pickup time becomes roughly uniform(0–1s) plus network/overhead. In practice, p95 is typically ~1.5–2.0s in a stable environment.
- Keep the execution environment hot
  - Run `flow.serve()` in a long-lived process (no cold container/pod starts).
  - Avoid infrastructure spin-up (e.g., don't put serve behind a "start a pod per run" model).
  - Co-locate the serving process in a region with low RTT to Prefect Cloud.
- If you ever switch to workers (not required for serve), mirror the tuning:
  - Lower the query interval:
  ```bash
  PREFECT_WORKER_QUERY_SECONDS=1
  ```
  - Disable or reduce prefetch:
  ```bash
  PREFECT_WORKER_PREFETCH_SECONDS=0
  ```
  - Start the worker:
  ```bash
  PREFECT_WORKER_QUERY_SECONDS=1 PREFECT_WORKER_PREFETCH_SECONDS=0 prefect worker start -p <your-pool>
  ```

What to expect
- With `PREFECT_RUNNER_POLL_FREQUENCY=1`, you should see p50 ~0.5–1.0s and p95 landing close to 1.5–2.0s depending on network jitter and run startup overhead.
- If you need even tighter p95, you'll need extremely low network latency and minimal run initialization (imports, logging, credentials, etc.). Also keep the number of served deployments low or ensure the host is not CPU-constrained.

References
- Settings for runner/worker poll intervals and heartbeat: Prefect 3 Settings reference
- flow.serve overview and patterns: Schedule your code quickly with flow.serve
- Worker concepts/health: Workers in Prefect 3

If you can share:
- Where your served process runs (region, environment)
- How you trigger each run
- Any heavy initialization your flow does before the first task
…I can suggest a tighter config or pinpoint any remaining sources of >2s variance.
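For reference, if "how you trigger each run" ends up being a Python call, a hedged sketch using run_deployment could look like the following (the deployment name and parameters are placeholders; with timeout=0 the call returns as soon as the run is created instead of waiting for it to finish):
```python
# Sketch: trigger a run of a served deployment via the Prefect API
# (deployment name and parameters below are placeholders)
from prefect.deployments import run_deployment

if __name__ == "__main__":
    flow_run = run_deployment(
        name="inference-flow/realtime-inference",  # "<flow-name>/<deployment-name>"
        parameters={"payload": {"request_id": "123"}},
        timeout=0,  # return immediately after the run is created
    )
    print(f"created flow run {flow_run.id}")
```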
g
@Marvin what's the impact of disabling prefetch? It shouldn't really matter if the flows are triggered in real time.
• The served processes are in Europe, in an AWS ECS job
• runs are triggered with an API call
• no initialisation (the flow is just doing a print statement at this stage)
m
thought for 76.8 seconds
Short answer: prefetch won't help your case — and with flow.serve it doesn't apply at all.

Details
- Prefetch is a worker setting. It tells a worker to "claim" scheduled runs up to N seconds before their scheduled time so it can spin up infrastructure in advance. It's great for cron-like schedules on cold infra (e.g., K8s pods), but:
  - It does not accelerate immediately-triggered runs.
  - It does not apply to flows served with `flow.serve()` (served flows use a local runner, not a worker).
- If you were using workers, disabling prefetch is fine for real-time API-triggered runs; it won't slow you down.

Your setup and what to do
- Served processes in EU on ECS, API-triggered runs, trivial flow body:
  - The dominant factor is the serve-runner's polling interval. Set:
  ```bash
  PREFECT_RUNNER_POLL_FREQUENCY=1
  ```
  Then run your serve process as a long-lived ECS task:
  ```bash
  PREFECT_RUNNER_POLL_FREQUENCY=1 python your_script.py
  ```
  - With 1s polling and low RTT, p95 ~2s is realistic. Your previous 3–13s lines up with the default 10s poll.

Other small levers to stay under 2s p95
- Keep the ECS task hot and not CPU-throttled (reserve CPU/memory so Python imports/logging aren't delayed).
- Minimize network RTT to Prefect Cloud (EU → US adds ~100–150ms per round trip). If you can run the serve process closer to Prefect Cloud's region, you'll reduce variance.
- Avoid heavy logging on start and avoid any external calls before the first task log.

If you decide to try workers instead of serve
- Prefetch still won't help immediate triggers. Use faster polling:
  ```bash
  PREFECT_WORKER_QUERY_SECONDS=1 PREFECT_WORKER_PREFETCH_SECONDS=0 prefect worker start -p <your-pool>
  ```
- This gets similar p95 behavior to the serve-runner with 1s polling.

References
- Settings reference (runner/worker poll, heartbeat, prefetch): Prefect 3 Settings reference
- flow.serve overview: Schedule your code quickly with flow.serve
- Worker concepts: Workers in Prefect 3

If you flip the runner poll to 1s and still see p95 > 2s, share a quick timing breakdown (API request → run created → first flow log) and I'll help pinpoint any remaining overhead.
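If it helps, here is a rough sketch of how you could capture that breakdown from the client side (it reuses the placeholder deployment name from above and assumes Prefect 3.x; the 0.1s sleep only limits measurement resolution, it does not affect pickup):
```python
# Sketch: time "API request -> run created -> run picked up" for one trigger
# (deployment name is a placeholder)
import time

from prefect.client.orchestration import get_client
from prefect.deployments import run_deployment


def measure_start_latency(deployment: str = "inference-flow/realtime-inference") -> None:
    t0 = time.monotonic()
    flow_run = run_deployment(name=deployment, timeout=0)  # returns once the run exists
    t_created = time.monotonic()

    with get_client(sync_client=True) as client:
        while True:
            state = client.read_flow_run(flow_run.id).state
            # Leaving SCHEDULED/PENDING means the runner has claimed and started the run
            if state and not (state.is_scheduled() or state.is_pending()):
                break
            time.sleep(0.1)
    t_picked_up = time.monotonic()

    print(f"created after {t_created - t0:.2f}s, picked up after {t_picked_up - t0:.2f}s")


if __name__ == "__main__":
    measure_start_latency()
```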
g
@Marvin with PREFECT_WORKER_QUERY_SECONDS=1 PREFECT_WORKER_PREFETCH_SECONDS=0 prefect worker start -p <your-pool>, what would the deployment code look like? How do I set a deployment to pick this worker? How do I tell the worker to keep that flow warm (instead of downloading the code from GitHub, for example)?
m
thought for 247.4 seconds
Great questions — here's how to wire this up with a worker so you get fast pickup without fetching code from GitHub.

How a deployment "picks" a worker
- You don't target an individual worker directly. Instead:
  - Create a work pool (e.g., type "process").
  - Optionally create/use a dedicated work queue within that pool.
  - Start a worker pointed at that pool/queue.
  - Deploy your flow to that same pool/queue.
- Only workers polling that pool/queue will claim the run. If you want a single worker to handle these runs, use a dedicated pool or queue that only that worker listens to.

Keep the code warm (no GitHub pulls)
- Use a Process work pool and bake/mount your code into the long-lived worker container (your ECS service).
- Deploy the flow with a local entrypoint (no `.from_source("https://github.com/...")`), so the worker imports code already on disk.
- This avoids per-run git clone and avoids cold infra startup — the worker just spawns a subprocess to run the flow.

Step-by-step

1) Create a process work pool (one-time)
```bash
prefect work-pool create "realtime-pool" --type process
```

2) Start a worker with fast polling and an isolated queue
```bash
PREFECT_WORKER_QUERY_SECONDS=1 \
PREFECT_WORKER_PREFETCH_SECONDS=0 \
prefect worker start -p realtime-pool -q realtime
```

3) Deploy your flow to that pool/queue
- Simple example using the flow object directly:
```python
from prefect import flow

@flow
def my_flow():
    print("hello from realtime worker")

if __name__ == "__main__":
    # Create/update a deployment that routes to your worker's pool/queue
    my_flow.deploy(
        name="realtime-deployment",
        work_pool_name="realtime-pool",
        work_queue_name="realtime",  # must match the worker
        # no 'image' and no 'from_source' GitHub — code lives with the worker
    )
```
- If you prefer being explicit about the entrypoint resolved on disk (still local, no GitHub):
```python
from prefect import flow, get_run_logger
from prefect.flows import Flow

@flow
def my_flow():
    get_run_logger().info("hello from realtime worker")

if __name__ == "__main__":
    Flow.from_source(
        ".",                       # local filesystem (baked/mounted in the worker)
        "app/flows.py:my_flow",    # path:func your worker can import at runtime
    ).deploy(
        name="realtime-deployment",
        work_pool_name="realtime-pool",
        work_queue_name="realtime",
    )
```

Notes and best practices for 2s p95
- Keep the worker hot: run it as a long-lived ECS service/container with CPU/mem reserved.
- Co-locate the worker as close as possible (network-wise) to Prefect Cloud to reduce RTT.
- Using ECS "per-run" tasks (ECS work pool) will add several seconds of cold-start — not ideal for 2s p95. A long-lived Process worker in ECS is the right choice here.
- With `PREFECT_WORKER_QUERY_SECONDS=1`, expect p50 well under a second and p95 around ~1.5–2s for trivial flows. This aligns with your goal.

Docs and references
- Work pools and workers: Workers in Prefect 3
- Settings (query_seconds, prefetch_seconds): Prefect 3 Settings reference
- Deploy CLI (alternative to code-based deploys): Deploy flows with prefect deploy

If you want me to translate this to your exact repo layout (entrypoint paths, Dockerfile/Task Definition for ECS, etc.), share your file structure and how you build the ECS image and I'll tailor the snippets.
g
@Marvin I'm seeing worse latency compared to `.serve`. Is it normal?
Results for .serve: Min: 1.634s | Avg: 2.272s | Max: 2.944s
Results for the work pool based solution: Min: 2.419s | Avg: 3.248s | Max: 4.946s
Why is this? Any way to improve it?
m
thought for 69.7 seconds
Short answer: yes — it's normal for a worker-based setup to be a bit slower than flow.serve, even with fast polling. Serve tends to be the lowest-latency path.

Why workers are slower
- Extra orchestration hops: a worker claims the run, merges job variables, sets up the job, then spawns a new Python process to execute the flow. That process import + bootstrapping adds overhead.
- New interpreter per run: each run launches a fresh process (imports Prefect, loads your flow entrypoint, sets up logging/clients). Serve's runner is a lighter path and typically has less per-run setup.
- Any template "steps": if your process work pool's base job template includes steps (e.g., changing working directory, optional pip installs), each run pays for them.

How to close the gap
- Lower the poll window further (trade-off: more API calls)
  ```bash
  PREFECT_WORKER_QUERY_SECONDS=0.5 PREFECT_WORKER_PREFETCH_SECONDS=0 prefect worker start -p realtime-pool -q realtime
  ```
  This alone can shave ~0.25–0.5s off p95.
- Ensure no per-run setup work
  - Use a Process work pool and mount/bake the code into the long-lived ECS container (no git clone, no pip install in job template).
  - If you're using `Flow.from_source(".", "path:func")`, confirm your process pool's base job template does NOT include `pip_install_requirements` or custom shell steps.
  - Keep logging minimal at startup and avoid external calls in the flow's first lines.
- Give the worker CPU headroom
  - Reserve CPU/memory in ECS so the child Python process starts quickly (burstable throttling adds 100s of ms).
  - Pin the container to a performant instance class.
- Keep the queue isolated
  - Use a dedicated work queue for these runs so your worker isn't juggling other jobs.

Reality check vs serve
- Your numbers are consistent with expectations:
  - serve: Min ~1.6s, Avg ~2.3s, Max ~2.9s
  - worker: Min ~2.4s, Avg ~3.2s, Max ~4.9s
- Serve is usually the best option for p95 ≤ 2s. If that's the hard SLO, I'd stick with serve unless you need worker features (per-run isolation, different infra types, etc.).

If you want, paste your work pool base job template (or confirm you're on the default Process template) and your deployment snippet; I can help strip any unnecessary steps and see if we can squeeze another ~0.5–1.0s out of the worker path.

References
- Workers and work pools: Workers in Prefect 3
- Settings (query_seconds, prefetch_seconds): Prefect 3 Settings reference
- flow.serve overview: Schedule your code quickly with flow.serve
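If you'd rather check the template programmatically than paste it, a rough sketch (pool name taken from the examples above) could look like:
```python
# Sketch: print a work pool's base job template to spot extra per-run steps
# (pool name is assumed from the examples above)
import asyncio
import json

from prefect.client.orchestration import get_client


async def show_base_job_template(pool_name: str = "realtime-pool") -> None:
    async with get_client() as client:
        pool = await client.read_work_pool(pool_name)
        # Anything beyond the default process template (pip installs,
        # shell steps, etc.) adds latency to every run.
        print(json.dumps(pool.base_job_template, indent=2, default=str))


if __name__ == "__main__":
    asyncio.run(show_base_job_template())
```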
g
@Marvin I'm using the default template, and doing as you described above. Note that `Flow.from_source(".", "path.py:func")` is required though. Let's consider the `.serve()` approach. Is it possible to containerise it and run multiple containers? How does Prefect allocate jobs to different serve workers since there's no work pool?
m
thought for 60.8 seconds
Totally normal: serve is the lowest-latency path. You can absolutely containerize it and run multiple replicas.

How scaling with serve works (no work pool)
- Each serve process runs a lightweight "runner" that polls Prefect Cloud for runs on its served deployment(s).
- Multiple replicas with the same served deployment name will all poll and "first to claim wins." Prefect Cloud ensures a run is only claimed once.
- Use:
  - `limit`: max concurrent runs per process
  - `global_limit`: max concurrent runs across all replicas of that served deployment
- There's no sticky routing; under load, distribution is "best effort" based on who polls first.

Containerizing and running multiple replicas
- Build an image that contains your code and starts the serve loop. Run N replicas (e.g., ECS service with desired count N). Keep them hot.

Example serve code (works with Flow.from_source)
```python
from prefect.flows import Flow

# Keep your entrypoint explicit; no GitHub pulls
flow = Flow.from_source(".", "path.py:func")

if __name__ == "__main__":
    flow.serve(
        name="realtime-serve",     # shared name across all replicas
        limit=1,                   # 1 run at a time per container (tune as needed)
        global_limit=5,            # cap across all replicas (set to N or N*limit)
        pause_on_shutdown=False,   # don't pause the deployment if one replica exits
        # triggers=[...],          # optional: add a webhook trigger or others
        # tags=["realtime"],       # optional
    )
```

Dockerfile sketch
```dockerfile
FROM python:3.11-slim
WORKDIR /app
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1
# Add prefect and your deps
RUN pip install --no-cache-dir prefect
COPY . /app
# Prefect Cloud credentials/URL provided via environment in ECS
# Reduce poll interval for low-latency pickup
ENV PREFECT_RUNNER_POLL_FREQUENCY=1
CMD ["python", "serve_entrypoint.py"]
```

ECS deployment
- Run an ECS Service with desired count = number of replicas you want.
- Ensure CPU/memory reservations so the process stays hot.
- Provide `PREFECT_API_URL` and `PREFECT_API_KEY` env vars.
- Keep the task running; don't spin up per-run tasks.

Triggering runs
- After the first replica calls `serve`, Prefect Cloud shows a "served deployment" named `realtime-serve`. You can:
  - Trigger from the UI
  - Call the Prefect API / run_deployment endpoint
  - Add a webhook trigger (UI or via the `triggers=` parameter). Docs: Create a webhook

Routing/throughput tips
- Set `PREFECT_RUNNER_POLL_FREQUENCY=1` (or 0.5 if you can tolerate the extra API calls) in every replica.
- Use `limit=1` per replica for predictable scaling; set `global_limit` to (#replicas) for 1:1 concurrency.
- There's no work pool in this model; all replicas with the same `name` compete to claim the same served deployment's runs.
- If you need hard partitioning, run multiple served deployments with different names and route to the right one in your trigger logic.

References
- Flow serve overview: Schedule your code quickly with flow.serve
- Settings (runner poll): Prefect 3 Settings reference

If you share your ECS task/Service config, I can sanity-check environment vars and health checks.