# ask-community
m
We are seeing increased latency when starting multiple prefect jobs when submitting on the sub minute level (or running every minute). In Prefect 1 we never experienced this issue, but now are seeing more instances of conflicting runs starting anywhere from 1 to 3 minutes late. Has anyone had success in improving this? We have unlimited concurrency on our work pools
n
hi @Mitch - have you checked logs anywhere to investigate?
m
Hey @Nate - There are no explicit errors, we are running an older prefect version, but the timing of the logs is interesting. We started seeing latency with the 9:03am runs.
Not quite sure where to start, or if updating the Prefect version would resolve it.
That worker is running an old version, 2.10.20. This is the only warning I am seeing:
Copy code
Sep 03 14:05:41 sradock03 prefect[2672606]: <frozen runpy>:128: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
n
that warning should be able to be safely ignored. are you using cloud or hosting a server? if the latter, what version is it?
👍 1
m
We are using prefect cloud
n
hm ok, next question would be what type of worker is it? @Marvin i have to do something for a bit but can you help investigate this?
> We are seeing increased latency when starting multiple prefect jobs when submitting on the sub minute level (or running every minute). In Prefect 1 we never experienced this issue, but now are seeing more instances of conflicting runs starting anywhere from 1 to 3 minutes late. Has anyone had success in improving this? We have unlimited concurrency on our work pools
> That worker is running an old version 2.10.20
> We are using prefect cloud
m
âś… thought for 562.5 seconds
Got it — I’ll dig in and outline what to check and how to improve startup latency. A couple quick clarifications to make sure we focus on the right knobs:
- Can you confirm the work pool type (ECS vs Docker vs Process/Kubernetes) and whether your ECS tasks are on EC2 or Fargate? From your notes it sounds like ECS-on-EC2 and a worker on 2.10.20, but I want to be 100% sure.
- Roughly how many runs are scheduled per minute, and are multiple deployments hitting the same second?
- Do you see the delay primarily in “Submitted → Running” (infra/container startup) or already before that (“Scheduled → Pending/Submitted”)?

While you’re away, here’s what I recommend investigating and the likely levers to pull. These are safe to apply incrementally and should reduce 1–3 minute slippage for every-minute or sub-minute cadences.

What causes these late starts
- Worker poll cadence and prefetch window: if a run is created at :00 and the worker polls every 10–15s, worst-case it may not see it for one poll cycle. If infra startup (ECS task launch, image pull, app boot) adds 30–90s, you’ll see 1–3 minute “late” starts.
- Submission throughput: one worker can submit only so many runs at once; if you drop many runs simultaneously, submissions queue briefly.
- ECS startup time on EC2: capacity scaling, image pulls, and cold hosts are the usual sources of 60–120s startup.
- Hidden concurrency: pool-wide concurrency may be “unlimited”, but per-deployment or tag concurrency limits (or overlapping schedules) can still defer runs.

Immediate tuning (no code changes required)
- Lower the worker polling interval and increase the prefetch window: this reduces jitter and starts infra ahead of the scheduled time so it’s warm when the minute ticks over. On the worker host/container, set:
Copy code
export PREFECT_WORKER_QUERY_SECONDS=5
export PREFECT_WORKER_PREFETCH_SECONDS=60
# optional
export PREFECT_WORKER_HEARTBEAT_SECONDS=10
- Then restart your worker:
Copy code
prefect worker start -p <your-work-pool-name> -n <your-worker-name>
- Tradeoff: Lower poll intervals increase API chatter; keep it to a handful of workers.
- Run 2–3 workers against the same pool
  - This parallelizes submissions when many runs trigger at the same time.
  - Keep the same settings above on each worker.
- Verify there’s no other concurrency bottleneck
  - In Cloud UI: check the deployment(s) for per-deployment concurrency limits and tag-based concurrency rules.
  - Work pool shows “unlimited”, but tag/deployment constraints can still defer runs.
- Confirm current settings on the worker
  - On the worker machine/container:
Copy code
prefect config view --show-defaults
- Look for the “worker” section (query_seconds, prefetch_seconds, heartbeat_seconds). Share that snippet if you can.

ECS-on-EC2 specific improvements
- Keep capacity warm
  - Ensure your EC2 Auto Scaling Group has headroom or a warm pool so new tasks don’t wait for instance scale-up.
  - Use capacity providers with target capacity > 0 to keep a baseline of ready capacity.
- Reduce or skip image pulls
  - Use the ECS agent setting ECS_IMAGE_PULL_BEHAVIOR=prefer-cached on your EC2 hosts so tasks reuse cached images (see the sketch below).
  - Keep images slim and immutable tags stable to avoid re-pulls; consider a small “image warmer” workload to keep your image cached on hosts.
- Reduce container boot time
  - Minimize image size, entrypoint work, dependency install on start, and any external calls (e.g., boto credential fetch patterns) during cold start.

Upgrade recommendation
- Your worker is on 2.10.20, which is quite old. There have been numerous improvements to worker polling and config since then. I recommend:
  - Short term: Upgrade to the latest 2.x worker to keep your existing deployments stable.
  - Better: Move to Prefect 3.x workers when feasible — 3.x exposes clear worker settings like query_seconds and prefetch_seconds and includes reliability/performance improvements.
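Regarding the image-pull setting above, a minimal sketch for an EC2 container instance (this assumes an Amazon Linux 2-style host; the registry/image name is a placeholder):
Copy code
# Tell the ECS agent to reuse cached images instead of pulling for every task
echo 'ECS_IMAGE_PULL_BEHAVIOR=prefer-cached' | sudo tee -a /etc/ecs/ecs.config
sudo systemctl restart ecs   # restart the ECS agent so the setting takes effect

# Optionally pre-pull the flow image so the first run of the day skips the pull
docker pull your-registry/your-image:tag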
- Settings reference: Prefect 3 settings reference; Worker healthchecks (3.x)
- 2.x docs for context: Work pools and workers (2.x); ECS guide (2.x)

Helpful artifacts to confirm where the time is spent
- Prefect UI run timeline around the minute boundary (Scheduled → Pending/Submitted → Running).
- Worker logs around that time window (look for polling, prefetch hits, and submission).
- ECS events for a representative run (task created, provisioning, container pulled, running) and CloudWatch metrics on image pull/launch latency.

If you can share:
- The work pool type + worker start command/env for that 2.10.20 worker
- A sample run ID that started ~1–3 minutes late
- The output of prefect config view --show-defaults for the worker “worker.*” section

…I can pinpoint whether the slippage is primarily worker polling, submission throughput, or ECS startup, and give you an exact set of changes.
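If it helps, a quick sketch for collecting those artifacts from the worker host (assumes a reasonably recent 2.x CLI; the run ID is a placeholder):
Copy code
# Worker-related settings currently in effect on this host
prefect config view --show-defaults | grep -i worker

# State timeline and logs for a run that started late
prefect flow-run inspect <late-flow-run-id>
prefect flow-run logs <late-flow-run-id>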
m
@Marvin We are using a Docker worker and used
prefect worker start -t docker -p ny4-worker-a --install-policy if-not-present
to start the worker. We are going to upgrade the Prefect version and add another worker (we have two as of now) to help reduce this. Here are the worker env variables:
PREFECT_WORKER_HEARTBEAT_SECONDS='30.0' (from defaults)
PREFECT_WORKER_PREFETCH_SECONDS='10.0' (from defaults)
PREFECT_WORKER_QUERY_SECONDS='10.0' (from defaults)
m
âś… thought for 80.5 seconds
Perfect — thanks for confirming. So this pool is using the Docker worker, started like:
Copy code
prefect worker start -t docker -p ny4-worker-a --install-policy if-not-present
And your current worker settings are defaults:
- PREFECT_WORKER_QUERY_SECONDS=10
- PREFECT_WORKER_PREFETCH_SECONDS=10
- PREFECT_WORKER_HEARTBEAT_SECONDS=30

Given your every-minute/sub-minute cadence and bursts at the same second, the biggest wins for reducing 1–3 minute slippage are:

1) Tighten polling and increase prefetch
- This reduces pickup jitter and gives the worker time to get containers started before the scheduled minute.
- Try:
Copy code
export PREFECT_WORKER_QUERY_SECONDS=5
export PREFECT_WORKER_PREFETCH_SECONDS=60
export PREFECT_WORKER_HEARTBEAT_SECONDS=10
Then restart each worker:
Copy code
PREFECT_WORKER_QUERY_SECONDS=5 PREFECT_WORKER_PREFETCH_SECONDS=60 PREFECT_WORKER_HEARTBEAT_SECONDS=10 \
  prefect worker start -t docker -p ny4-worker-a -n ny4-worker-a-1
Notes:
- A larger prefetch window can start runs slightly early (by up to the prefetch value). If that’s not acceptable, use 20–30s instead of 60s.
- Lower polling means more frequent API calls; keep it reasonable (3–5s is a good target).

2) Add more workers for submission parallelism
- You’re adding another worker — good. I’d run at least 3–4 workers for a pool that fans out several runs at the same second. This spreads submissions and container launches.
- Keep the same PREFECT_WORKER_* settings on each worker.

3) Make Docker startup as fast as possible
- Ensure your deployment’s Docker infrastructure uses no/limited pulls at runtime:
  - Set image_pull_policy to IF_NOT_PRESENT (or NEVER if you manage cache out-of-band) so each run doesn’t re-pull.
- Pre-pull the image on the worker host(s) and keep it warm:
Copy code
docker pull your-registry/your-image:tag
- Keep tags immutable so the Docker cache is actually reusable.
- Keep images slim and reduce boot work in entrypoints to cut cold-start time.

4) Stagger schedules to avoid a single-second “thundering herd”
- If multiple deployments fire at :00, add a 1–10s offset or jitter to spread load across the minute.

5) Double-check there aren’t hidden concurrency limits
- Pool shows “unlimited”, but per-deployment or tag concurrency limits can still delay runs. Quickly check those in the UI, or with the CLI sketch below.

If you’re upgrading Prefect:
- Short term: upgrading your 2.10.20 Docker workers to a recent 2.x will help with worker reliability and settings behavior.
- Medium term: 3.x has ongoing improvements and clearer worker settings. If/when you move to 3.x, remember to use prefect deploy (not prefect deployment build) and flow.from_source(...).deploy(...) in code.
- Settings reference: 3.x Settings; worker healthchecks: 3.x Worker healthchecks
- 2.x work pools/workers docs (for context): 2.x Work pools and workers

Would you be okay with runs starting slightly early (e.g., up to 30–60s) if it eliminates the late starts? If yes, we’ll keep PREFETCH at 60; if not, we’ll tune it to ~20–30s and lean more on worker count and image caching.

If you can share:
- The deployment’s Docker image and pull policy settings
- Whether multiple deployments hit the same second
- A representative run where “Scheduled → Submitted/Running” shows the delay

…I can fine-tune the exact values (prefetch/polling and worker count) for your load profile.
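For point 5, a rough CLI sketch (the pool name is from this thread; the deployment name is a placeholder, and output varies by version):
Copy code
# Work pool: confirm no concurrency limit is set
prefect work-pool inspect ny4-worker-a

# Tag-based concurrency limits that could gate runs
prefect concurrency-limit ls

# Per-deployment details, including the Docker image and pull policy
prefect deployment inspect "your-flow/your-deployment"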
n
seems unlikely to me that changing PREFECT_WORKER_QUERY_SECONDS is the right move here
unless you're scheduling work very quickly and wanting it to run almost real time, i bet something else is causing the lag
i guess maybe the pull policy?
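one way to sanity check that from the worker host (image name is a placeholder):
Copy code
# is the image already cached on this host, and how old is it?
docker images your-registry/your-image

# watch for pull events while a scheduled run kicks off
docker events --filter type=image --filter event=pull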
m
the fastest we schedule is every minute, then when they are late they lock each other out. I think upgrading version then adding one more worker we can see if that helps? The pull policy should be good since it's doing pull if not present (the image is present). We really only have the issue near the start of day or at 9am or so when multiple jobs kick off at once. Could be some latency with the git clone as well through VPN and such which contribute to initialization?
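I might try timing the clone directly from the worker host, something like this (repo URL is a placeholder):
Copy code
# rough timing of a shallow clone over the same VPN path
time git clone --depth 1 https://your-git-host/your-org/your-repo.git /tmp/clone-timing-test
rm -rf /tmp/clone-timing-test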
👍 1
n
> I think upgrading version then adding one more worker we can see if that helps?
im not sure that adding a worker would help. is the worker having trouble submitting containers in time or are the containers slow to begin their flow run processes? if the latter, adding the worker doesn't seem like itd help
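one way to tell those apart for a specific late run (IDs and container name are placeholders):
Copy code
# prefect's view: when the run moved Scheduled -> Pending -> Running
prefect flow-run inspect <late-flow-run-id>

# docker's view: when the container was created vs when it actually started
docker ps -a --format 'table {{.Names}}\t{{.CreatedAt}}\t{{.Status}}'
docker inspect -f '{{.Created}} -> {{.State.StartedAt}}' <container-name>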
> Could be some latency with the git clone as well through VPN and such which contribute to initialization?
yea could be. hmm
m
Most of the time we aren't seeing the late runs (even for this same job later in the day), just for these instances. The timing of submission is just interesting - removed account and workspace from below
Copy code
{
  "id": "068b84aa-c31a-7d34-8000-58bae7094837",
  "account": redacted,
  "event": "prefect.flow-run.Late",
  "occurred": "2025-09-03T14:03:24.194Z",
  "payload": {
    "intended": {
      "from": "SCHEDULED",
      "to": "SCHEDULED"
    },
    "initial_state": {
      "type": "SCHEDULED",
      "name": "Scheduled",
      "message": "Flow run scheduled",
      "timestamp": "2025-09-03T14:00:13.348186+00:00",
      "id": "068b849e-d591-7f36-8000-99fd248e2de8"
    },
    "validated_state": {
      "type": "SCHEDULED",
      "name": "Late",
      "timestamp": "2025-09-03T14:03:24.194071+00:00",
      "id": "068b84aa-c31a-7d34-8000-58bae7094837"
    }
  },
  "received": "2025-09-03T14:03:24.437Z",
  "related": [
    {
      "prefect.resource.id": "prefect.flow.7097c204-5aa4-467c-ab0b-00b2bde718c0",
      "prefect.resource.role": "flow",
      "prefect.resource.name": "executions-strategy-intraday"
    },
    {
      "prefect.resource.id": "prefect.deployment.965785f9-bcb5-436f-bca5-591aecb78f01",
      "prefect.resource.role": "deployment",
      "prefect.resource.name": "executions-strategy-intraday",
      "prefect.deployment.version-id": "068b7425-f2b4-7bf5-8000-2cbc10c14534",
      "prefect.deployment.version-type": "prefect:simple",
      "prefect.deployment.version": ""
    },
    {
      "prefect.resource.id": "prefect.work-queue.8868d7c0-be7c-432e-bbfe-6c84ab546d5c",
      "prefect.resource.role": "work-queue",
      "prefect.resource.name": "default"
    },
    {
      "prefect.resource.id": "prefect.work-pool.a39b04c4-0c5f-46c6-aaf6-2d1b8029206f",
      "prefect.resource.role": "work-pool",
      "prefect.resource.name": "ny4-worker-a",
      "prefect.work-pool.type": "docker"
    },
    {
      "prefect.resource.id": "prefect.tag.auto-scheduled",
      "prefect.resource.role": "tag"
    },
    {
      "prefect.resource.id": "prefect.schedule.a016c5aa-7b0c-4b50-8505-4a794b3df5c6",
      "prefect.resource.role": "creator",
      "prefect.resource.name": "CronSchedule"
    }
  ],
  "resource": {
    "prefect.resource.id": "prefect.flow-run.068b849e-d591-7e93-8000-c30ec2d987e1",
    "prefect.resource.name": "rainbow-beetle",
    "prefect.run-count": "0",
    "prefect.state-message": "",
    "prefect.state-name": "Late",
    "prefect.state-timestamp": "2025-09-03T14:03:24.194071+00:00",
    "prefect.state-type": "SCHEDULED"
  },
  "workspace": redacted
}
Scheduled vs validated vs occurred
Any thoughts here @Nate, based on the above, on scheduled vs validated vs occurred?
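For reference, pulling the three timestamps out of that event (assuming jq, with the JSON above saved as late_event.json):
Copy code
jq -r '"scheduled state created: \(.payload.initial_state.timestamp)",
       "marked Late:             \(.payload.validated_state.timestamp)",
       "event occurred:          \(.occurred)"' late_event.json
# the gap between the first two lines is how long the run sat in Scheduled before being flagged Late (~3m11s here)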
n
hi have you opened an issue where you’ve documented this? it’s hard to keep track of all the slack threads
👍 1
issues / discussions are the best place to make noise if you feel there’s a bug
m
No, I will look into that, just wondering if this was something on our end in terms of configuration or something else