We are seeing increased latency when starting multiple prefe Prefect Community #ask-community

We are seeing increased latency when starting mult...

Mitch

09/03/2025, 2:24 PM

We are seeing increased latency when starting multiple prefect jobs when submitting on the sub minute level (or running every minute). In Prefect 1 we never experienced this issue, but now are seeing more instances of conflicting runs starting anywhere from 1 to 3 minutes late. Has anyone had success in improving this? We have unlimited concurrency on our work pools

Nate

09/03/2025, 2:29 PM

hi @Mitch - have you checked logs anywhere to investigate?

Mitch

09/03/2025, 2:52 PM

Hey @Nate - There are no explicit errors, we are running an older prefect version, but the timing of the logs is interesting. We started seeing latency with the 9:03am runs.

Mitch

09/03/2025, 2:52 PM

Not quite sure where to start, or if updating prefect version would resolve it

Mitch

09/03/2025, 3:08 PM

That worker is running an old version 2.10.20 This is the only warning I am seeing

Copy code

Sep 03 14:05:41 sradock03 prefect[2672606]: <frozen runpy>:128: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour

Nate

09/03/2025, 3:09 PM

that warning should be able to be safely ignored. are you using cloud or hosting a server? if the latter, what version is it?

👍 1

Mitch

09/03/2025, 3:10 PM

We are using prefect cloud

Nate

09/03/2025, 3:12 PM

hm ok, next question would be what type of worker is it? @Marvin i have to do something for a bit but can you help investigate this?

We are seeing increased latency when starting multiple prefect jobs when submitting on the sub minute level (or running every minute). In Prefect 1 we never experienced this issue, but now are seeing more instances of conflicting runs starting anywhere from 1 to 3 minutes late. Has anyone had success in improving this? We have unlimited concurrency on our work pools

That worker is running an old version 2.10.20

We are using prefect cloud

Marvin

09/03/2025, 3:12 PM

✅ thought for 562.5 seconds

Marvin

09/03/2025, 3:22 PM

Got it — I’ll dig in and outline what to check and how to improve startup latency. A couple quick clarifications to make sure we focus the right knobs: - Can you confirm the work pool type (ECS vs Docker vs Process/Kubernetes) and whether your ECS tasks are on EC2 or Fargate? From your notes it sounds like ECS-on-EC2 and a worker on 2.10.20, but I want to be 100% sure. - Roughly how many runs are scheduled per minute, and are multiple deployments hitting the same second? - Do you see the delay primarily in “Submitted → Running” (infra/container startup) or already before that (“Scheduled → Pending/Submitted”)? While you’re away, here’s what I recommend investigating and the likely levers to pull. These are safe to apply incrementally and should reduce 1–3 minute slippage for every-minute or sub-minute cadences. What causes these late starts - Worker poll cadence and prefetch window: If a run is created at 000000 and the worker polls every 10–15s, worst-case it may not see it for one poll cycle. If infra startup (ECS task launch, image pull, app boot) adds 30–90s, you’ll see 1–3 minute “late” starts. - Submission throughput: One worker can submit only so many runs at once; if you drop many runs simultaneously, submissions queue briefly. - ECS startup time on EC2: Capacity scaling, image pulls, and cold hosts are the usual sources of 60–120s startup. - Hidden concurrency: Pool-wide may be “unlimited” but per-deployment or tag concurrency limits (or overlapping schedules) can still defer runs. Immediate tuning (no code changes required) - Lower worker polling interval and increase prefetch window - This reduces jitter and starts infra ahead of the scheduled time so it’s warm when the minute ticks over. - On the worker host/container, set:

Copy code

export PREFECT_WORKER_QUERY_SECONDS=5
    export PREFECT_WORKER_PREFETCH_SECONDS=60
    # optional
    export PREFECT_WORKER_HEARTBEAT_SECONDS=10

- Then restart your worker:

Copy code

prefect worker start -p <your-work-pool-name> -n <your-worker-name>

- Tradeoff: Lower poll intervals increase API chatter; keep it to a handful of workers. - Run 2–3 workers against the same pool - This parallelizes submissions when many runs trigger at the same time. - Keep the same settings above on each worker. - Verify there’s no other concurrency bottleneck - In Cloud UI: check the deployment(s) for per-deployment concurrency limits and tag-based concurrency rules. - Work pool shows “unlimited”, but tag/deployment constraints can still defer runs. - Confirm current settings on the worker - On the worker machine/container:

Copy code

prefect config view --show-defaults

- Look for the “worker” section (query_seconds, prefetch_seconds, heartbeat_seconds). Share that snippet if you can. ECS-on-EC2 specific improvements - Keep capacity warm - Ensure your EC2 Auto Scaling Group has headroom or a warm pool so new tasks don’t wait for instance scale-up. - Use capacity providers with target capacity > 0 to keep a baseline of ready capacity. - Reduce or skip image pulls - Use ECS Agent setting

ECS_IMAGE_PULL_BEHAVIOR=prefer-cached

on your EC2 hosts so tasks reuse cached images. - Keep images slim and immutable tags stable to avoid re-pulls; consider a small “image warmer” workload to keep your image cached on hosts. - Reduce container boot time - Minimize image size, entrypoint work, dependency install on start, and any external calls (e.g., boto credential fetch patterns) during cold start. Upgrade recommendation - Your worker is on 2.10.20, which is quite old. There have been numerous improvements to worker polling and config since then. I recommend: - Short term: Upgrade to the latest 2.x worker to keep your existing deployments stable. - Better: Move to Prefect 3.x workers when feasible — 3.x exposes clear worker settings like query_seconds and prefetch_seconds and includes reliability/performance improvements.

Marvin

09/03/2025, 3:22 PM

- Settings reference: - Prefect 3 settings reference - Worker healthchecks (3.x) - 2.x docs for context: - Work pools and workers (2.x) - ECS guide (2.x) Helpful artifacts to confirm where the time is spent - Prefect UI run timeline around the minute boundary (Scheduled → Pending/Submitted → Running). - Worker logs around that time window (look for polling, prefetch hits, and submission). - ECS events for a representative run (task created, provisioning, container pulled, running) and CloudWatch metrics on image pull/launch latency. If you can share: - The work pool type + worker start command/env for that 2.10.20 worker - A sample run ID that started ~1–3 minutes late - The output of

prefect config view --show-defaults

for the worker “worker.*” section …I can pinpoint whether the slippage is primarily worker polling, submission throughput, or ECS startup, and give you an exact set of changes.

Mitch

09/03/2025, 3:37 PM

@Marvin We are using docker worker and used

prefect worker start -t docker -p ny4-worker-a --install-policy if-not-present

to start the worker. We are going to upgrade the prefect version and add another worker (we have two as of now) to help reduce this. Here are the worker env variables PREFECT_WORKER_HEARTBEAT_SECONDS='30.0' (from defaults) PREFECT_WORKER_PREFETCH_SECONDS='10.0' (from defaults) PREFECT_WORKER_QUERY_SECONDS='10.0' (from defaults)

Marvin

09/03/2025, 3:37 PM

✅ thought for 80.5 seconds

Marvin

09/03/2025, 3:39 PM

Perfect — thanks for confirming. So this pool is using the Docker worker, started like:

Copy code

prefect worker start -t docker -p ny4-worker-a --install-policy if-not-present

And your current worker settings are defaults: - PREFECT_WORKER_QUERY_SECONDS=10 - PREFECT_WORKER_PREFETCH_SECONDS=10 - PREFECT_WORKER_HEARTBEAT_SECONDS=30 Given your every-minute/sub-minute cadence and bursts at the same second, the biggest wins for reducing 1–3 minute slippage are: 1) Tighten polling and increase prefetch - This reduces pickup jitter and gives the worker time to get containers started before the scheduled minute. - Try:

Copy code

export PREFECT_WORKER_QUERY_SECONDS=5
  export PREFECT_WORKER_PREFETCH_SECONDS=60
  export PREFECT_WORKER_HEARTBEAT_SECONDS=10

Then restart each worker:

Copy code

PREFECT_WORKER_QUERY_SECONDS=5 PREFECT_WORKER_PREFETCH_SECONDS=60 PREFECT_WORKER_HEARTBEAT_SECONDS=10 \
  prefect worker start -t docker -p ny4-worker-a -n ny4-worker-a-1

Notes: - A larger prefetch window can start runs slightly early (by up to the prefetch value). If that’s not acceptable, use 20–30s instead of 60s. - Lower polling means more frequent API calls; keep it reasonable (3–5s is a good target). 2) Add more workers for submission parallelism - You’re adding another worker — good. I’d run at least 3–4 workers for a pool that fans out several runs at the same second. This spreads submissions and container launches. - Keep the same

PREFECT_WORKER_*

settings on each worker. 3) Make Docker startup as fast as possible - Ensure your deployment’s Docker infrastructure uses no/limited pulls at runtime: - Set image_pull_policy to IF_NOT_PRESENT (or NEVER if you manage cache out-of-band) so each run doesn’t re-pull. - Pre-pull the image on the worker host(s) and keep it warm:

Copy code

docker pull your-registry/your-image:tag

- Keep tags immutable so the Docker cache is actually reusable. - Keep images slim and reduce boot work in entrypoints to cut cold-start time. 4) Stagger schedules to avoid a single-second “thundering herd” - If multiple deployments fire at :00, add 1–10s offset or jitter to spread load across the minute. 5) Double-check there aren’t hidden concurrency limits - Pool shows “unlimited”, but per-deployment or tag concurrency limits can still delay runs. Quick check those in the UI. If you’re upgrading Prefect: - Short term: upgrading your 2.10.20 Docker workers to a recent 2.x will help with worker reliability and settings behavior. - Medium term: 3.x has ongoing improvements and clearer worker settings. If/when you move to 3.x, remember: - Use

prefect deploy

(not

prefect deployment build

) and

flow.from_source(...).deploy(...)

in code. - Settings reference: 3.x Settings and worker healthchecks: 3.x Worker healthchecks - 2.x work pools/workers docs (for context): 2.x Work pools and workers Would you be okay with runs starting slightly early (e.g., up to 30–60s) if it eliminates the late starts? If yes, we’ll keep PREFETCH at 60; if not, we’ll tune it to ~20–30s and lean more on worker count and image caching. If you can share: - The deployment’s Docker image and pull policy settings - Whether multiple deployments hit the same second - A representative run where “Scheduled → Submitted/Running” shows the delay …I can fine-tune the exact values (prefetch/polling and worker count) for your load profile.

Nate

09/03/2025, 3:40 PM

seems unlikely to me that changing

PREFECT_WORKER_QUERY_SECONDS

is the right move here

Nate

09/03/2025, 3:41 PM

unless you're scheduling work very quickly and wanting it to run almost real time, i bet something else is causing the lag

Nate

09/03/2025, 3:42 PM

i guess maybe the pull policy?

Mitch

09/03/2025, 3:44 PM

the fastest we scheudule is every minute, then when they are late they lock each other out. I think upgrading version then adding one more worker we can see if that helps? The pull policy should be good since it's doing pull if not present (the image is present). We really only have the issue near start of day or at 9am or so when multiple jobs kick off at once. Could be some latency with the git clone as well through VPN and such which contribute to initialization?

👍 1

Nate

09/03/2025, 3:46 PM

I think upgrading version then adding one more worker we can see if that helps?

im not sure that adding a worker would help. is the worker having trouble submitting containers in time or are the containers slow to begin their flow run processes? if the latter, adding the worker doesn't seem like itd help

Could be some latency with the git clone as well through VPN and such which contribute to initialization?

yea could be. hmm

Mitch

09/03/2025, 4:10 PM

Most of the time we aren't seeing the the late runs (even for this same job later in the day), just for these instances. The timing of submission is just interesting - removed account and workspace from below

Copy code

{
  "id": "068b84aa-c31a-7d34-8000-58bae7094837",
  "account": redacted,
  "event": "prefect.flow-run.Late",
  "occurred": "2025-09-03T14:03:24.194Z",
  "payload": {
    "intended": {
      "from": "SCHEDULED",
      "to": "SCHEDULED"
    },
    "initial_state": {
      "type": "SCHEDULED",
      "name": "Scheduled",
      "message": "Flow run scheduled",
      "timestamp": "2025-09-03T14:00:13.348186+00:00",
      "id": "068b849e-d591-7f36-8000-99fd248e2de8"
    },
    "validated_state": {
      "type": "SCHEDULED",
      "name": "Late",
      "timestamp": "2025-09-03T14:03:24.194071+00:00",
      "id": "068b84aa-c31a-7d34-8000-58bae7094837"
    }
  },
  "received": "2025-09-03T14:03:24.437Z",
  "related": [
    {
      "prefect.resource.id": "prefect.flow.7097c204-5aa4-467c-ab0b-00b2bde718c0",
      "prefect.resource.role": "flow",
      "prefect.resource.name": "executions-strategy-intraday"
    },
    {
      "prefect.resource.id": "prefect.deployment.965785f9-bcb5-436f-bca5-591aecb78f01",
      "prefect.resource.role": "deployment",
      "prefect.resource.name": "executions-strategy-intraday",
      "prefect.deployment.version-id": "068b7425-f2b4-7bf5-8000-2cbc10c14534",
      "prefect.deployment.version-type": "prefect:simple",
      "prefect.deployment.version": ""
    },
    {
      "prefect.resource.id": "prefect.work-queue.8868d7c0-be7c-432e-bbfe-6c84ab546d5c",
      "prefect.resource.role": "work-queue",
      "prefect.resource.name": "default"
    },
    {
      "prefect.resource.id": "prefect.work-pool.a39b04c4-0c5f-46c6-aaf6-2d1b8029206f",
      "prefect.resource.role": "work-pool",
      "prefect.resource.name": "ny4-worker-a",
      "prefect.work-pool.type": "docker"
    },
    {
      "prefect.resource.id": "prefect.tag.auto-scheduled",
      "prefect.resource.role": "tag"
    },
    {
      "prefect.resource.id": "prefect.schedule.a016c5aa-7b0c-4b50-8505-4a794b3df5c6",
      "prefect.resource.role": "creator",
      "prefect.resource.name": "CronSchedule"
    }
  ],
  "resource": {
    "prefect.resource.id": "prefect.flow-run.068b849e-d591-7e93-8000-c30ec2d987e1",
    "prefect.resource.name": "rainbow-beetle",
    "prefect.run-count": "0",
    "prefect.state-message": "",
    "prefect.state-name": "Late",
    "prefect.state-timestamp": "2025-09-03T14:03:24.194071+00:00",
    "prefect.state-type": "SCHEDULED"
  },
  "workspace": redacted
}

Mitch

09/03/2025, 4:11 PM

Scheduled vs validated vs occurred

Mitch

09/09/2025, 1:54 PM

Any thoughts here @Nate based on the above on scheduled vs validated vs occurred?

Nate

09/09/2025, 1:59 PM

hi have you opened an issue where you’ve documented this? it’s hard to keep track of all the slack threads

👍 1

Nate

09/09/2025, 2:00 PM

issues / discussions are the best place to make noise if you feel there’s a bug

Mitch

09/09/2025, 2:05 PM

No, I will look into that, just wondering if this was an us thing in terms of configuration or something else

Mitch

10/15/2025, 4:19 PM

@Marvin Based on the following event data and the occurred, received, initial state, and other metrics what the root cause of the delay was? For example: - Scheduled start vs Created: If Created is minutes after Scheduled, it's scheduler run creation lag at the boundary. - Created vs Actual start: If Created is on time but Start is late, it's worker pickup/capacity.

Copy code

{
  "id": "068b84aa-c31a-7d34-8000-58bae7094837",
  "account": redacted,
  "event": "prefect.flow-run.Late",
  "occurred": "2025-09-03T14:03:24.194Z",
  "payload": {
    "intended": {
      "from": "SCHEDULED",
      "to": "SCHEDULED"
    },
    "initial_state": {
      "type": "SCHEDULED",
      "name": "Scheduled",
      "message": "Flow run scheduled",
      "timestamp": "2025-09-03T14:00:13.348186+00:00",
      "id": "068b849e-d591-7f36-8000-99fd248e2de8"
    },
    "validated_state": {
      "type": "SCHEDULED",
      "name": "Late",
      "timestamp": "2025-09-03T14:03:24.194071+00:00",
      "id": "068b84aa-c31a-7d34-8000-58bae7094837"
    }
  },
  "received": "2025-09-03T14:03:24.437Z",
  "related": [
    {
      "prefect.resource.id": "prefect.flow.7097c204-5aa4-467c-ab0b-00b2bde718c0",
      "prefect.resource.role": "flow",
      "prefect.resource.name": "executions-strategy-intraday"
    },
    {
      "prefect.resource.id": "prefect.deployment.965785f9-bcb5-436f-bca5-591aecb78f01",
      "prefect.resource.role": "deployment",
      "prefect.resource.name": "executions-strategy-intraday",
      "prefect.deployment.version-id": "068b7425-f2b4-7bf5-8000-2cbc10c14534",
      "prefect.deployment.version-type": "prefect:simple",
      "prefect.deployment.version": ""
    },
    {
      "prefect.resource.id": "prefect.work-queue.8868d7c0-be7c-432e-bbfe-6c84ab546d5c",
      "prefect.resource.role": "work-queue",
      "prefect.resource.name": "default"
    },
    {
      "prefect.resource.id": "prefect.work-pool.a39b04c4-0c5f-46c6-aaf6-2d1b8029206f",
      "prefect.resource.role": "work-pool",
      "prefect.resource.name": "ny4-worker-a",
      "prefect.work-pool.type": "docker"
    },
    {
      "prefect.resource.id": "prefect.tag.auto-scheduled",
      "prefect.resource.role": "tag"
    },
    {
      "prefect.resource.id": "prefect.schedule.a016c5aa-7b0c-4b50-8505-4a794b3df5c6",
      "prefect.resource.role": "creator",
      "prefect.resource.name": "CronSchedule"
    }
  ],
  "resource": {
    "prefect.resource.id": "prefect.flow-run.068b849e-d591-7e93-8000-c30ec2d987e1",
    "prefect.resource.name": "rainbow-beetle",
    "prefect.run-count": "0",
    "prefect.state-message": "",
    "prefect.state-name": "Late",
    "prefect.state-timestamp": "2025-09-03T14:03:24.194071+00:00",
    "prefect.state-type": "SCHEDULED"
  },
  "workspace": redacted
}

Marvin

10/15/2025, 4:20 PM

✅ thought for 69.1 seconds

Marvin

10/15/2025, 4:21 PM

Short answer: This was a worker pickup/capacity delay, not a scheduler delay. Why: - Initial SCHEDULED state timestamp: 2025-09-03T140013.348Z. That’s ~13 seconds after the cron boundary — not “minutes,” so no meaningful scheduler-creation lag. - The flow run was still SCHEDULED when it was marked Late at 2025-09-03T140324.194Z (~3m 11s later). Late fires when a run remains SCHEDULED past its due time + offset. Since it never left SCHEDULED, no worker claimed it. - Related resources show it targeted work pool “ny4-worker-a” (docker) and queue “default”. This points to a worker not polling that queue, being at capacity, or otherwise unable to claim the run during that window. What it was not: - Not image pull/infrastructure startup: if a worker had claimed it, the run would have moved to PENDING/RUNNING and wouldn’t be SCHEDULED at Late time. - Not scheduler lag: the creation/scheduling happened within seconds of the boundary. Next checks to confirm and fix: - Ensure a worker was online and heartbeating to the “ny4-worker-a” work pool and subscribed to the “default” queue during 1400–1404 UTC. - Verify pool/queue names in the deployment match the worker’s configuration. - Check worker concurrency limits or pauses on the work pool/queue. - Review worker logs for claim attempts or errors in that timeframe. If you can share the worker logs or whether any other runs were picked up by “ny4-worker-a” around 14:00 UTC, I can help pinpoint whether it was lack of capacity vs. misconfiguration.

Mitch

10/15/2025, 4:23 PM

@Marvin Is this a valid cron schedule in my prefect.yaml

30-59 8 * * *, * 9-14 * * *, 0-30 15 * * *

Marvin

10/15/2025, 4:23 PM

✅ thought for 130.6 seconds

Marvin

10/15/2025, 4:26 PM

Short answer: No. That’s not a valid single cron string in prefect.yaml. You can’t join multiple full cron expressions with commas in one value. What to do instead - Define multiple cron schedules (one per expression) on the deployment. Example (Prefect 3.x):

Copy code

deployments:
- name: executions-strategy-intraday
  entrypoint: path/to/flow.py:flow
  work_pool:
    name: ny4-worker-a
  schedules:
    - cron: "30-59 8 * * *"     # 08:30–08:59
      timezone: "UTC"
    - cron: "* 9-14 * * *"      # 09:00–14:59
      timezone: "UTC"
    - cron: "0-30 15 * * *"     # 15:00–15:30
      timezone: "UTC"

Notes - Commas are only valid within a single field (e.g.,

0,15,30,45 * * * *

), not to separate whole expressions. - Prefect accepts standard 5-field cron. If you want second precision, use a 6-field form with seconds first (e.g.,

0 30-59 8 * * *

), but it’s optional. - Be sure to set the timezone you intend; default is typically UTC. Docs: Prefect 3 schedules. If your goal is “every minute from 08:30 to 15:30 local time,” the three entries above are the right approach. If you’re targeting a different window or timezone, tell me and I’ll adjust the config.

Mitch

10/15/2025, 4:28 PM

@Marvin We are seeing prefect issues where when the 9am cron schedule starts, it is late potentially due to the missing prefetch or conflicts due to another cron in the same flow. How can I fix this?

Marvin

10/15/2025, 4:28 PM

✅ thought for 197.8 seconds

Marvin

10/15/2025, 4:31 PM

Short answer: This is usually caused by either (a) overlapping cron expressions that both emit 09:00 for the same deployment, or (b) worker pickup/capacity at the top-of-hour surge. There’s nothing you need to “prefetch” in Prefect 3.x—runs are auto-generated by the scheduler—but overlapping schedules and insufficient capacity will leave runs in SCHEDULED and trigger Late. How to fix it 1) Make your cron windows mutually exclusive - If you have multiple cron entries on the same deployment, make sure they do not both produce 09:00. - Bad (overlaps at 09:00): -

* 8-9 * * *

and

* 9-14 * * *

→ both create a run at 09:00 - Good (half‑open windows that don’t overlap): -

30-59 8 * * *

# 0830–0859 only -

* 9-14 * * *

# 0900–1459 -

0-30 15 * * *

# 1500–1530 - In prefect.yaml, put each cron in its own schedule entry (don’t comma-join them in one string). Docs: Schedules in Prefect 3 2) Verify only one 09:00 flow run is being created - In the UI, open the deployment → Runs → filter around 0859–0901. - If you see two runs at 09:00 for the same deployment, you’ve got overlapping crons; fix per step 1. 3) Ensure a worker can claim runs at 09:00 - Confirm a worker was online and polling the correct work pool/queue at 09:00 (your earlier event showed pool “ny4-worker-a” and queue “default”). - If many deployments fire at 09:00, scale out workers or dedicate a queue/worker to this deployment to avoid pickup delays. - Check for pool/queue pauses and worker errors; a healthy worker should claim runs within seconds. Docs: Work pools and workers 4) Check concurrency settings and run length - If a previous minute’s run (e.g., 08:59) is still running at 09:00 and your deployment concurrency limit is 1, the 09:00 run will wait in SCHEDULED and can be marked Late. - Options: - Increase the deployment concurrency limit so 09:00 can start while 08:59 finishes, or - Ensure each minute’s run finishes before the next one begins, or - Narrow your schedule cadence (e.g., run every 2–5 minutes) to avoid overlaps. Docs: Concurrency limits 5) Sanity‑check schedule timing vs Late threshold - It’s normal for the initial SCHEDULED state to land a few seconds after the exact cron boundary; that alone doesn’t cause Late. - Late is emitted when a run stays SCHEDULED past a short grace period without being claimed—usually due to worker capacity or concurrency limits. If you can share: - Your exact list of cron entries for this deployment - Whether two runs appear at exactly 09:00 - The work pool/queue and number of workers online at 09:00 - Any deployment concurrency limit you’ve set …I can pinpoint whether this is an overlap vs. capacity issue and give you a concrete edit to your prefect.yaml.

32 Views

Open in Slack

Previous Next