# marvin-ai
h
Hey @Marvin! Having a concurrency nightmare with Prefect 3.4.11 and need help.

Problem: 20-30 flows are running when deployment limits should cap the total at 10.

My setup:
• Prefect Server 3.4.11 in Docker on EC2
• PostgreSQL on RDS
• ECS work pool (push work pool, spawns a Fargate task per flow)
• FastAPI app triggering flows programmatically using `create_flow_run_from_deployment()`

What's happening:
• The deployment YAML has `concurrency_limit: 10` configured
• ECS keeps spawning Fargate tasks without respecting the limit

I have tried adding a concurrency limit on the work pool instead of the deployments, but then the runs never even reach Pending/Running; they get stuck in Late. Ideally I need a solution where the concurrency limits are respected.
m
thought for 752.3 seconds
Oof, that's frustrating. Let's get this under control.

High-level expectations in Prefect 3.x:
- Deployment-level concurrency is enforced by the server. When the limit is reached, new runs should go to an AwaitingConcurrencySlot (SCHEDULED) state and only start when a slot frees up.
- This applies regardless of how runs are created (including `create_flow_run_from_deployment()`).
- Work pool concurrency limits cap the total number of active runs launched from that pool (across deployments). When at capacity, additional runs sit in SCHEDULED and will often show as Late if their scheduled time has already passed; that's expected, and they should kick off when a slot opens.

Docs for reference:
- States overview (incl. AwaitingConcurrencySlot and Late): States
- Work pools: Work pools
- API: create flow run from deployment: API: create_flow_run_from_deployment
- ECS push pool guide: ECS guide

What likely explains 20-30 runs when the deployment limit is 10:
1) The limit isn't actually set on the deployment you're triggering (e.g., a different deployment is being used, or the YAML didn't apply).
2) You may have multiple deployments being triggered; the concurrency limit is per deployment, not global. Two deployments each capped at 10 can yield ~20 running.
3) You're on 3.4.11, and race-condition fixes and related improvements to the concurrency lease system have shipped since then: "Fix race conditions in concurrency lease renewal" and "Fix double-release of concurrency limiter token" landed in 3.4.24. See release notes: Prefect 3.4.24 release.

What explains "stuck in Late" when you set a work pool concurrency limit:
- That is the normal state while runs wait for a pool slot after their scheduled time has passed. They should transition to Pending/Running automatically when a slot frees up.
- If they remain Late indefinitely, it usually means either:
  - the pool limit is set lower than the number of already-running runs and never frees up, or
  - the server's background services aren't running properly (these services are responsible for progressing SCHEDULED runs). If you're running Server in Docker, make sure you're starting it with services enabled; the default `prefect server start` manages that.

Quick checks to run now:
1) Verify the deployment you're triggering truly has the limit set:
```
prefect deployment inspect <flow-name>/<deployment-name>
```
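If you'd rather confirm it programmatically (for example, from the same environment as your FastAPI app), here is a rough sketch that just dumps any concurrency-related fields from the API response; the deployment name is a placeholder and exact field names can vary slightly across 3.x patches:
```
import asyncio

from prefect.client.orchestration import get_client

async def check(deployment_name: str = "my-flow/prod") -> None:
    async with get_client() as client:
        deployment = await client.read_deployment_by_name(deployment_name)
        # Print only the concurrency-related fields of the API response
        fields = deployment.model_dump()
        print({k: v for k, v in fields.items() if "concurrency" in k})

if __name__ == "__main__":
    asyncio.run(check())
```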
Confirm there is a concurrency limit and that the collision strategy is ENQUEUE (queue when at limit). If you don't see it, your YAML didn't apply or you're calling a different deployment.
2) Confirm how many deployments you're triggering; if more than one, remember the limit is per deployment.
3) Inspect a couple of the runs that started beyond the limit:
```
prefect flow-run inspect <flow-run-id>
```
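Related: to see how many runs are actually in a Running state for that deployment at any moment, a rough client-side sketch (the deployment ID is a placeholder; the result is capped at the API's default page size):
```
import asyncio
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    DeploymentFilter,
    DeploymentFilterId,
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType

async def count_running(deployment_id: UUID) -> int:
    async with get_client() as client:
        # Filter flow runs to this deployment and the RUNNING state type
        runs = await client.read_flow_runs(
            deployment_filter=DeploymentFilter(
                id=DeploymentFilterId(any_=[deployment_id])
            ),
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                )
            ),
        )
    return len(runs)

if __name__ == "__main__":
    print(asyncio.run(count_running(UUID("00000000-0000-0000-0000-000000000000"))))
```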
If the runs you inspect are going straight to Running when the limit is already full, that points to a lease issue in the server version.
4) Validate your server and client versions. The server is 3.4.11; are your FastAPI client and any workers also 3.4.11? Mismatches can be risky for newer features like deployment concurrency.

Recommended next steps:
- Upgrade Prefect Server to at least 3.4.24 to pick up the concurrency lease fixes (the release notes list the relevant changes: 3.4.24).
- Recreate or redeploy the deployment with an explicit concurrency config to be sure it's applied. In code (Flow.deploy):
```
from prefect import flow
from prefect.client.schemas.objects import ConcurrencyLimitConfig

@flow
def my_flow():
    ...

# ENQUEUE will queue new runs once the limit is reached
my_flow.deploy(
    name="prod",
    work_pool_name="my-ecs-pool",
    concurrency_limit=ConcurrencyLimitConfig(limit=10, collision_strategy="ENQUEUE"),
)
```
Or in prefect.yaml, ensure `concurrency_limit` is present for the specific deployment. Then run `prefect deploy` and recheck with `prefect deployment inspect`.
- If you need a global cap across everything in the pool (not just one deployment), use the work pool concurrency limit set to 10. Expect queued runs to show Late until a slot opens; they should start automatically. If they don't, check the server service logs.

A few clarifying questions to pinpoint the issue. Can you share:
- The output of `prefect deployment inspect <flow>/<deployment>` for the affected deployment?
- A small snippet of how your FastAPI app calls `create_flow_run_from_deployment()`?
- Whether there are multiple deployments being triggered concurrently (names)?
- Whether you've set any work queue concurrency limits in addition to the pool limit?
- How you're starting Prefect Server in Docker (command/compose) so we can confirm services are running?

If you want, I can help you test with a tiny repro: set a flow to sleep for 60s, deploy with `concurrency_limit=2`, then trigger 10 runs quickly and verify that only 2 run while the rest enter AwaitingConcurrencySlot. That will confirm the server is enforcing limits.
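A rough sketch of that repro, assuming the deployment already exists as `sleepy/repro` with the limit applied (all names here are placeholders):
```
import asyncio
import time

from prefect import flow
from prefect.client.orchestration import get_client

@flow
def sleepy():
    # Long enough to observe several concurrent runs in the UI
    time.sleep(60)

async def trigger_many(n: int = 10):
    async with get_client() as client:
        deployment = await client.read_deployment_by_name("sleepy/repro")
        for _ in range(n):
            await client.create_flow_run_from_deployment(deployment_id=deployment.id)

if __name__ == "__main__":
    asyncio.run(trigger_many())
```
With the limit enforced you should see 2 runs go Running and the other 8 sit in AwaitingConcurrencySlot until slots free up.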
h
@Marvin My worker seems to be using the image `prefecthq/prefect:3.4.17-python3.12`. Could the version mismatch be an issue here? All of the checks above seem to be fine.
m
thought for 59.5 seconds
Short answer: the 3.4.11 (server) vs 3.4.17 (flow image) mismatch is unlikely to be the reason deployment-level concurrency is ignored. Deployment concurrency is enforced by the server before any ECS tasks are launched. However, there were important concurrency/lease fixes after both of those versions, so upgrading the server is the most impactful step.

Why the mismatch isn't the root cause:
- In Prefect 3.x, deployment-level concurrency is enforced at run creation/scheduling on the server, not inside your ECS container. A newer runtime image won't cause the server to start extra runs.
- Work-pool concurrency is also enforced by the server.

Relevant fixes shipped after 3.4.17/3.4.11:
- Race-condition and lease handling fixes that can affect concurrency enforcement: "Fix race conditions in concurrency lease renewal" and "Fix double-release of concurrency limiter token" in 3.4.24.
- Release notes: 3.4.24 release

What I recommend:
1) Upgrade your Prefect Server to at least 3.4.24 (or the latest 3.4.x). Then align your FastAPI app's Prefect package and your ECS image to the same patch to reduce variables.
2) Re-verify the deployment actually has the limit and is using ENQUEUE:
```
prefect deployment inspect <flow-name>/<deployment-name>
```
You should see the concurrency limit and the strategy.
3) Sanity test with a repro:
- Create a trivial flow that sleeps 60s
- Deploy it with `concurrency_limit=2` (ENQUEUE)
- Trigger 10 runs quickly (via your FastAPI app or the CLI)
- Expect 2 Running and 8 in AwaitingConcurrencySlot (SCHEDULED); ECS should only spin up 2 tasks.
4) If you also use a work pool concurrency limit and see "Late" runs: that's expected while they wait for a pool slot. They should move to Pending/Running as slots free up. If they remain Late indefinitely, double-check that the server is running its background services properly in your Docker setup.

Handy references:
- States (AwaitingConcurrencySlot, Late): States
- Work pools: Work pools
- API for creating runs from a deployment (respects deployment concurrency): create_flow_run_from_deployment
- ECS guide: ECS guide

If you can share:
- The output of `prefect deployment inspect <flow>/<deployment>`
- How you're starting Prefect Server in Docker (compose/command)
- A snippet of your `create_flow_run_from_deployment()` call (a typical call is sketched below)
- Whether multiple deployments are being triggered

…I can help pinpoint whether there's anything else at play. My bet is that the server upgrade will fix the enforcement, and aligning versions everywhere will make behavior predictable.
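For reference, a typical trigger call from a FastAPI endpoint looks roughly like this (a sketch only; the deployment name, route, and parameters are placeholders):
```
from fastapi import FastAPI
from prefect.client.orchestration import get_client

app = FastAPI()

@app.post("/trigger")
async def trigger(payload: dict):
    async with get_client() as client:
        # Look up the deployment, then create a run against it
        deployment = await client.read_deployment_by_name("my-flow/prod")
        flow_run = await client.create_flow_run_from_deployment(
            deployment_id=deployment.id,
            parameters=payload,
        )
    return {"flow_run_id": str(flow_run.id)}
```
If your call differs materially from that shape (for example, it creates runs against several deployments), that detail will help narrow things down.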
h
@Marvin For the worker, what is the proper image to use now instead of `prefecthq/prefect:3.4.17-python3.12`?
m
thought for 30.5 seconds
Short answer: pin to a specific 3.4.x patch that includes the concurrency fixes, e.g. `prefecthq/prefect:3.4.24-python3.12`, or the latest 3.4.x patch with your preferred Python version. Tags are published here: Docker Hub: prefecthq/prefect.

Notes:
- In Prefect 3 ECS "push" work pools there isn't a separate long-running worker container; this image is the runtime image for your flow runs (set on the deployment or via job variables).
- It's best to align your server, your FastAPI client package, and your runtime image to the same 3.4.x patch for predictability. I recommend 3.4.24+ because it includes the concurrency lease fixes.
- If your flows use AWS SDKs or Prefect AWS integrations, bake them into the image:
```
FROM prefecthq/prefect:3.4.24-python3.12
RUN pip install -U prefect-aws boto3
```
How to set the image:
- In code:
```
my_flow.deploy(
    name="prod",
    work_pool_name="my-ecs-pool",
    image="prefecthq/prefect:3.4.24-python3.12",
)
```
- Or in prefect.yaml:
```
deployments:
  - name: prod
    work_pool: { name: my-ecs-pool }
    image: prefecthq/prefect:3.4.24-python3.12
```
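If you'd rather not bake the image into the deployment config, the runtime image can also be supplied via job variables. A rough per-run override sketch (assumes your ECS work pool's job template exposes an `image` variable, which the stock ECS template typically does; the tag shown is a placeholder):
```
from prefect.client.orchestration import get_client

async def trigger_with_image_override(deployment_id):
    async with get_client() as client:
        return await client.create_flow_run_from_deployment(
            deployment_id=deployment_id,
            # Per-run override of the runtime image used by the ECS task
            job_variables={"image": "prefecthq/prefect:3.4.24-python3.12"},
        )
```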
If you want me to sanity‑check the exact tag you’re considering, share it and I’ll confirm it’s published.