Nick Torba
09/06/2025, 4:36 PM
Marvin
09/06/2025, 4:36 PM
Marvin
09/06/2025, 4:39 PM
prefect work-pool inspect "<your-prod-pool>"
- In the UI, open the Work Pool and check Recent activity for “rate limited” or infra submission errors.
- Docs: Work pools
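If the pool’s own concurrency limit turns out to be the cap, you can raise it or remove it (a sketch; 20 is just a placeholder value):
# raise the pool-level limit
prefect work-pool set-concurrency-limit "<your-prod-pool>" 20
# or remove the pool-level limit entirely
prefect work-pool clear-concurrency-limit "<your-prod-pool>"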
3) Deployment-level concurrency
- A deployment-level concurrency limit will cap runs regardless of the work pool’s limit:
prefect deployment inspect "<flow-name>/<deployment-name>"
- Look for `concurrency_limit` in the output (and whether this prod customer’s deployment is set lower than expected).
4) Global/named concurrency limits
- If your flow or tasks use a `concurrency_key`, a global limit on that key can force only 1–2 to run at a time:
prefect concurrency-limit ls
- If you find a limit that’s too low, you can adjust/reset it:
prefect concurrency-limit inspect "<tag>"
prefect concurrency-limit reset "<tag>"
- Docs: Global concurrency limits
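Note that `reset` only clears the active slots for a tag; to actually raise the limit, one approach (a sketch; 20 is a hypothetical new value) is to delete and re-create it:
prefect concurrency-limit delete "<tag>"
prefect concurrency-limit create "<tag>" 20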
AWS/ECS-side checks
Even if Prefect wants to start 20, ECS might only place 1–2 tasks due to capacity:
- Subnet IP exhaustion
- New Fargate tasks need an ENI/IP. Small private subnets or crowded subnets commonly cap you at a handful of tasks. In the ECS console, check if tasks are stuck in PENDING with events like “RESOURCE:ENI” or “unable to place task… no available IP addresses”.
- Fixes: add larger/more subnets across AZs, free up IPs, or reduce per-task ENIs.
- Fargate vCPU or account quotas
- If you’ve hit the Fargate On-Demand vCPU limit, new tasks will sit pending.
- Check Service Quotas and ECS events for “insufficient capacity” messages; request a quota increase or reduce per-task CPU/memory.
- Task definition sizing and placement
- Very large CPU/memory requests can make placement harder; try smaller task sizes if possible.
- Ensure you’re using multiple subnets across AZs for better placement.
- ECS task events and logs
- In the ECS console, check the cluster/service events during the “late” window.
- Look at CloudWatch logs for any failures pulling images (e.g., NAT/ECR access) or permission issues. These can throttle effective concurrency if tasks repeatedly fail to start.
- Prefect ECS guide: Prefect AWS ECS guide
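To pull the same task-level detail from the CLI instead of the console (a sketch; the cluster name and task ARN are placeholders):
# tasks ECS has been asked to run (includes tasks whose last status is still PENDING)
aws ecs list-tasks --cluster <your-cluster> --desired-status RUNNING
# recently stopped tasks, and why they stopped (image pull, ENI, or permission errors show up in stoppedReason)
aws ecs list-tasks --cluster <your-cluster> --desired-status STOPPED
aws ecs describe-tasks --cluster <your-cluster> --tasks <task-arn> \
  --query 'tasks[].{LastStatus:lastStatus,StoppedReason:stoppedReason}'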
Quick commands summary
- Work pool
prefect work-pool inspect "<your-prod-pool>"
- Deployment
prefect deployment inspect "<flow>/<deployment>"
- Global concurrency limits
prefect concurrency-limit ls
A few questions to narrow it down
- Are dev and prod in the same Prefect Cloud workspace/plan? What is the workspace “Concurrent runs” limit in prod?
- For the prod customer pool, can you share:
- Output of `prefect work-pool inspect "<pool-name>"`
- Output of `prefect concurrency-limit ls`
- The deployment name(s) and `prefect deployment inspect` output (specifically `concurrency_limit`)
- In ECS, are new tasks stuck in PENDING, and do you see events about ENIs/IPs or capacity? Which subnets does this pool use and how many free IPs do they typically have?
Marvin
09/06/2025, 4:39 PM
Nick Torba
09/06/2025, 4:46 PM
Nick Torba
09/06/2025, 4:47 PM
Marvin
09/06/2025, 4:47 PM
Marvin
09/06/2025, 4:48 PM
- Deployment concurrency limit: if the deployment has a `concurrency_limit` set, Prefect will only start that many flow runs for that deployment.
- Workspace concurrent runs cap (Prefect Cloud): A low org/workspace “Concurrent runs” limit will cap total active flow runs across the workspace.
- Flow-level acquisition of a concurrency key: If you explicitly acquire a concurrency slot at the beginning of the flow, that can delay effective work, but the flow run itself will still be created/submitted.
Given you’re seeing many Late runs with only 1–2 active, that usually points to one of:
- A deployment-level limit or workspace limit gating run starts.
- Work pool concurrency in use by other deployments.
- Infra provisioning/capacity issues in ECS (e.g., ENI/IP exhaustion or Fargate vCPU quota), causing submissions to fail/retry so runs stay Late.
Quick checks
- Deployment limit:
prefect deployment inspect "<flow-name>/<deployment-name>"
Look for `concurrency_limit`.
- Workspace concurrent runs (Cloud UI): Check the workspace plan/limits page for “Concurrent runs”.
- Work pool status:
prefect work-pool inspect "<your-prod-pool>"
See recent activity/errors and the effective concurrency.
- Global limits (just to confirm what’s defined):
prefect concurrency-limit ls
If you can share:
- The `prefect deployment inspect` output for this customer deployment,
- The `prefect work-pool inspect` output,
- A screenshot (or message text) from a Late run’s state history (e.g., “not started due to …”),
- Any ECS task events during the backlog window (pending tasks, ENI/IP errors, capacity messages),
…I can pinpoint whether this is a Prefect concurrency gate or an ECS capacity issue and suggest the exact fix.
Refs:
- Work pools: Work pools
- Global concurrency limits: Global concurrency limits
- ECS guide: Prefect AWS ECS guide
Nick Torba
09/06/2025, 11:53 PM
Marvin
09/06/2025, 11:53 PM
Marvin
09/06/2025, 11:53 PM
prefect concurrency-limit ls
prefect concurrency-limit inspect "<key>"
- Check a flow run’s task state history for “waiting for concurrency slot” messages.
- If needed, temporarily raise or reset a limit:
prefect concurrency-limit reset "<key>"
Given you’re seeing Late runs with only 1–2 flows running, that points to flow submission being throttled elsewhere (workspace/deployment/work pool/ECS capacity), not task limits. I’d next check:
prefect work-pool inspect "<pool>"
prefect deployment inspect "<flow>/<deployment>"
And confirm your workspace “Concurrent runs” limit in the Cloud UI.
Refs:
- Work pools: Work pools
- Global concurrency limits: Global concurrency limits
- ECS guide: Prefect AWS ECS guide
If you can share the `work-pool inspect`, `deployment inspect`, and `concurrency-limit ls` output for prod, I’ll help pinpoint the exact bottleneck.
Nick Torba
09/06/2025, 11:55 PM
Nick Torba
09/07/2025, 12:09 AM
Marvin
09/07/2025, 12:09 AM
Marvin
09/07/2025, 12:11 AM
prefect work-queue ls -p "<your-prod-pool>" -v
- Inspect the specific queue your deployment uses (check the deployment’s `work_queue_name`):
prefect work-queue inspect "<queue-name>" -p "<your-prod-pool>"
If you find a low limit, raise it:
prefect work-queue set-concurrency-limit "<queue-name>" 20
Preview what the queue is planning to run (helps spot blocked runs):
prefect work-queue preview "<queue-name>" -p "<your-prod-pool>" -h 2
A few clarifications
- Task concurrency limits: do not block flow submission; they only gate tasks inside a running flow.
- Deployment concurrency limit: will cap the number of flow runs for that deployment, regardless of pool/queue settings.
- Workspace concurrent runs cap (in Prefect Cloud): can cap total active flow runs across the workspace.
- Infra capacity (ECS): if ECS can only place 1–2 tasks (IP exhaustion, vCPU quota), Prefect will appear to “trickle start” runs. Check ECS task events for Pending/ENI/capacity errors.
If you can share:
- `prefect work-queue ls -p "<pool>" -v`
- `prefect work-queue inspect "<queue>" -p "<pool>"`
- `prefect deployment inspect "<flow>/<deployment>"` (to see `work_queue_name` and any `concurrency_limit`)
- Any ECS task events during the backlog
…I can pinpoint whether this is a queue concurrency cap, deployment/workspace cap, or ECS capacity issue and suggest the exact fix.
Refs:
- Work pools: Work pools
- Global concurrency limits: Global concurrency limits
- ECS guide: Prefect AWS ECS guide
Nick Torba
09/07/2025, 12:13 AM
Marvin
09/07/2025, 12:13 AM
Marvin
09/07/2025, 12:14 AM
prefect work-pool inspect "workpool-ecs"
- Inspect the deployment(s) for a concurrency limit and queue target:
prefect deployment inspect "<flow-name>/<deployment-name>"
- Confirm your workspace “Concurrent runs” limit in Prefect Cloud (Settings -> Usage & limits).
- In AWS, check the ECS cluster/service events during the backlog: are new tasks stuck in PENDING with ENI/IP or capacity messages? That’s the most common cause of 1–2 active runs despite a higher Prefect limit.
If you can share:
- The output of prefect work-pool inspect "workpool-ecs"
- The output of prefect deployment inspect "<flow>/<deployment>"
- A snippet from one of the Late runs’ state messages
- Any ECS task events (PENDING reasons)
…I can pinpoint the bottleneck and suggest exact fixes.
Refs:
- Work pools: Work pools
- Global concurrency limits: Global concurrency limits
- ECS guide: Prefect AWS ECS guide
Nick Torba
09/07/2025, 12:16 AM
Nick Torba
09/07/2025, 12:17 AM
Marvin
09/07/2025, 12:17 AM
Marvin
09/07/2025, 12:17 AM
This is a push work pool (`type='ecs:push'`). For push pools:
- Work queues aren’t polled, so a long queue won’t slow submission.
- Work-queue concurrency limits don’t apply.
- The 400 on preview is expected for push pools.
Where to look next
- Pool-level concurrency
- The `work-pool inspect` output should show `concurrency_limit`. Could you scroll up/down and paste that field? If it’s set to 1–2, that would explain the behavior. If it’s `None`, the pool isn’t capping you.
- Deployment-level concurrency
- Please inspect the deployment(s) for this customer:
prefect deployment inspect "<flow-name>/<deployment-name>"
- Look for `concurrency_limit`. If it’s set low (e.g., 1–2), only that many runs will start regardless of pool capacity.
- Workspace concurrent runs (Cloud limit)
- In the Prefect Cloud UI, check Settings -> Usage & limits. If your prod workspace has a low “Concurrent runs” cap, you’ll see exactly this symptom (1–2 active with many Late).
- ECS capacity and events (most common with push pools)
- If ECS can only place 1–2 tasks (e.g., due to IP/ENI exhaustion in subnets or Fargate vCPU quota), Prefect will appear to “trickle” run starts.
- In the ECS console (cluster -> Events), during the backlog window, do you see tasks stuck in PENDING with:
- “RESOURCE:ENI” / “no available IP addresses”
- “insufficient vCPU/memory capacity”
- What CPU/memory are you using in prod vs dev? Your base template defaults show CPU=1024, Memory=2048, but if your prod job variables request larger resources, fewer tasks can be placed.
- Which subnets is this pool using? Small or crowded private subnets frequently cap concurrency. Check “Available IPv4 addresses” on those subnets; if low, add larger/more subnets across AZs.
- Also check your region’s Fargate On-Demand vCPU quota (Service Quotas). If each task uses 1 vCPU and you have a 16–32 vCPU quota, you’ll only get 16–32 tasks across the account/region.
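If you want to check that quota from the CLI, a sketch (using the Service Quotas service code for Fargate):
aws service-quotas list-service-quotas --service-code fargate \
  --query 'Quotas[].{Name:QuotaName,Value:Value}'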
To help pinpoint, could you share:
- From `prefect work-pool inspect "workpool-ecs"`: the `concurrency_limit` value and any “Recent activity” messages.
- From `prefect deployment inspect "<flow>/<deployment>"`: confirm `concurrency_limit` and `work_queue_name`.
- A snippet from a Late run’s state history (any “failed to submit infrastructure” messages).
- From ECS: whether tasks are PENDING and the exact event reasons.
References:
- Work pools: Work pools
- Global concurrency limits (for completeness): Global concurrency limits
- ECS tips: Prefect AWS ECS guide
Nick Torba
09/07/2025, 12:18 AM
Marvin
09/07/2025, 12:18 AM
Marvin
09/07/2025, 12:19 AM
prefect deployment inspect "<flow-name>/<deployment-name>"
- Look for `concurrency_limit` and confirm it’s not set to 1–2.
3) ECS placement capacity (likely)
- When the backlog occurs, in ECS (Cluster -> Events), do you see tasks stuck in PENDING with messages like:
- “unable to place task … RESOURCE:ENI” or “no available IP addresses”
- “insufficient vCPU/memory capacity”
- If using Fargate Spot: “capacity not available”
- These indicate:
- Subnet IP exhaustion: small/crowded private subnets cap you at a handful of tasks. Add larger/more subnets across AZs and ensure the pool’s `network_configuration` includes them.
- Fargate vCPU quota limits: request a quota increase or reduce per-task CPU/memory.
- Quick subnet IP check (replace with your subnet IDs):
aws ec2 describe-subnets --subnet-ids subnet-abc subnet-def \
--query 'Subnets[].{SubnetId:SubnetId,AvailableIPs:AvailableIpAddressCount}'
A couple of targeted questions to pinpoint:
- What is your workspace “Concurrent runs” limit in prod?
- Output of:
prefect deployment inspect "<flow>/<deployment>"
- From ECS during the backlog: a few cluster/service event lines (especially PENDING reasons).
- Roughly how many subnets are in your `network_configuration` and their available IP counts? Are they spread across multiple AZs?
- Are you on Fargate Spot for this pool?
If you can share those, I’ll map the exact fix (e.g., raise the workspace limit, adjust deployment concurrency, add subnets, or request a Fargate quota increase). For a quick isolation test, try kicking off ~20 tiny runs (e.g., a no-op flow) on this same pool; if ECS only places 1–2 tasks, it’s definitely an ECS capacity issue, not Prefect.
Nick Torba
09/07/2025, 12:19 AM
> Also check your region’s Fargate On-Demand vCPU quota (Service Quotas). If each task uses 1 vCPU and you have a 16–32 vCPU quota, you’ll only get 16–32 tasks across the account/region.
but this would actually be fine. The problem is that it spends a lot of time with no runs running.
Marvin
09/07/2025, 12:19 AM
Marvin
09/07/2025, 12:22 AM
prefect flow-run ls --state LATE --limit 5
prefect flow-run inspect <run-id>
prefect flow-run logs <run-id>
Look for messages like “Encountered error starting infrastructure”, “rate limited”, “failed to create ECS task”, or repeated submission retries.
2) Work Pool activity (Cloud UI)
- Open the work pool -> Activity. Do you see errors around the times when nothing is running? Copy a couple of those messages if present.
3) ECS cluster/service events during the quiet window
- In the ECS console, check Cluster -> Events for the relevant cluster. Look for:
- “unable to place task… no available IP addresses” (subnet IP/ENI exhaustion)
- “insufficient vCPU/memory capacity”
- “FARGATE_SPOT capacity not available”
- Image pull or permission errors (ECR/NAT)
These exactly correlate with “submission attempts but nothing runs.”
A few targeted configuration checks
- Capacity provider strategy: Are you using Spot-only?
- If yes, intermittent capacity can produce long gaps with zero placements. Try mixing On-Demand as fallback:
capacityProviderStrategy:
  - capacityProvider: FARGATE
    base: 1
    weight: 1
  - capacityProvider: FARGATE_SPOT
    weight: 1
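Whether you can set this depends on the fields your pool’s base job template exposes (this varies by prefect-aws version). As a sketch, you can dump the default ECS worker template and compare it against your pool’s template on the work pool’s page in the UI:
prefect work-pool get-default-base-job-template --type ecs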
- Network configuration (subnets/security groups):
- Are you pinned to 1 small private subnet? Check available IPs:
aws ec2 describe-subnets --subnet-ids <subnet-ids> --query 'Subnets[].{SubnetId:SubnetId,AvailableIPs:AvailableIpAddressCount}'
- If low, add more/larger subnets across multiple AZs and include them in the pool’s `network_configuration`.
- Task start timing:
- If tasks are created but never start, Prefect will wait up to `task_start_timeout_seconds` before marking failure and retrying. A high value here plus repeated placement failures can make it look like nothing’s running. Consider reducing this timeout temporarily to fail fast while you diagnose.
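One quick way to test a lower timeout without editing the deployment (assuming your Prefect version supports per-run job variable overrides and your template exposes `task_start_timeout_seconds`):
# kick off a single run with a lower start timeout so placement failures surface faster
prefect deployment run "<flow-name>/<deployment-name>" --job-variable task_start_timeout_seconds=120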
Also double-check the non-infra gates
- Workspace “Concurrent runs” (Cloud UI -> Settings -> Usage & limits).
- `prefect deployment inspect "<flow>/<deployment>"` for this customer: confirm `concurrency_limit` is not set very low.
If you can share:
- A “Late” flow run’s inspect/logs output,
- A few lines from the Work Pool Activity at the time of the quiet window,
- ECS cluster events (PENDING reasons) and whether you’re using FARGATE_SPOT,
- Your pool’s `network_configuration` subnets and their available IP counts,
…I’ll map this to a concrete fix (e.g., add subnets, adjust capacity providers, tweak timeouts, or resolve throttling/permissions).
References:
- Work pools: Work pools
- Global concurrency limits (for completeness): Global concurrency limits
- ECS guide and tips: Prefect AWS ECS guide
Nick Torba
09/07/2025, 12:23 AM
Marvin
09/07/2025, 12:23 AM
Marvin
09/07/2025, 12:26 AM
09/07/2025, 12:26 AMconcurrency_limit
or concurrency_options
on the deployment will prevent submission even if the pool can run more. Verify on the exact prod deployment:
prefect deployment inspect "<flow-name>/<deployment-name>"
Look for `concurrency_limit` (and `concurrency_options` if set).
- Pool concurrency slots “occupied” by phantom/older runs: If the backend thinks slots are in use (e.g., long-running RUNNING/PENDING runs that aren’t actually active in ECS), it won’t submit more. In the Work Pool UI, check the “Active runs” count during the quiet window — does it show non-zero even though ECS shows 0? If yes, find those runs and clear/fix them.
- Automations/pauses: A paused deployment/schedule won’t generate new due runs, but you said you already have Late runs so this is less likely. Still, confirm the deployment isn’t paused and there aren’t automations pausing things.
Concrete next steps to pinpoint the gate
1) Inspect a few Late runs for the state reason
- Grab a few Late runs and check their state history/logs:
prefect flow-run ls --state LATE --limit 5
prefect flow-run inspect <run-id>
prefect flow-run logs <run-id>
Look for messages like “not started due to concurrency limit” or any orchestration reason. If there’s no infra attempt, you won’t see ECS errors — you’ll see scheduling/concurrency reasons.
2) Check the deployment’s gating settings
prefect deployment inspect "<flow-name>/<deployment-name>"
- Share whether `concurrency_limit` is set and the `work_pool_name` matches `workpool-ecs`.
3) Confirm workspace concurrent runs
- In Cloud UI -> Settings -> Usage & limits: what is “Concurrent runs” and how many are in use during the quiet periods? Are other flows in the workspace consuming the slots?
4) Check whether the pool believes slots are in use
- In the Work Pool UI, check “Active runs” and “Recent activity” at the time nothing is running in ECS. If the pool shows active runs or “skipping due to concurrency limit,” that’s our smoking gun.
- As a safe test, you can toggle the pool concurrency to refresh it:
prefect work-pool set-concurrency-limit "workpool-ecs" 31
prefect work-pool set-concurrency-limit "workpool-ecs" 32
This won’t cancel runs; it just resets the limit value and can clear a stale slot counter if that’s the issue.
5) Optional: Preview due work for the pool
- Unlike queue preview, `work-pool preview` may work for push pools:
prefect work-pool preview "workpool-ecs" -h 2
If it returns due runs but nothing is being submitted and no infra is attempted, we’re definitely hitting an orchestration cap.
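And to cross-check the “slots occupied by phantom runs” idea from step 4, list what Prefect currently considers active and compare it with the ECS console (a sketch):
prefect flow-run ls --state RUNNING --limit 20
prefect flow-run ls --state PENDING --limit 20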
If you can share:
- Output of `prefect deployment inspect "<flow>/<deployment>"` (especially `concurrency_limit` and `work_pool_name`)
- A Late run’s `prefect flow-run inspect <id>` state message
- Workspace “Concurrent runs” limit and how many were in use at the time
- Screenshot/text of Work Pool “Active runs” and “Recent activity” during the quiet window
…I’ll map this to an exact fix (raise workspace limit, adjust deployment concurrency, clear/resolve stale runs, or similar).
References:
- States and reasons: States
- Work pools: Work pools
- Schedules: Schedules