# ask-marvin
n
Hello @Marvin I have a Push work pool to ECS. I have tried submitting hundreds of flows at a time in our development environment, and my push work pool handles it fine. However, in prod, I have a push work pool for a specific customer. The concurrency limit on flows is 20, but often I find that I will be stuck with hundreds of late runs with only one or two flows running at a time, and I can't figure out what has changed that is causing the issue on this production work pool. Can you help me figure this out?
m
thought for 205.6 seconds
Happy to help debug — this pattern (hundreds of late runs with only 1–2 active) almost always comes down to a concurrency bottleneck somewhere other than the work pool’s limit. Below is a quick triage checklist for Prefect 3.x push work pools on ECS, plus a few targeted questions. If you can share the outputs, we can pinpoint the cause fast.

Prefect-level checks

1) Workspace concurrent run limit (Prefect Cloud)
- If your prod workspace has a lower “concurrent runs” limit than dev, you’ll only see that many flow runs active across the entire workspace.
- Please confirm whether dev and prod are the same workspace/plan. In the Cloud UI, check your workspace’s usage/limits page for the “Concurrent runs” cap.

2) Work pool settings and activity
- Verify the pool’s concurrency and look for recent errors in the activity feed:
```
prefect work-pool inspect "<your-prod-pool>"
```
- In the UI, open the Work Pool and check Recent activity for “rate limited” or infra submission errors.
- Docs: Work pools

3) Deployment-level concurrency
- A deployment-level concurrency limit will cap runs regardless of the work pool’s limit:
```
prefect deployment inspect "<flow-name>/<deployment-name>"
```
- Look for `concurrency_limit` in the output (and whether this prod customer’s deployment is set lower than expected).

4) Global/named concurrency limits
- If your flow or tasks use a `concurrency_key`, a global limit on that key can force only 1–2 to run at a time:
```
prefect concurrency-limit ls
```
- If you find a limit that’s too low, you can adjust/reset it:
```
prefect concurrency-limit inspect "<tag>"
prefect concurrency-limit reset "<tag>"
```
- Docs: Global concurrency limits

AWS/ECS-side checks

Even if Prefect wants to start 20, ECS might only place 1–2 tasks due to capacity:
- Subnet IP exhaustion
  - New Fargate tasks need an ENI/IP. Small private subnets or crowded subnets commonly cap you at a handful of tasks. In the ECS console, check if tasks are stuck in PENDING with events like “RESOURCE:ENI” or “unable to place task… no available IP addresses”.
  - Fixes: add larger/more subnets across AZs, free up IPs, or reduce per-task ENIs.
- Fargate vCPU or account quotas
  - If you’ve hit the Fargate On-Demand vCPU limit, new tasks will sit pending.
  - Check Service Quotas and ECS events for “insufficient capacity” messages; request a quota increase or reduce per-task CPU/memory.
- Task definition sizing and placement
  - Very large CPU/memory requests can make placement harder; try smaller task sizes if possible.
  - Ensure you’re using multiple subnets across AZs for better placement.
- ECS task events and logs
  - In the ECS console, check the cluster/service events during the “late” window (see the AWS CLI sketch after the command summary below).
  - Look at CloudWatch logs for any failures pulling images (e.g., NAT/ECR access) or permission issues. These can throttle effective concurrency if tasks repeatedly fail to start.
- Prefect ECS guide: Prefect AWS ECS guide

Quick commands summary
- Work pool
```
prefect work-pool inspect "<your-prod-pool>"
```
- Deployment
```
prefect deployment inspect "<flow>/<deployment>"
```
- Global concurrency limits
```
prefect concurrency-limit ls
```
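Since several of the ECS-side checks above live in the AWS console, here is a rough CLI equivalent; a hedged sketch, assuming Fargate tasks launched by the push pool and placeholder values for `<cluster>` and `<task-arn>`:

```
# tasks ECS has been asked to run in this cluster (their lastStatus may still be PENDING)
aws ecs list-tasks --cluster "<cluster>" --desired-status RUNNING

# recently stopped tasks, and why they stopped (image pull, ENI/IP, capacity, permissions)
aws ecs list-tasks --cluster "<cluster>" --desired-status STOPPED --query 'taskArns' --output text
aws ecs describe-tasks --cluster "<cluster>" --tasks "<task-arn>" \
  --query 'tasks[].{LastStatus:lastStatus,StopCode:stopCode,StoppedReason:stoppedReason}'
```

If `StoppedReason` keeps pointing at ENIs/IPs or capacity, that lines up with the subnet and quota items above.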
A few questions to narrow it down
- Are dev and prod in the same Prefect Cloud workspace/plan? What is the workspace “Concurrent runs” limit in prod?
- For the prod customer pool, can you share:
  - Output of `prefect work-pool inspect "<pool-name>"`
  - Output of `prefect concurrency-limit ls`
  - The deployment name(s) and `prefect deployment inspect` output (specifically `concurrency_limit`)
- In ECS, are new tasks stuck in PENDING, and do you see events about ENIs/IPs or capacity? Which subnets does this pool use and how many free IPs do they typically have?
If you paste those details (or screenshots of the UI pages), I’ll help pinpoint the bottleneck and propose exact fixes.
n
They get stuck in LATE, not PENDING. They are never submitted to the infra (or they are, but after a very very long time)
I do have some task concurrency limits, but that shouldn't affect the actual submission of flows, right? @Marvin
m
thought for 56.2 seconds
Short answer: you’re right — standard task-level concurrency limits do not prevent Prefect from submitting/starting flow runs in a push work pool. They throttle tasks inside a running flow, not the number of flow runs the pool launches.

When could “concurrency” affect flow submission?
- Deployment concurrency limit: If the deployment has `concurrency_limit` set, Prefect will only start that many flow runs for that deployment.
- Workspace concurrent runs cap (Prefect Cloud): A low org/workspace “Concurrent runs” limit will cap total active flow runs across the workspace.
- Flow-level acquisition of a concurrency key: If you explicitly acquire a concurrency slot at the beginning of the flow, that can delay effective work, but the flow run itself will still be created/submitted.

Given you’re seeing many Late runs with only 1–2 active, that usually points to one of:
- A deployment-level limit or workspace limit gating run starts.
- Work pool concurrency in use by other deployments.
- Infra provisioning/capacity issues in ECS (e.g., ENI/IP exhaustion or Fargate vCPU quota), causing submissions to fail/retry so runs stay Late.

Quick checks
- Deployment limit:
```
prefect deployment inspect "<flow-name>/<deployment-name>"
```
Look for `concurrency_limit`.
- Workspace concurrent runs (Cloud UI): Check the workspace plan/limits page for “Concurrent runs”.
- Work pool status:
```
prefect work-pool inspect "<your-prod-pool>"
```
See recent activity/errors and the effective concurrency.
- Global limits (just to confirm what’s defined):
```
prefect concurrency-limit ls
```
If you can share:
- The `prefect deployment inspect` output for this customer deployment,
- The `prefect work-pool inspect` output,
- A screenshot (or message text) from a Late run’s state history (e.g., “not started due to …”),
- Any ECS task events during the backlog window (pending tasks, ENI/IP errors, capacity messages),
…I can pinpoint whether this is a Prefect concurrency gate or an ECS capacity issue and suggest the exact fix.

Refs:
- Work pools: Work pools
- Global concurrency limits: Global concurrency limits
- ECS guide: Prefect AWS ECS guide
n
@Marvin Could concurrency limits block the submission of tasks?
m
thought for 37.9 seconds
Great question. Short answer: task-level concurrency limits do not block the submission of flow runs by a push work pool. They only gate when individual task runs inside an already-started flow can begin.

What they do affect
- Once a flow run is running, any task with a concurrency key will wait for a free slot before starting.
- In the UI you’ll see flow runs in Running, but specific task runs in a waiting state (e.g., “waiting for concurrency slot”).
- Logs will show messages like “Waiting for concurrency slot for '<key>'”.

What they do not affect
- They do not prevent the work pool from launching flow runs (so they won’t cause Late flow runs).
- They do not reduce the number of flow runs the pool attempts to start. That’s controlled by:
  - Workspace concurrent runs cap (Cloud plan)
  - Work pool concurrency
  - Deployment-level concurrency
  - Infra capacity (ECS/Fargate IPs/quotas)

How to verify whether concurrency limits are affecting task starts
- List limits and see which keys are saturated:
```
prefect concurrency-limit ls
prefect concurrency-limit inspect "<key>"
```
- Check a flow run’s task state history for “waiting for concurrency slot” messages.
- If needed, temporarily raise or reset a limit:
```
prefect concurrency-limit reset "<key>"
```
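If you need to raise (not just reset) a tag-based limit, I don't believe there is an in-place update for these, so one option is delete-and-recreate; a minimal sketch using a hypothetical tag `db-write`:

```
# remove the existing limit for the tag, then recreate it with a higher value
prefect concurrency-limit delete "db-write"
prefect concurrency-limit create "db-write" 20
```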
Given you’re seeing Late runs with only 1–2 flows running, that points to flow submission being throttled elsewhere (workspace/deployment/work pool/ECS capacity), not task limits. I’d next check:
```
prefect work-pool inspect "<pool>"
prefect deployment inspect "<flow>/<deployment>"
```
And confirm your workspace “Concurrent runs” limit in the Cloud UI.

Refs:
- Work pools: Work pools
- Global concurrency limits: Global concurrency limits
- ECS guide: Prefect AWS ECS guide

If you can share the `work-pool inspect`, `deployment inspect`, and `concurrency-limit ls` output for prod, I’ll help pinpoint the exact bottleneck.
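One nuance on the commands above: `prefect concurrency-limit` manages the tag-based task limits, while named keys acquired with the `concurrency()` context manager are global concurrency limits with their own CLI group. If `concurrency-limit ls` looks empty but your flows use named keys, it may be worth listing those too; a hedged sketch, assuming the `global-concurrency-limit` command group exists in your Prefect 3.x version:

```
# list named/global concurrency limits and inspect a specific key
prefect global-concurrency-limit ls
prefect global-concurrency-limit inspect "<key>"
```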
n
Ok, if that is the case, then it is definitely not concurrency tags causing my problem. All the flows just sit in LATE status, never actually submitting to the infra
@Marvin It seems like maybe the task submission on the work pool is getting thrown off by a long queue?
m
thought for 116.0 seconds
A long work queue by itself won’t slow down submission. The scheduler will keep starting due runs up to whatever limits are in effect. However, a work queue can have its own concurrency limit that caps how many flow runs from that queue can start at once — and that often looks like “only 1–2 running” with a big Late backlog.

What to check right now
- List queues for the pool and see if any have a low concurrency limit:
```
prefect work-queue ls -p "<your-prod-pool>" -v
```
- Inspect the specific queue your deployment uses (check the deployment’s `work_queue_name`):
```
prefect work-queue inspect "<queue-name>" -p "<your-prod-pool>"
```
If you find a low limit, raise it:
```
prefect work-queue set-concurrency-limit "<queue-name>" 20
```
Preview what the queue is planning to run (helps spot blocked runs):
```
prefect work-queue preview "<queue-name>" -p "<your-prod-pool>" -h 2
```
A few clarifications
- Task concurrency limits: do not block flow submission; they only gate tasks inside a running flow.
- Deployment concurrency limit: will cap the number of flow runs for that deployment, regardless of pool/queue settings.
- Workspace concurrent runs cap (in Prefect Cloud): can cap total active flow runs across the workspace.
- Infra capacity (ECS): if ECS can only place 1–2 tasks (IP exhaustion, vCPU quota), Prefect will appear to “trickle start” runs. Check ECS task events for Pending/ENI/capacity errors.

If you can share:
- `prefect work-queue ls -p "<pool>" -v`
- `prefect work-queue inspect "<queue>" -p "<pool>"`
- `prefect deployment inspect "<flow>/<deployment>"` (to see `work_queue_name` and any `concurrency_limit`)
- Any ECS task events during the backlog
…I can pinpoint whether this is a queue concurrency cap, deployment/workspace cap, or ECS capacity issue and suggest the exact fix.

Refs:
- Work pools: Work pools
- Global concurrency limits: Global concurrency limits
- ECS guide: Prefect AWS ECS guide
n
@Marvin

```
prefect work-queue ls -p work-pool-ecs
20:11:36.408 | DEBUG | prefect.profiles - Using profile 'default'
20:11:36.908 | DEBUG | prefect.client - Connecting to API at https://api.prefect.cloud/api/accounts/971ef92b-3611-4399-afff-8201adaa1016/workspaces/ea754a79-f46a-4cbd-8b5b-c1ca6912a885/
        Work Queues in Work Pool 'workpool-ecs'
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Name      ┃ Priority ┃ Concurrency Limit ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ scheduled │ 2        │ None              │
│ on-demand │ 1        │ None              │
│ default   │ 3        │ None              │
└───────────┴──────────┴───────────────────┘
(**) denotes a paused queue

Documents/GitHub/djangoproject on 🌱 share-364-cross-product-contamination [$!] 🐍 v3.10.18 🧪 ask-prefect ☁️ @us-east-1 ⏱️ 3s
❯ prefect work-queue preview "on-demand" -p "workpool-ecs" -h 2
20:12:15.547 | DEBUG | prefect.profiles - Using profile 'default'
20:12:16.028 | DEBUG | prefect.client - Connecting to API at https://api.prefect.cloud/api/accounts/971ef92b-3611-4399-afff-8201adaa1016/workspaces/ea754a79-f46a-4cbd-8b5b-c1ca6912a885/
20:12:16.227 | DEBUG | prefect.client - Connecting to API at https://api.prefect.cloud/api/accounts/971ef92b-3611-4399-afff-8201adaa1016/workspaces/ea754a79-f46a-4cbd-8b5b-c1ca6912a885/
Traceback (most recent call last):
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/site-packages/prefect/cli/_utilities.py", line 44, in wrapper
    return fn(*args, **kwargs)
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/site-packages/prefect/cli/_types.py", line 156, in sync_fn
    return asyncio.run(async_fn(*args, **kwargs))
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/site-packages/prefect/cli/work_queue.py", line 479, in preview
    responses = await client.get_scheduled_flow_runs_for_work_pool(
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/site-packages/prefect/client/orchestration/_work_pools/client.py", line 586, in get_scheduled_flow_runs_for_work_pool
    response = await self.request(
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/site-packages/prefect/client/orchestration/base.py", line 53, in request
    return await self._client.send(request)
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/site-packages/prefect/client/base.py", line 361, in send
    response.raise_for_status()
  File "/opt/miniconda3/envs/ask-prefect/lib/python3.10/site-packages/prefect/client/base.py", line 162, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Client error '400 Bad Request' for url 'https://api.prefect.cloud/api/accounts/971ef92b-3611-4399-afff-8201adaa1016/workspaces/ea754a79-f46a-4cbd-8b5b-c1ca6912a885/work_pools/consumeredge-ecs/get_scheduled_flow_runs'
Response: {'detail': 'Work pool "workpool-ecs" is a push pool and does not support polling.'}
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
```
m
thought for 69.5 seconds
You’ve hit an expected limitation: push work pools don’t support queue polling/preview. That 400 error is telling us the same thing — push pools don’t use workers to poll queues, so a “long queue” won’t slow submission by itself, and work-queue concurrency limits don’t apply to push pools.

What does control submission for push pools
- Workspace concurrent runs (Cloud plan): caps total active flow runs.
- Work pool concurrency limit: caps runs started by that pool.
- Deployment-level concurrency: caps runs for a deployment regardless of pool capacity.
- Infrastructure capacity (ECS): if ECS can only place 1–2 tasks (e.g., ENI/IP exhaustion or Fargate vCPU quotas), it will look like Prefect is “trickling” runs.

What to check next
- Inspect the pool and its recent activity/errors:
```
prefect work-pool inspect "workpool-ecs"
```
- Inspect the deployment(s) for a concurrency limit and queue target:
```
prefect deployment inspect "<flow-name>/<deployment-name>"
```
- Confirm your workspace “Concurrent runs” limit in Prefect Cloud (Settings -> Usage & limits).
- In AWS, check the ECS cluster/service events during the backlog: are new tasks stuck in PENDING with ENI/IP or capacity messages? That’s the most common cause of 1–2 active runs despite a higher Prefect limit.

If you can share:
- The output of `prefect work-pool inspect "workpool-ecs"`
- The output of `prefect deployment inspect "<flow>/<deployment>"`
- A snippet from one of the Late runs’ state messages
- Any ECS task events (PENDING reasons)
…I can pinpoint the bottleneck and suggest exact fixes.

Refs:
- Work pools: Work pools
- Global concurrency limits: Global concurrency limits
- ECS guide: Prefect AWS ECS guide
n
```
},
'job_configuration': {
  'env': '{{ env }}', 'name': '{{ name }}', 'labels': '{{ labels }}', 'vpc_id': '{{ vpc_id }}', 'cluster': '{{ cluster }}', 'command': '{{ command }}', 'stream_output': '{{ stream_output }}', 'container_name': '{{ container_name }}', 'aws_credentials': '{{ aws_credentials }}',
  'task_definition': { 'cpu': '{{ cpu }}', 'family': '{{ family }}', 'memory': '{{ memory }}', 'ephemeralStorage': {'sizeInGiB': '{{ ephemeral_storage }}'}, 'executionRoleArn': '{{ execution_role_arn }}', 'containerDefinitions': [{'name': '{{ container_name }}', 'image': '{{ image }}'}] },
  'task_run_request': { 'tags': '{{ labels }}', 'cluster': '{{ cluster }}', 'overrides': { 'cpu': '{{ cpu }}', 'memory': '{{ memory }}', 'taskRoleArn': '{{ task_role_arn }}', 'containerOverrides': [ { 'cpu': '{{ cpu }}', 'name': '{{ container_name }}', 'memory': '{{ memory }}', 'command': '{{ command }}', 'environment': '{{ env }}' } ] }, 'launchType': '{{ launch_type }}', 'taskDefinition': '{{ task_definition_arn }}', 'capacityProviderStrategy': '{{ capacity_provider_strategy }}' },
  'network_configuration': '{{ network_configuration }}',
  'cloudwatch_logs_prefix': '{{ cloudwatch_logs_prefix }}',
  'cloudwatch_logs_options': '{{ cloudwatch_logs_options }}',
  'task_watch_poll_interval': '{{ task_watch_poll_interval }}',
  'configure_cloudwatch_logs': '{{ configure_cloudwatch_logs }}',
  'task_start_timeout_seconds': '{{ task_start_timeout_seconds }}',
  'auto_deregister_task_definition': '{{ auto_deregister_task_definition }}',
  'match_latest_revision_in_family': '{{ match_latest_revision_in_family }}'
  }
},
concurrency_limit=32,
status=WorkPoolStatus.READY,
storage_configuration=WorkPoolStorageConfiguration(),
default_queue_id='8360ac25-7956-4934-b10c-853ceaf5953b'
```
@Marvin

```
❯ prefect work-pool inspect "workpool-ecs"
20:14:46.894 | DEBUG | prefect.profiles - Using profile 'default'
20:14:47.395 | DEBUG | prefect.client - Connecting to API at https://api.prefect.cloud/api/accounts/971ef92b-3611-4399-afff-8201adaa1016/workspaces/ea754a79-f46a-4cbd-8b5b-c1ca6912a885/
WorkPool(
    id='e7ca7d74-8e73-4d29-9666-531afda84f4b',
    created=DateTime(2025, 8, 29, 17, 32, 29, 953615, tzinfo=Timezone('UTC')),
    updated=DateTime(2025, 9, 7, 0, 9, 54, 918831, tzinfo=Timezone('UTC')),
    name='workpool-ecs',
    type='ecs:push',
    base_job_template={
      'variables': {
        'type': 'object',
        'properties': {
          'cpu': { 'type': 'integer', 'title': 'CPU', 'description': 'The amount of CPU to provide to the ECS task. Valid amounts are specified in the AWS documentation. If not provided, a default value of 1024 will be used unless present on the task definition.' },
          'env': { 'type': 'object', 'title': 'Environment Variables', 'default': {'ENVIRONMENT': 'production'}, 'description': 'Environment variables to provide to the task run. These variables are set on the Prefect container at task runtime. These will not be set on the task definition.', 'additionalProperties': {'type': 'string'} },
          'name': { 'type': 'string', 'title': 'Name', 'description': 'Name given to infrastructure created by a worker.' },
          'image': { 'type': 'string', 'title': 'Image', 'default': '143690486798.dkr.ecr.us-east-1.amazonaws.com/amplifydata-production-prefect-ecs:prod-3ee85cdb66a014df5c9acad85c014831a76cd9e4-b348', 'description': 'The image to use for the Prefect container in the task. If this value is not null, it will override the value in the task definition. This value defaults to a Prefect base image matching your local versions.' },
          'family': { 'type': 'string', 'title': 'Family', 'default': 'production_workpool-ecs', 'description': 'A family for the task definition. If not provided, it will be inferred from the task definition. If the task definition does not have a family, the name will be generated. When flow and deployment metadata is available, the generated name will include their names. Values for this field will be slugified to match AWS character requirements.' },
          'labels': { 'type': 'object', 'title': 'Labels', 'description': 'Labels applied to infrastructure created by a worker.', 'additionalProperties': {'type': 'string'} },
          'memory': { 'type': 'integer', 'title': 'Memory', 'description': 'The amount of memory to provide to the ECS task. Valid amounts are specified in the AWS documentation. If not provided, a default value of 2048 will be used unless present on the task definition.' },
          'vpc_id': { 'type': 'string',
```
m
thought for 49.9 seconds
Thanks for the details — that output confirms this is a push pool (`type='ecs:push'`). For push pools:
- Work queues aren’t polled, so a long queue won’t slow submission.
- Work-queue concurrency limits don’t apply.
- The 400 on preview is expected for push pools.

Where to look next
- Pool-level concurrency
  - The `work-pool inspect` output should show `concurrency_limit`. Could you scroll up/down and paste that field? If it’s set to 1–2, that would explain the behavior. If it’s `None`, the pool isn’t capping you.
- Deployment-level concurrency
  - Please inspect the deployment(s) for this customer:
```
prefect deployment inspect "<flow-name>/<deployment-name>"
```
- Look for `concurrency_limit`. If it’s set low (e.g., 1–2), only that many runs will start regardless of pool capacity.
- Workspace concurrent runs (Cloud limit)
  - In the Prefect Cloud UI, check Settings -> Usage & limits. If your prod workspace has a low “Concurrent runs” cap, you’ll see exactly this symptom (1–2 active with many Late).
- ECS capacity and events (most common with push pools)
  - If ECS can only place 1–2 tasks (e.g., due to IP/ENI exhaustion in subnets or Fargate vCPU quota), Prefect will appear to “trickle” run starts.
  - In the ECS console (cluster -> Events), during the backlog window, do you see tasks stuck in PENDING with:
    - “RESOURCE:ENI” / “no available IP addresses”
    - “insufficient vCPU/memory capacity”
  - What CPU/memory are you using in prod vs dev? Your base template defaults show CPU=1024, Memory=2048, but if your prod job variables request larger resources, fewer tasks can be placed.
  - Which subnets is this pool using? Small or crowded private subnets frequently cap concurrency. Check “Available IPv4 addresses” on those subnets; if low, add larger/more subnets across AZs.
  - Also check your region’s Fargate On-Demand vCPU quota (Service Quotas). If each task uses 1 vCPU and you have a 16–32 vCPU quota, you’ll only get 16–32 tasks across the account/region.

To help pinpoint, could you share:
- From `prefect work-pool inspect "workpool-ecs"`: the `concurrency_limit` value and any “Recent activity” messages.
- From `prefect deployment inspect "<flow>/<deployment>"`: confirm `concurrency_limit` and `work_queue_name`.
- A snippet from a Late run’s state history (any “failed to submit infrastructure” messages).
- From ECS: whether tasks are PENDING and the exact event reasons.

References:
- Work pools: Work pools
- Global concurrency limits (for completeness): Global concurrency limits
- ECS tips: Prefect AWS ECS guide
n
@Marvin

```
'job_configuration': {
  'env': '{{ env }}', 'name': '{{ name }}', 'labels': '{{ labels }}', 'vpc_id': '{{ vpc_id }}', 'cluster': '{{ cluster }}', 'command': '{{ command }}', 'stream_output': '{{ stream_output }}', 'container_name': '{{ container_name }}', 'aws_credentials': '{{ aws_credentials }}',
  'task_definition': { 'cpu': '{{ cpu }}', 'family': '{{ family }}', 'memory': '{{ memory }}', 'ephemeralStorage': {'sizeInGiB': '{{ ephemeral_storage }}'}, 'executionRoleArn': '{{ execution_role_arn }}', 'containerDefinitions': [{'name': '{{ container_name }}', 'image': '{{ image }}'}] },
  'task_run_request': { 'tags': '{{ labels }}', 'cluster': '{{ cluster }}', 'overrides': { 'cpu': '{{ cpu }}', 'memory': '{{ memory }}', 'taskRoleArn': '{{ task_role_arn }}', 'containerOverrides': [ { 'cpu': '{{ cpu }}', 'name': '{{ container_name }}', 'memory': '{{ memory }}', 'command': '{{ command }}', 'environment': '{{ env }}' } ] }, 'launchType': '{{ launch_type }}', 'taskDefinition': '{{ task_definition_arn }}', 'capacityProviderStrategy': '{{ capacity_provider_strategy }}' },
  'network_configuration': '{{ network_configuration }}',
  'cloudwatch_logs_prefix': '{{ cloudwatch_logs_prefix }}',
  'cloudwatch_logs_options': '{{ cloudwatch_logs_options }}',
  'task_watch_poll_interval': '{{ task_watch_poll_interval }}',
  'configure_cloudwatch_logs': '{{ configure_cloudwatch_logs }}',
  'task_start_timeout_seconds': '{{ task_start_timeout_seconds }}',
  'auto_deregister_task_definition': '{{ auto_deregister_task_definition }}',
  'match_latest_revision_in_family': '{{ match_latest_revision_in_family }}'
  }
},
concurrency_limit=32,
status=WorkPoolStatus.READY,
storage_configuration=WorkPoolStorageConfiguration(),
default_queue_id='8360ac25-7956-4934-b10c-853ceaf5953b'
```
m
thought for 33.3 seconds
Thanks — that shows your push pool is healthy and has a pool-level `concurrency_limit` of 32, so Prefect is not capping you at 1–2 from the pool side.

Given this is an ecs:push pool, the usual culprits for “hundreds of Late runs, only 1–2 running” are:
- A workspace “Concurrent runs” cap (Cloud usage/limits)
- A deployment-level concurrency limit on the affected deployment(s)
- ECS can’t place more than 1–2 tasks at a time (most common)

Next best checks

1) Workspace limit (Cloud UI)
- Settings -> Usage & limits -> “Concurrent runs”. If this is ≤2, that exactly matches your symptom.

2) Deployment config
- Check the deployment feeding this pool:
```
prefect deployment inspect "<flow-name>/<deployment-name>"
```
- Look for `concurrency_limit` and confirm it’s not set to 1–2.

3) ECS placement capacity (likely)
- When the backlog occurs, in ECS (Cluster -> Events), do you see tasks stuck in PENDING with messages like:
  - “unable to place task … RESOURCE:ENI” or “no available IP addresses”
  - “insufficient vCPU/memory capacity”
  - If using Fargate Spot: “capacity not available”
- These indicate:
  - Subnet IP exhaustion: small/crowded private subnets cap you at a handful of tasks. Add larger/more subnets across AZs and ensure the pool’s `network_configuration` includes them.
  - Fargate vCPU quota limits: request a quota increase or reduce per-task CPU/memory (see the quota sketch after the subnet check below).
- Quick subnet IP check (replace with your subnet IDs):
```
aws ec2 describe-subnets --subnet-ids subnet-abc subnet-def \
  --query 'Subnets[].{SubnetId:SubnetId,AvailableIPs:AvailableIpAddressCount}'
```
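Alongside the subnet check, a hedged sketch for eyeballing the Fargate vCPU quotas in this region (assumes the Service Quotas service code `fargate`; exact quota names can vary by account):

```
# show the vCPU-based Fargate quotas currently applied to the account/region
aws service-quotas list-service-quotas --service-code fargate \
  --query "Quotas[?contains(QuotaName, 'vCPU')].{Name:QuotaName,Applied:Value}"
```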
A couple of targeted questions to pinpoint:
- What is your workspace “Concurrent runs” limit in prod?
- Output of:
```
prefect deployment inspect "<flow>/<deployment>"
```
- From ECS during the backlog: a few cluster/service event lines (especially PENDING reasons).
- Roughly how many subnets are in your `network_configuration` and their available IP counts? Are they spread across multiple AZs?
- Are you on Fargate Spot for this pool?

If you can share those, I’ll map the exact fix (e.g., raise workspace limit, adjust deployment concurrency, add subnets, or request Fargate quota increase).

For a quick isolation test, try kicking off ~20 tiny runs (e.g., a no-op flow) on this same pool; if ECS only places 1–2 tasks, it’s definitely an ECS capacity issue, not Prefect.
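A minimal sketch of that isolation test from the CLI, assuming a lightweight deployment that targets this same pool (the `noop-flow/noop` name here is hypothetical):

```
# fire 20 ad-hoc runs at the push pool, then watch how many ECS tasks actually start
for i in $(seq 1 20); do
  prefect deployment run "noop-flow/noop"
done
```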
n
@Marvin You said this:
- Also check your region’s Fargate On-Demand vCPU quota (Service Quotas). If each task uses 1 vCPU and you have a 16–32 vCPU quota, you’ll only get 16–32 tasks across the account/region.
but this would actually be fine. The problem is that it spends a lot of time with no runs running.
m
thought for 188.0 seconds
Got it — if you’re seeing long stretches with zero flows running despite a big backlog, that usually means Prefect is trying to submit runs but submission is failing or stalling, not that you’re hitting a vCPU cap.

For ecs:push pools, the most common causes of “nothing is running” are:
- ECS cannot place tasks at that moment (Spot capacity gaps, ENI/IP exhaustion, insufficient capacity). Prefect attempts submission, gets back errors or tasks that never start, then backs off/retries — during that time you’ll see Late runs and zero active.
- AWS API throttling or IAM/permission errors during RunTask or DescribeTasks — Prefect retries with backoff so apparent throughput drops to zero for a while (see the CloudTrail sketch at the end of this message).
- Deployment-level concurrency limit or workspace concurrent runs cap set very low (worth confirming, but those usually still keep 1–2 running rather than zero).

Let’s capture evidence to confirm submission stalls

1) Check a few “Late” runs’ history and logs:
```
prefect flow-run ls --state LATE --limit 5
prefect flow-run inspect <run-id>
prefect flow-run logs <run-id>
```
Look for messages like “Encountered error starting infrastructure”, “rate limited”, “failed to create ECS task”, or repeated submission retries.

2) Work Pool activity (Cloud UI)
- Open the work pool -> Activity. Do you see errors around the times when nothing is running? Copy a couple of those messages if present.

3) ECS cluster/service events during the quiet window
- In the ECS console, check Cluster -> Events for the relevant cluster. Look for:
  - “unable to place task… no available IP addresses” (subnet IP/ENI exhaustion)
  - “insufficient vCPU/memory capacity”
  - “FARGATE_SPOT capacity not available”
  - Image pull or permission errors (ECR/NAT)
- These exactly correlate with “submission attempts but nothing runs.”

A few targeted configuration checks
- Capacity provider strategy: Are you using Spot-only?
  - If yes, intermittent capacity can produce long gaps with zero placements. Try mixing On-Demand as fallback:
```
capacityProviderStrategy:
    - capacityProvider: FARGATE
      base: 1
      weight: 1
    - capacityProvider: FARGATE_SPOT
      weight: 1
```
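If you do change the strategy, one way to apply it is to edit the pool's base job template in the UI, or push an edited template file with the CLI; a hedged sketch, assuming your Prefect version supports `--base-job-template` on `prefect work-pool update` and that the edited template is saved as `ecs-base-job-template.json`:

```
# apply the edited base job template (including the new capacityProviderStrategy default) to the pool
prefect work-pool update "workpool-ecs" --base-job-template ecs-base-job-template.json
```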
- Network configuration (subnets/security groups):
  - Are you pinned to 1 small private subnet? Check available IPs:
```
aws ec2 describe-subnets --subnet-ids <subnet-ids> --query 'Subnets[].{SubnetId:SubnetId,AvailableIPs:AvailableIpAddressCount}'
```
  - If low, add more/larger subnets across multiple AZs and include them in the pool’s `network_configuration`.
- Task start timing:
  - If tasks are created but never start, Prefect will wait up to `task_start_timeout_seconds` before marking failure and retrying. A high value here plus repeated placement failures can make it look like nothing’s running. Consider reducing this timeout temporarily to fail fast while you diagnose.

Also double-check the non-infra gates
- Workspace “Concurrent runs” (Cloud UI -> Settings -> Usage & limits).
- `prefect deployment inspect "<flow>/<deployment>"` for this customer: confirm `concurrency_limit` is not set very low.

If you can share:
- A “Late” flow run’s inspect/logs output,
- A few lines from the Work Pool Activity at the time of the quiet window,
- ECS cluster events (PENDING reasons) and whether you’re using FARGATE_SPOT,
- Your pool’s `network_configuration` subnets and their available IP counts,
…I’ll map this to a concrete fix (e.g., add subnets, adjust capacity providers, tweak timeouts, or resolve throttling/permissions).

References:
- Work pools: Work pools
- Global concurrency limits (for completeness): Global concurrency limits
- ECS guide and tips: Prefect AWS ECS guide
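On the “AWS API throttling or IAM/permission errors during RunTask” point above, CloudTrail can show whether the push pool is calling `RunTask` at all during a quiet window; a hedged sketch, assuming management-event logging is enabled in this region:

```
# most recent RunTask calls in this region: were any made during the backlog, and by which principal?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunTask \
  --max-results 20 \
  --query 'Events[].{Time:EventTime,User:Username}'
```

No RunTask events during the gap would support "never submitted to the infra"; if events are there, the raw `CloudTrailEvent` payloads will show any error codes.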
n
@Marvin I would LOVE it if I were getting infra errors. But I am not. They are just late. Not even trying to hit the infra.
m
thought for 151.4 seconds
Thanks for clarifying — if there are long stretches with zero runs and you don’t see any infra submissions at all, we should focus on orchestration gates rather than ECS.

What most often causes “nothing starts and no infra attempts” in push pools
- Workspace concurrent runs cap: If the workspace limit is reached by other runs, new runs will sit Scheduled/Late and you’ll see no infra attempts. This can look like “dead air” even if the pool has a high limit. Please check Cloud UI -> Settings -> Usage & limits -> Concurrent runs.
- Deployment-level gate: A `concurrency_limit` or `concurrency_options` on the deployment will prevent submission even if the pool can run more. Verify on the exact prod deployment:
```
prefect deployment inspect "<flow-name>/<deployment-name>"
```
Look for `concurrency_limit` (and `concurrency_options` if set).
- Pool concurrency slots “occupied” by phantom/older runs: If the backend thinks slots are in use (e.g., long-running RUNNING/PENDING runs that aren’t actually active in ECS), it won’t submit more. In the Work Pool UI, check the “Active runs” count during the quiet window — does it show non-zero even though ECS shows 0? If yes, find those runs and clear/fix them.
- Automations/pauses: A paused deployment/schedule won’t generate new due runs, but you said you already have Late runs so this is less likely. Still, confirm the deployment isn’t paused and there aren’t automations pausing things.

Concrete next steps to pinpoint the gate

1) Inspect a few Late runs for the state reason
- Grab a few Late runs and check their state history/logs:
```
prefect flow-run ls --state LATE --limit 5
prefect flow-run inspect <run-id>
prefect flow-run logs <run-id>
```
Look for messages like “not started due to concurrency limit” or any orchestration reason. If there’s no infra attempt, you won’t see ECS errors — you’ll see scheduling/concurrency reasons.

2) Check the deployment’s gating settings
```
prefect deployment inspect "<flow-name>/<deployment-name>"
```
- Share whether `concurrency_limit` is set and the `work_pool_name` matches `workpool-ecs`.

3) Confirm workspace concurrent runs
- In Cloud UI -> Settings -> Usage & limits: what is “Concurrent runs” and how many are in use during the quiet periods? Are other flows in the workspace consuming the slots?

4) Check whether the pool believes slots are in use
- In the Work Pool UI, check “Active runs” and “Recent activity” at the time nothing is running in ECS. If the pool shows active runs or “skipping due to concurrency limit,” that’s our smoking gun.
- As a safe test, you can toggle the pool concurrency to refresh it:
```
prefect work-pool set-concurrency-limit "workpool-ecs" 31
prefect work-pool set-concurrency-limit "workpool-ecs" 32
```
This won’t cancel runs; it just resets the limit value and can clear a stale slot counter if that’s the issue.

5) Optional: Preview due work for the pool
- Unlike queue preview, `work-pool preview` may work for push pools:
```
prefect work-pool preview "workpool-ecs" -h 2
```
If it returns due runs but nothing is being submitted and no infra is attempted, we’re definitely hitting an orchestration cap.

If you can share:
- Output of `prefect deployment inspect "<flow>/<deployment>"` (especially `concurrency_limit` and `work_pool_name`)
- A Late run’s `prefect flow-run inspect <id>` state message
- Workspace “Concurrent runs” limit and how many were in use at the time
- Screenshot/text of Work Pool “Active runs” and “Recent activity” during the quiet window
…I’ll map this to an exact fix (raise workspace limit, adjust deployment concurrency, clear/resolve stale runs, or similar).

References:
- States and reasons: States
- Work pools: Work pools
- Schedules: Schedules
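To make gathering all of that easier, a small sketch that just bundles the commands already suggested in this thread into one output file (fill in `<flow>/<deployment>` and `<late-run-id>`):

```
# collect the Prefect-side diagnostics requested above into a single file
{
  prefect work-pool inspect "workpool-ecs"
  prefect deployment inspect "<flow>/<deployment>"
  prefect concurrency-limit ls
  prefect flow-run inspect "<late-run-id>"
} > prefect-late-run-diagnostics.txt 2>&1
```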