Ravish Mallya
08/27/2025, 4:06 PM
Failed to run ECS task: RESOURCE:CPU. Ideally the ASG should have handled this automatically, shouldn't it? What triggers the scale?
Marvin
08/27/2025, 4:07 PM
The tasks failed with `RESOURCE:CPU` because there wasn't enough EC2 capacity at the moment of placement.
What’s going on
- `RESOURCE:CPU` means the ECS scheduler couldn't find enough free CPU on any container instance to place the task. ECS doesn't queue tasks by default; the placement just fails.
- An Auto Scaling Group only scales when a scaling policy/alarm triggers. Typical out-of-the-box ASG target tracking on EC2 instance CPU won't help here: instance CPU stayed low precisely because those tasks were never placed.
- ECS will only "auto-scale the EC2 capacity" for you if you wire up ECS Capacity Providers with Managed Scaling and use them with an ECS Service. One-off `RunTask` calls generally do not drive managed scale-out in time (and managed scaling primarily acts on services' desired count).
- Scale-out also takes time (launch + bootstrap), so even with correct wiring you need retries so tasks can place after new capacity arrives.
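For the `RunTask` path, here is a minimal boto3 sketch of the detect-and-retry pattern (cluster and task definition names are placeholders). Note that placement failures come back in the response's `failures` list rather than as exceptions, so you check for `RESOURCE:CPU` and back off:
```python
import time

import boto3

ecs = boto3.client("ecs")

def run_task_with_retry(max_attempts: int = 5, base_delay: float = 30.0) -> dict:
    """Retry RunTask with exponential backoff so the task can place
    once new EC2 capacity comes up."""
    for attempt in range(max_attempts):
        resp = ecs.run_task(
            cluster="my-cluster",          # placeholder
            taskDefinition="my-task-def",  # placeholder
            launchType="EC2",
            count=1,
        )
        if resp["tasks"]:
            return resp["tasks"][0]
        # Placement failures are reported here, not raised as exceptions.
        reasons = {f.get("reason") for f in resp["failures"]}
        if not reasons & {"RESOURCE:CPU", "RESOURCE:MEMORY"}:
            raise RuntimeError(f"RunTask failed: {resp['failures']}")
        time.sleep(base_delay * (2 ** attempt))  # wait for scale-out
    raise RuntimeError("No capacity became available after retries")
```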
What actually triggers scale
- Without capacity providers: whatever CloudWatch alarms/target-tracking you attached to the ASG (e.g., instance CPU, custom metrics). If you haven’t configured any, nothing will trigger.
- With ECS Capacity Providers + Managed Scaling on a Service: ECS attaches target-tracking policies to the ASG based on the capacity provider’s utilization target; ECS will scale the ASG when service desired tasks cannot be placed. See ECS Capacity Providers and Managed Scaling and ECS Service Auto Scaling.
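As a sketch of that wiring (ASG ARN, capacity provider, and cluster names are placeholders), you can create a capacity provider with Managed Scaling and attach it to the cluster via boto3:
```python
import boto3

ecs = boto3.client("ecs")

# Create a capacity provider backed by an existing ASG, with managed
# scaling enabled so ECS can grow/shrink the group for you.
ecs.create_capacity_provider(
    name="my-cp",
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:...",  # placeholder ARN
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 90,          # keep ~10% headroom for bursts
            "minimumScalingStepSize": 1,
            "maximumScalingStepSize": 10,
        },
        "managedTerminationProtection": "DISABLED",
    },
)

# Attach it to the cluster and make it the default strategy.
ecs.put_cluster_capacity_providers(
    cluster="my-cluster",
    capacityProviders=["my-cp"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "my-cp", "weight": 1}
    ],
)
```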
Ways to fix this
- Easiest: use Fargate launch type so you don’t manage EC2 capacity. No ASG to scale, and your 20 tasks will just start (subject to Fargate account limits).
- If you must use EC2:
- Use an ECS Service with a Capacity Provider (Managed Scaling on) and submit work by changing the service desired count or using queue-driven service autoscaling. This allows ECS to scale the ASG proactively.
- If you stick with `RunTask`, add proactive scaling signals for the ASG (e.g., target tracking on ECS cluster `CPUReservation`/`MemoryReservation`, or step scaling on a queue depth metric), and implement retries/backoff so tasks that initially fail can be resubmitted once new instances come up (see the sketch after this list).
- Consider placement strategies to reduce CPU fragmentation (e.g., `binpack` on `cpu`). If each task requests a specific `cpu` value, fragmentation can prevent placement even when "total" CPU seems sufficient. See Task placement strategies.
- Pre-scale the ASG (temporarily bump desired capacity) ahead of expected bursts.
- Also verify task CPU/memory requests and instance types align; overly large `cpu`/`memory` settings per task can make placement much harder.
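A sketch of the proactive-signal option above, with placeholder ASG/cluster names: a step-scaling policy on the ASG driven by a CloudWatch alarm on the cluster's `CPUReservation` metric, so the ASG grows when reservation (not instance CPU) runs high:
```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Step-scaling policy that adds instances to the ASG when triggered.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",  # placeholder
    PolicyName="scale-out-on-cpu-reservation",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    StepAdjustments=[
        {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 2},
    ],
)

# Alarm on the cluster-level CPUReservation metric published by ECS.
cloudwatch.put_metric_alarm(
    AlarmName="ecs-cpu-reservation-high",
    Namespace="AWS/ECS",
    MetricName="CPUReservation",
    Dimensions=[{"Name": "ClusterName", "Value": "my-cluster"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```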
If you’re launching these via Prefect (3.x) ECS task runs
- Prefer Fargate for bursty workloads.
- If running on EC2, you can specify a capacity provider strategy in your ECS infrastructure so runs go to a capacity provider backed by an ASG with Managed Scaling, but you’ll still want retries/backoff for bursts.
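A hedged Prefect 3.x sketch of that last point: deploying a flow to an ECS work pool and passing a capacity provider strategy as a job variable. The pool, image, and capacity provider names are assumptions, and `capacity_provider_strategy` must be a field exposed by your work pool's base job template (the prefect-aws ECS worker's job configuration includes one):
```python
from prefect import flow

@flow
def my_flow():
    ...

if __name__ == "__main__":
    # job_variables override the work pool's base job template
    # for this deployment only.
    my_flow.deploy(
        name="burst-jobs",
        work_pool_name="my-ecs-pool",         # assumed ECS work pool name
        image="my-registry/my-image:latest",  # placeholder image
        job_variables={
            "capacity_provider_strategy": [
                {"capacityProvider": "my-cp", "weight": 1}  # placeholder name
            ],
        },
    )
```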
A couple of clarifying questions to guide next steps:
- Are you running these as one-off `RunTask` calls (e.g., batch/burst jobs) or as part of an ECS Service?
- Do you have an ECS Capacity Provider with Managed Scaling attached to your cluster/ASG?
- If you’re using Prefect to launch these, which launch type are you using (Fargate vs EC2), and do you have retries enabled for run placement failures?
Helpful docs
- `RESOURCE:CPU` troubleshooting: Troubleshooting guide