Ravish Mallya
09/01/2025, 12:44 PM
RuntimeError: Failed to run ECS task: RESOURCE:CPU error?

Marvin
09/01/2025, 12:55 PM
1) Set a concurrency limit on your ECS work pool so excess runs queue in Prefect instead of being submitted to ECS before capacity exists
prefect work-pool set-concurrency-limit "<your-ecs-pool>" 10
prefect work-pool inspect "<your-ecs-pool>"
Docs:
- Work pools: Concepts: Work pools
- CLI reference: CLI: work-pool
2) Optionally add an automation to reschedule on infra crashes
- If you rely on autoscaling to add capacity shortly after submission, you can automatically reschedule runs that crash with RESOURCE:CPU.
- In Prefect Cloud, create an Automation:
- Trigger: Flow run enters Crashed (optionally filter message contains “Failed to run ECS task”)
- Action: Set state to Scheduled at “now + 2–5 minutes”
- Add a max attempts guard to prevent infinite loops
Docs: CLI: automation (you can configure via UI or API)
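To make step 2 concrete, the definition looks roughly like the sketch below. This is a rough, unverified shape of a Prefect Cloud automation body: the field names are assumptions based on event-trigger automations, and the reschedule delay and max-attempts guard are easiest to configure in the UI. Check the Automations docs rather than copying this verbatim.

```python
# ROUGH sketch only: take the real schema from the Prefect Automations
# UI/API docs. Field names below are assumptions, not a verified payload.
automation = {
    "name": "reschedule-on-ecs-capacity-crash",
    "trigger": {
        "type": "event",
        "posture": "Reactive",
        # fire when a flow run enters Crashed
        "expect": ["prefect.flow-run.Crashed"],
        "threshold": 1,
    },
    "actions": [
        {
            # move the crashed run back to Scheduled; add the 2-5 minute
            # delay and a max-attempts guard in the UI to avoid loops
            "type": "change-flow-run-state",
            "state": "SCHEDULED",
        }
    ],
}
print(automation["trigger"]["expect"][0])
```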
AWS-side tuning (helps reduce, not eliminate, these errors)
- Use ECS capacity providers with managed scaling and set a non-zero ASG min size to keep some warm capacity.
- Make sure task CPU/memory requests fit your instance types (bin-packing issues can cause RESOURCE:CPU even when total capacity looks sufficient).
- If you need near-zero placement failures, consider Fargate for those workloads.
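The bin-packing point above can be illustrated with a small sketch (hypothetical numbers; ECS measures CPU in units, where 1024 = 1 vCPU):

```python
# Illustrative sketch of ECS bin-packing: total cluster CPU can exceed a
# task's request while no single instance has enough free CPU to place it.
# Numbers are hypothetical.

def can_place(task_cpu: int, free_cpu_per_instance: list[int]) -> bool:
    """A task is placeable only if ONE instance has enough free CPU."""
    return any(free >= task_cpu for free in free_cpu_per_instance)

# Two instances with 512 CPU units free each: 1024 total, but a
# 1024-unit task fits on neither one -> RESOURCE:CPU placement failure.
free = [512, 512]
print(sum(free))              # 1024 total free
print(can_place(1024, free))  # False: fails despite sufficient total
print(can_place(512, free))   # True: a smaller task fits
```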
Notes
- There is no user-facing setting today to make the ECS worker “wait until EC2 scales up.” It retries a few times then gives up. The work pool concurrency limit is the reliable way to queue runs inside Prefect so they aren’t submitted to ECS prematurely.
- If you want the concurrency limit to track your cluster’s autoscaling, you can script updates to the pool limit via the Prefect CLI/API when your ASG scales.
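One way to sketch that script (the function name, instance sizing, and headroom factor are illustrative assumptions, not anything built into Prefect): recompute a limit from the ASG's current capacity and push it with the CLI command shown earlier.

```python
import math
import subprocess

def pool_limit_for_capacity(instance_count: int,
                            cpu_units_per_instance: int,
                            task_cpu_units: int,
                            headroom: float = 0.8) -> int:
    """Estimate how many tasks of a given size the cluster can hold,
    keeping headroom so placements don't race other workloads.
    All names and values here are illustrative assumptions."""
    total_cpu = instance_count * cpu_units_per_instance
    return max(1, math.floor(total_cpu * headroom / task_cpu_units))

# e.g. 4 instances of 4096 CPU units running 1024-unit tasks:
limit = pool_limit_for_capacity(4, 4096, 1024)
print(limit)  # 12 with the default 0.8 headroom

# Then push it to the pool (uncomment when wired into your ASG hook):
# subprocess.run(["prefect", "work-pool", "set-concurrency-limit",
#                 "<your-ecs-pool>", str(limit)], check=True)
```

You would trigger this from whatever hook observes your ASG scaling events (e.g. an EventBridge rule or lifecycle hook), so the pool limit tracks real capacity.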
If you share:
- Your pool name and desired max concurrent ECS tasks
- Whether you’re on Prefect Cloud or OSS
…I can provide an example automation config and help pick a safe concurrency limit for your cluster.

Ravish Mallya
09/01/2025, 4:15 PM

Marvin
09/01/2025, 4:16 PM
RESOURCE:CPU errors.
Docs:
- Work pools overview: Concepts: Work pools
- CLI (set/inspect limits): CLI: work-pool

Ravish Mallya
09/01/2025, 4:19 PM

Marvin
09/01/2025, 4:20 PM
- Adjust the pool limit as capacity changes:
prefect work-pool set-concurrency-limit "<your-ecs-pool>" <new_limit>
- CLI ref: CLI: work-pool
- Use task-level concurrency for the heavy stuff instead of throttling the whole pool
- Tag heavy tasks/flows and create a tag limit so lighter runs can continue to use new capacity:
prefect concurrency-limit create --limit 5 --tag heavy-ecs
- CLI ref: CLI: concurrency-limit
- Split workloads into multiple pools with different limits
- E.g., “cpu-heavy” pool (lower limit) and “standard” pool (higher limit) to better match capacity.
- AWS-side: keep a small warm min capacity and use capacity providers with managed scaling to reduce placement failures; ensure task CPU/memory fit your instance types.
Context on the behavior
- Without a pool limit, the worker submits immediately to ECS and may hit:
RuntimeError: Failed to run ECS task: RESOURCE:CPU
This is a known behavior when EC2 capacity isn’t ready: prefecthq/prefect#16111
- The ECS worker does a few internal retries with backoff, but it will not “wait indefinitely” for EC2 to come online. Prefect-side queuing (via pool/tag concurrency) is the reliable way to wait.
If you share your typical/peak task sizes and ASG scale ranges, I can suggest a starting pool limit and a simple script to auto-adjust it as your cluster scales.