<@ULVA73B9P> in ecs+ec2 setup, how make my prefect...
# ask-marvin
r
@Marvin in ecs+ec2 setup, how do I make my prefect workers queue the tasks if there is no resource available, instead of causing the
RuntimeError: Failed to run ECS task: RESOURCE:CPU
error?
m
thought for 643.9 seconds
Short answer: add a concurrency limit to your ECS work pool so Prefect will queue runs before they hit ECS. The ECS worker does a few internal retries, but it won’t “wait indefinitely” for EC2 capacity; without a limit, it will eventually raise the RESOURCE:CPU error.

What’s happening today
- In Prefect 3.x, the ECS worker calls RunTask. If ECS returns RESOURCE:CPU/MEMORY, the worker retries a few times with jittered backoff, then marks the flow run crashed and surfaces: RuntimeError: Failed to run ECS task: RESOURCE:CPU
- This is a known behavior with EC2-backed clusters that are still scaling up. See the issue: prefecthq/prefect#16111

How to “queue” until capacity is available

1) Set a work pool concurrency limit (recommended)
- Cap how many flow runs Prefect will start from that pool at once. Extra runs stay queued (AwaitingResource) and won’t call ECS until a slot frees up.
- Pick a limit at or below your cluster’s steady-state capacity to avoid overfilling ECS.

Commands:
```
prefect work-pool set-concurrency-limit "<your-ecs-pool>" 10
prefect work-pool inspect "<your-ecs-pool>"
```
Docs:
- Work pools: Concepts: Work pools
- CLI reference: CLI: work-pool

2) Optionally add an automation to reschedule on infra crashes
- If you rely on autoscaling to add capacity shortly after, you can automatically reschedule runs that crash with RESOURCE:CPU.
- In Prefect Cloud, create an Automation:
  - Trigger: Flow run enters Crashed (optionally filter message contains “Failed to run ECS task”)
  - Action: Set state to Scheduled at “now + 2–5 minutes”
  - Add a max attempts guard to prevent infinite loops
- Docs: CLI: automation (you can configure via UI or API)

AWS-side tuning (helps reduce, not eliminate, these errors)
- Use ECS capacity providers with managed scaling and set a non-zero ASG min size to keep some warm capacity.
- Make sure task CPU/memory requests fit your instance types (bin-packing issues can cause RESOURCE:CPU even when total capacity looks sufficient).
- If you need near-zero placement failures, consider Fargate for those workloads.

Notes
- There is no user-facing setting today to make the ECS worker “wait until EC2 scales up.” It retries a few times then gives up. The work pool concurrency limit is the reliable way to queue runs inside Prefect so they aren’t submitted to ECS prematurely.
- If you want the concurrency limit to track your cluster’s autoscaling, you can script updates to the pool limit via the Prefect CLI/API when your ASG scales (see the sketch below).

If you share:
- Your pool name and desired max concurrent ECS tasks
- Whether you’re on Prefect Cloud or OSS
…I can provide an example automation config and help pick a safe concurrency limit for your cluster.
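A minimal sketch of that “track your ASG” idea, assuming a single Auto Scaling Group backs the cluster and that roughly TASKS_PER_INSTANCE flow runs fit on each instance; POOL_NAME, ASG_NAME, and TASKS_PER_INSTANCE are placeholders, and it shells out to the same CLI command shown above rather than calling the Prefect API directly:

```python
# Hypothetical cron/EventBridge-triggered script: keep the Prefect work pool
# concurrency limit roughly in line with the ASG's current desired capacity.
import subprocess

import boto3

POOL_NAME = "<your-ecs-pool>"   # placeholder: your ECS work pool name
ASG_NAME = "<your-ecs-asg>"     # placeholder: the ASG backing the ECS cluster
TASKS_PER_INSTANCE = 2          # assumption: flow-run tasks that fit on one instance


def current_desired_capacity() -> int:
    # Read the ASG's desired capacity from AWS.
    asg = boto3.client("autoscaling")
    resp = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    return resp["AutoScalingGroups"][0]["DesiredCapacity"]


def set_pool_limit(limit: int) -> None:
    # Same CLI command recommended above, just invoked from the script.
    subprocess.run(
        ["prefect", "work-pool", "set-concurrency-limit", POOL_NAME, str(limit)],
        check=True,
    )


if __name__ == "__main__":
    limit = max(1, current_desired_capacity() * TASKS_PER_INSTANCE)
    set_pool_limit(limit)
    print(f"Set concurrency limit for {POOL_NAME!r} to {limit}")
```

Run it on a schedule or from an EventBridge rule on ASG scaling events, and verify the result with prefect work-pool inspect.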
r
@Marvin but concurrency means waiting for x runs to complete before scheduling the next x runs, right?
m
thought for 21.6 seconds
Yes, effectively, but it’s rolling, not batched.
- A work pool concurrency limit of X means at most X flow runs from that pool can be “in flight” at once.
- When the pool is at its limit, additional runs stay queued (AwaitingResource) and are not submitted to ECS.
- As soon as one of the X finishes, the next queued run is dispatched; no need to wait for all X to complete.

This keeps Prefect from submitting runs to ECS when your cluster can’t place them, preventing the RESOURCE:CPU errors.

Docs:
- Work pools overview: Concepts: Work pools
- CLI (set/inspect limits): CLI: work-pool
r
but it is possible that with this setting some tasks stay queued even when ECS manages to scale up, since the older tasks are still running and holding the slots, isn’t it @Marvin
m
thought for 47.9 seconds
You’re right: if you cap the work pool too low, Prefect will keep runs queued even if ECS scales up and has room. That’s by design: the pool limit is a hard ceiling on “in-flight” runs from Prefect.

Ways to avoid underutilization while still preventing RESOURCE:CPU errors:
- Set the pool limit ≈ your steady-state cluster capacity
  - Tune it so it reflects what your EC2 cluster can actually place most of the time.
  - If your ASG scales up/down, update the limit dynamically from your infra (see below).
- Dynamically adjust the pool limit when your ASG scales
  - Wire a small script (triggered by ASG/CloudWatch events or a cron) to bump the pool limit up/down:
```
prefect work-pool set-concurrency-limit "<your-ecs-pool>" <new_limit>
```
  - CLI ref: CLI: work-pool
- Use task-level concurrency for the heavy stuff instead of throttling the whole pool
  - Tag heavy tasks/flows and create a tag limit so lighter runs can continue to use new capacity (see the tagging sketch after this list):
```
prefect concurrency-limit create heavy-ecs 5
```
  - CLI ref: CLI: concurrency-limit
- Split workloads into multiple pools with different limits
  - E.g., a “cpu-heavy” pool (lower limit) and a “standard” pool (higher limit) to better match capacity.
- AWS-side: keep a small warm min capacity and use capacity providers with managed scaling to reduce placement failures; ensure task CPU/memory fit your instance types.

Context on the behavior
- Without a pool limit, the worker submits immediately to ECS and may hit: RuntimeError: Failed to run ECS task: RESOURCE:CPU. This is a known behavior when EC2 capacity isn’t ready: prefecthq/prefect#16111
- The ECS worker does a few internal retries with backoff, but it will not “wait indefinitely” for EC2 to come online. Prefect-side queuing (via pool/tag concurrency) is the reliable way to wait.

If you share your typical/peak task sizes and ASG scale ranges, I can suggest a starting pool limit and a simple script to auto-adjust it as your cluster scales.
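A minimal sketch of the tagging side of that option, assuming Prefect 3.x; heavy-ecs is just the example tag from the command above, and the flow/task names are placeholders:

```python
# Sketch: tag the heavy work so the "heavy-ecs" tag concurrency limit applies to
# those task runs, while untagged (lighter) work can still use new ECS capacity.
from prefect import flow, task


@task(tags=["heavy-ecs"])  # runs of this task count against the heavy-ecs tag limit
def crunch_large_dataset(path: str) -> int:
    # placeholder for the CPU/memory-heavy work
    return len(path)


@task
def light_cleanup() -> None:
    # untagged: not throttled by the heavy-ecs limit
    pass


@flow
def my_etl_flow():
    crunch_large_dataset("s3://bucket/big-file.parquet")
    light_cleanup()
```

Combined with the prefect concurrency-limit create command above, at most 5 heavy-ecs task runs execute at once, while everything else in the pool keeps flowing.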