<@ULVA73B9P> I've got a Prefect Cloud AWS ECS push...
# marvin-ai
j
@Marvin I've got a Prefect Cloud AWS ECS push pool setup with a high success rate but I'm occasionally getting crashes with the following error.
Flow run could not be submitted to infrastructure: Failed to create ECS task run: [{'reason': 'Capacity is unavailable at this time. Please try again later or in a different availability zone'}]
Assuming I can't mitigate this by adjusting my availability zones on AWS, what options does Prefect offer for dealing with this? Is setting up retries my best bet, or are there other ways?
m
thought for 451.5 seconds
You’re running into a transient AWS/Fargate capacity error. With a push work pool, Prefect asks ECS to start the task; if ECS rejects the request, the flow run will move to CRASHED. There isn’t a built-in retry at the push-pool submission layer, so you’ll want to add resiliency at the orchestration level.

Recommended options (Prefect 3.x):

1) Flow-level retries (most direct)
- Add retries on the flow so Prefect automatically reschedules the run after a crash. Include a delay and a bit of jitter to avoid “thundering herd” resubmissions.
from prefect import flow

@flow(
    retries=5,
    retry_delay_seconds=120,     # give ECS time to regain capacity
    retry_jitter_factor=0.3,     # randomize the delay slightly
)
def my_flow():
    ...
- This will resubmit the entire flow run if the previous attempt CRASHED or FAILED.
- Docs: Flow retries

2) Smooth out submission bursts with work pool/queue concurrency limits
- If these errors happen when many runs start at once, set a concurrency limit so Prefect only submits up to N runs at a time. Additional runs will wait for a slot instead of all hitting ECS simultaneously.
prefect work-pool set-concurrency-limit "my-ecs-push-pool" 5
# or per-queue
prefect work-queue set-concurrency-limit "my-queue" 5 --pool "my-ecs-push-pool"
- Docs: Work pools and concurrency limits

3) Automations as a fallback/resubmission pattern
- Create an Automation that triggers on flow-run.Crashed (scoped to your deployment or pool) and takes a Run Deployment action to kick off a new run. This is useful if you want conditional logic, notifications, or to route retries to a “fallback” deployment (see the sketch at the end of this message).
- Docs:
  - Creating automations
  - Event triggers
  - RunDeployment action

Notes and tradeoffs
- Push pools don’t auto-retry the ECS submission; the flow run will CRASH if ECS rejects it. Flow-level retries are the simplest, most reliable way to recover.
- Concurrency limits help if capacity errors correlate with spikes. If errors occur even at low volume, stick with flow-level retries (and consider a longer retry delay).
- Ensure your flow is idempotent before enabling retries since the entire flow will re-run.

AWS-specific mitigations (non-Prefect)
- Smaller CPU/memory per task, different capacity providers, or additional subnets/AZs usually help, but you noted those aren’t an option.

If you share whether these errors cluster around start-time spikes, I can suggest concrete concurrency values and a retry strategy tailored to your cadence.

Helpful references:
- Push/serverless pools overview: Serverless (push) pools
- ECS guide: Prefect AWS ECS guide
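For option 3, here is a minimal sketch of that automation created from Python instead of the UI, assuming Prefect 3.x. The module paths (`prefect.automations`, `prefect.events`), the trigger/action field names, the related-resource label format, and the deployment UUID are written from memory or as placeholders, so verify them against the Automations docs (or a real flow-run event payload in the UI) before relying on this.

```python
from uuid import UUID

from prefect.automations import Automation
from prefect.events.actions import RunDeployment
from prefect.events.schemas.automations import EventTrigger

# Placeholder: the UUID of the deployment whose crashed runs should be resubmitted.
DEPLOYMENT_ID = UUID("00000000-0000-0000-0000-000000000000")

Automation(
    name="resubmit-on-crash",  # placeholder name
    trigger=EventTrigger(
        # Fire when a flow run crashes...
        expect={"prefect.flow-run.Crashed"},
        # ...but only for runs related to this deployment. The label format
        # here is assumed; check a real flow-run event in the UI to confirm it.
        match_related={
            "prefect.resource.role": "deployment",
            "prefect.resource.id": f"prefect.deployment.{DEPLOYMENT_ID}",
        },
        posture="Reactive",
        threshold=1,
    ),
    actions=[
        # Kick off a fresh run of the same deployment. Point this at a separate
        # "retry" deployment instead if you want different sizing for retries.
        RunDeployment(source="selected", deployment_id=DEPLOYMENT_ID),
    ],
).create()
```

The same trigger/action pair can also be configured entirely in the Prefect Cloud UI under Automations if you’d rather not manage it in code.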
j
@Marvin Aye, that's what I was expecting, thanks. We've already got a sensible concurrency limit; it's likely there's just less capacity in general for the high-vCPU tasks we're running. Back to the retries: I was hoping to only retry on this specific crash and not other failures. Are Automations the only way to achieve this? I was kinda hoping to avoid setting that up, but that could work if it's the only solution.
m
thought for 208.0 seconds
Got it, and that makes sense for high-vCPU Fargate tasks.

Short answer:
- Flow-level retries can’t be scoped to “only retry this specific crash.” They’ll retry on any Failed or Crashed state.
- Automations are the built-in way to target CRASHED runs, but they can’t filter by a specific provider error message either.
- If “retry on any crash” is acceptable, an Automation is the cleanest solution. If you truly need “only capacity errors,” you’ll need a small bit of glue code outside Prefect.

What you can do:

Option A: Retry only when the flow run CRASHES (but not on Failed)
- Create an Automation:
  - Trigger: event = `prefect.flow-run.Crashed`
  - Scope: your specific deployment (or work pool/queue)
  - Action: Run Deployment (the same deployment)
- This avoids retrying on “Failed” runs (i.e., failures inside your code), but it will retry on any CRASHED cause (image pull error, IAM issues, etc.), not just ECS capacity.
- Docs:
  - Creating automations
  - Event triggers

Option B: Only retry when the crash message contains “Capacity is unavailable…”
- Prefect Automations can’t filter on the state message. To do this precisely:
  - Trigger: `prefect.flow-run.Crashed` for your deployment
  - Action: call a small webhook/service you own that:
    1) inspects the event payload via the Prefect API (or reads the included state message)
    2) if it contains your capacity string, calls Prefect’s API to run the deployment
    3) otherwise does nothing
- This keeps logic out of your flows and avoids blanket retries. It’s a bit of glue, but minimal and robust (there’s a sketch at the end of this message).

Option C: Accept unconditional retries but add guardrails
- If you can live with “retry all Failed and Crashed,” use flow-level retries:
from prefect import flow

@flow(retries=3, retry_delay_seconds=180, retry_jitter_factor=0.3)
def my_flow():
    ...
- This is the simplest, but will retry application failures too.

Extra safeguards for any automation approach
- Add a threshold/window to prevent loops (e.g., “trigger only if ≤1 crash within 10 minutes”). See thresholds in automations.
- Consider a separate “retry” deployment (optionally with a smaller CPU/memory request or a different pool) and have the Automation run that deployment instead of the original. This avoids re-trigger storms and lets you alter the retry posture.

References
- Push/serverless pools: Serverless (push) pools
- Flow retries: Flow retries
- Automations: Creating automations and Event triggers

If “retry on any CRASHED but not FAILED” is good enough, I can outline the exact Automation configuration to set up in a minute. If you really need the exact capacity message match, I can sketch the tiny webhook that performs the check and triggers a rerun.
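To make option B concrete, here is a minimal sketch of that webhook. It assumes a FastAPI service and uses `run_deployment` from `prefect.deployments` (a real Prefect function); the endpoint path, the deployment name, the capacity string, and especially the shape of the incoming payload are assumptions you’d adapt to whatever your Automation actually sends.

```python
# Minimal sketch of the "glue" webhook: resubmit a crashed run only when the
# crash message is the ECS capacity error. The payload shape, deployment name,
# and capacity string below are assumptions -- adapt them to your setup.
from fastapi import FastAPI
from prefect.deployments import run_deployment

app = FastAPI()

CAPACITY_ERROR = "Capacity is unavailable at this time"
DEPLOYMENT_NAME = "my-flow/my-ecs-deployment"  # hypothetical "flow-name/deployment-name"


@app.post("/prefect/crashed")
def handle_crashed_run(event: dict):
    # Assumption: the Automation forwards the flow-run event and the state
    # message is reachable somewhere in the payload; adjust the lookup path
    # to match what your webhook actually receives.
    message = (
        event.get("payload", {}).get("state", {}).get("message")
        or event.get("state", {}).get("message")
        or ""
    )

    if CAPACITY_ERROR not in message:
        # Any other crash cause (image pull, IAM, etc.): do nothing.
        return {"resubmitted": False}

    # Kick off a fresh run of the same deployment. timeout=0 returns as soon
    # as the new run is created instead of waiting for it to finish.
    new_run = run_deployment(name=DEPLOYMENT_NAME, timeout=0)
    return {"resubmitted": True, "flow_run_id": str(new_run.id)}
```

Run it somewhere with PREFECT_API_URL and PREFECT_API_KEY set so `run_deployment` can reach your workspace, and have the Automation (via whatever webhook/notification mechanism you prefer) POST the crashed-run event to the `/prefect/crashed` endpoint.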