Jordan Fisher
10/24/2025, 10:39 AM
Flow run could not be submitted to infrastructure: Failed to create ECS task run: [{'reason': 'Capacity is unavailable at this time. Please try again later or in a different availability zone'}]
Assuming I can't mitigate this by adjusting my availability zones on AWS, what options does Prefect offer for dealing with this? Is setting up retries my best bet, or are there other ways?
Marvin
10/24/2025, 10:39 AM
Marvin
10/24/2025, 10:47 AM
1) Flow-level retries (simplest)
from prefect import flow
@flow(
    retries=5,
    retry_delay_seconds=120,  # give ECS time to regain capacity
)
def my_flow():
    ...
- This will resubmit the entire flow run if the previous attempt CRASHED or FAILED.
- Docs: Flow retries
2) Smooth out submission bursts with work pool/queue concurrency limits
- If these errors happen when many runs start at once, set a concurrency limit so Prefect only submits up to N runs at a time. Additional runs will wait for a slot instead of all hitting ECS simultaneously.
prefect work-pool set-concurrency-limit "my-ecs-push-pool" 5
# or per-queue
prefect work-queue set-concurrency-limit "my-queue" 5 --pool "my-ecs-push-pool"
- Docs: Work pools and concurrency limits
3) Automations as a fallback/resubmission pattern
- Create an Automation that triggers on prefect.flow-run.Crashed (scoped to your deployment or pool) and takes a Run Deployment action to kick off a new run (see the sketch after the docs links below). This is useful if you want conditional logic, notifications, or to route retries to a “fallback” deployment.
- Docs:
- Creating automations
- Event triggers
- RunDeployment action
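To make this concrete, here is a minimal sketch of that Automation created through the Python API (Prefect 3.x). The automation name, deployment ID, and threshold/window values are placeholders and assumptions; the same trigger and action can be configured in the UI instead.
from datetime import timedelta
from uuid import UUID

from prefect.automations import Automation
from prefect.events.actions import RunDeployment
from prefect.events.schemas.automations import EventTrigger, Posture

DEPLOYMENT_ID = UUID("00000000-0000-0000-0000-000000000000")  # replace with your deployment's ID

Automation(
    name="resubmit-crashed-ecs-runs",
    trigger=EventTrigger(
        expect={"prefect.flow-run.Crashed"},
        match_related={
            "prefect.resource.role": "deployment",
            "prefect.resource.id": f"prefect.deployment.{DEPLOYMENT_ID}",
        },
        posture=Posture.Reactive,
        threshold=1,                  # fire on each matching crash
        within=timedelta(seconds=0),  # raise threshold/within to guard against retry loops
    ),
    actions=[
        RunDeployment(source="selected", deployment_id=DEPLOYMENT_ID),
    ],
).create()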
Notes and tradeoffs
- Push pools don’t auto-retry the ECS submission; the flow run will CRASH if ECS rejects it. Flow-level retries are the simplest, most reliable way to recover.
- Concurrency limits help if capacity errors correlate with spikes. If errors occur even at low volume, stick with flow-level retries (and consider a longer retry delay).
- Ensure your flow is idempotent before enabling retries since the entire flow will re-run.
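For example (illustrative only; the bucket and key are placeholders): writing each attempt's output to a deterministic key means a retried run overwrites its own result instead of producing duplicates.
import json

import boto3
from prefect import flow

@flow(retries=5, retry_delay_seconds=120)
def daily_export(report_date: str):
    payload = json.dumps({"report_date": report_date}).encode()
    s3 = boto3.client("s3")
    # Same key on every attempt: a retry overwrites the earlier output
    # rather than appending a duplicate record.
    s3.put_object(Bucket="my-bucket", Key=f"reports/{report_date}.json", Body=payload)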
AWS-specific mitigations (non-Prefect)
- Smaller CPU/memory per task, different capacity providers, or additional subnets/AZs usually help, but you noted those aren’t an option.
If you share whether these errors cluster around start-time spikes, I can suggest concrete concurrency values and a retry strategy tailored to your cadence.
Helpful references:
- Push/serverless pools overview: Serverless (push) pools
- ECS guide: Prefect AWS ECS guide
Jordan Fisher
10/24/2025, 10:53 AM
Marvin
10/24/2025, 10:53 AM
Marvin
10/24/2025, 10:56 AM
Option A: Retry on any CRASHED run (no message filtering)
- Trigger: prefect.flow-run.Crashed
- Scope: your specific deployment (or work pool/queue)
- Action: Run Deployment (the same deployment)
- This avoids retrying on “Failed” runs (i.e., failures inside your code), but it will retry on any CRASHED cause (image pull error, IAM issues, etc.), not just ECS capacity.
- Docs:
- Creating automations
- Event triggers
Option B: Only retry when the crash message contains “Capacity is unavailable…”
- Prefect Automations can’t filter on the state message. To do this precisely:
- Trigger: prefect.flow-run.Crashed for your deployment
- Action: call a small webhook/service you own that:
1) inspects the event payload via Prefect API (or reads the included state message)
2) if it contains your capacity string, call Prefect’s API to run the deployment
3) otherwise do nothing
- This keeps logic out of your flows and avoids blanket retries. It’s a bit of glue, but minimal and robust.
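A minimal sketch of that webhook, assuming a FastAPI service and that the Automation's webhook action is configured to POST a JSON body containing the crashed run's ID (the flow_run_id key and the deployment name below are assumptions):
from uuid import UUID

from fastapi import FastAPI, Request
from prefect import get_client
from prefect.deployments import run_deployment

app = FastAPI()
CAPACITY_MESSAGE = "Capacity is unavailable at this time"

@app.post("/prefect/crashed")
async def handle_crashed(request: Request):
    payload = await request.json()
    flow_run_id = UUID(payload["flow_run_id"])  # assumed key, set in the webhook body template

    # Read the crashed run's state message from the Prefect API
    async with get_client() as client:
        flow_run = await client.read_flow_run(flow_run_id)
    message = (flow_run.state.message or "") if flow_run.state else ""

    if CAPACITY_MESSAGE not in message:
        return {"resubmitted": False, "reason": "not an ECS capacity error"}

    # Kick off a fresh run of the same deployment with the original parameters
    new_run = await run_deployment(
        name="my-flow/my-deployment",  # assumed "<flow name>/<deployment name>"
        parameters=flow_run.parameters,
        timeout=0,                     # return immediately; don't wait for the new run
    )
    return {"resubmitted": True, "new_flow_run_id": str(new_run.id)}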
Option C: Accept unconditional retries but add guardrails
- If you can live with “retry all Failed and Crashed,” use flow-level retries:
from prefect import flow

@flow(retries=3, retry_delay_seconds=180)
def my_flow():
    ...
- This is the simplest, but will retry application failures too.
Extra safeguards for any automation approach
- Add a threshold/window to prevent loops (e.g., “trigger only if ≤1 crash within 10 minutes”). See thresholds in automations.
- Consider a separate “retry” deployment (optionally with smaller CPU/mem request or different pool) and have the Automation run that deployment instead of the original. This avoids re-trigger storms and lets you alter the retry posture.
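An illustrative sketch of that pattern, assuming an ECS push pool whose base job template exposes cpu/memory job variables; the pool name, image, and sizes are placeholders:
from prefect import flow

@flow
def my_flow():
    ...

if __name__ == "__main__":
    image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest"  # placeholder
    # Primary deployment at the normal task size
    my_flow.deploy(
        name="primary",
        work_pool_name="my-ecs-push-pool",
        image=image,
        job_variables={"cpu": 1024, "memory": 4096},
        build=False,
        push=False,
    )
    # Smaller "retry" deployment for the Automation to run after a capacity crash
    my_flow.deploy(
        name="retry-fallback",
        work_pool_name="my-ecs-push-pool",
        image=image,
        job_variables={"cpu": 512, "memory": 2048},
        build=False,
        push=False,
    )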
References
- Push/serverless pools: Serverless (push) pools
- Flow retries: Flow retries
- Automations: Creating automations and Event triggers
If “retry on any CRASHED but not FAILED” is good enough, I can outline the exact Automation configuration to set up in a minute. If you really need the exact capacity message match, I can sketch the tiny webhook that performs the check and triggers a rerun.