Madushan Pathirana
09/25/2025, 1:22 PM

Marvin
09/25/2025, 1:22 PM

Marvin
09/25/2025, 1:27 PM
1) Keep the work pool defaults small (cpu=512, memory=1024).
- When you trigger a run that needs more, pass job variables to override:
CLI
prefect deployment run <FLOW_NAME>/<DEPLOYMENT_NAME> \
-jv cpu=2048 \
-jv memory=4096 \
--watch
Python (when deploying)
flow.from_source(...).deploy(
    name="ecs-small",
    work_pool_name="ecs-ec2-pool",
    job_variables={"cpu": 512, "memory": 1024, "launch_type": "EC2"},
)
You can also set these at run time in the UI (Run deployment -> Job variables).
Docs: Customize job variables
2) Create size tiers (small/medium/large)
- Make a few deployments of the same flow with different job variable defaults:
- ecs-small: cpu=512, memory=1024
- ecs-medium: cpu=1024, memory=4096
- ecs-large: cpu=2048, memory=8192
- Trigger the right one based on what you expect for that run. This is simple, safe, and easy to adopt.
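For illustration, a minimal sketch of registering these tiers from one flow, assuming the Prefect 3-style flow.from_source(...).deploy(...) API; the repo URL, entrypoint, image, and pool name are placeholders:
from prefect import flow

# Placeholder source, entrypoint, and image -- substitute your own.
pipeline = flow.from_source(
    source="https://github.com/your-org/your-repo",
    entrypoint="flows/pipeline.py:pipeline",
)

tiers = {
    "ecs-small": {"cpu": 512, "memory": 1024},
    "ecs-medium": {"cpu": 1024, "memory": 4096},
    "ecs-large": {"cpu": 2048, "memory": 8192},
}

# One deployment per size tier, all on the same EC2-backed work pool.
for name, size in tiers.items():
    pipeline.deploy(
        name=name,
        work_pool_name="ecs-ec2-pool",
        image="your-registry/your-image:tag",  # pre-built image
        build=False,
        push=False,
        job_variables={**size, "launch_type": "EC2"},
    )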
3) Automatic “retry with more memory” on OOM
- ECS OOMs usually show exit code 137 and a reason like “OutOfMemoryError.”
- Create a Prefect Automation: when a flow run crashes/fails and the message includes “OutOfMemory” or “137”, trigger “Run deployment” again with higher job_variables (e.g., double memory and cpu).
- This turns sporadic OOMs into an automatic retry with bigger resources.
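If you prefer to drive this from code rather than a UI Automation, here is a rough sketch using run_deployment from prefect.deployments; it assumes your Prefect version accepts a job_variables argument there, and the deployment name and size ladder are placeholders:
from prefect.deployments import run_deployment

# Placeholder escalation ladder -- align these with your own tiers.
SIZES = [
    {"cpu": 512, "memory": 1024},
    {"cpu": 1024, "memory": 4096},
    {"cpu": 2048, "memory": 8192},
]

def run_with_escalation(deployment: str = "my-flow/ecs-small"):
    # run_deployment waits for the run to finish by default; if it ends in
    # anything other than Completed (e.g. an OOM kill), retry one size up.
    for size in SIZES:
        flow_run = run_deployment(name=deployment, job_variables=size)
        if flow_run.state and flow_run.state.is_completed():
            return flow_run
    raise RuntimeError("Run failed even at the largest size tier")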
4) Consider memoryReservation (soft limit) vs memory (hard limit)
- On the EC2 launch type, you can reduce OOM kills by using a soft limit (memoryReservation) plus a hard cap (memory) on the ECS task/container.
- To do this, customize your ECS work pool’s base job template to include memoryReservation in the container definition. Set memoryReservation to your normal expected use and memory to a maximum.
- See Prefect’s ECS guide for the job template structure and supported variables: Prefect ECS guide
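For orientation only, an illustrative fragment (written as a Python dict) of the container definition fields involved; the real edit goes into your work pool’s base job template, which you can export with prefect work-pool get-default-base-job-template --type ecs, and the exact structure may differ by Prefect version:
# Illustrative container definition fragment (memory values in MiB).
container_definition = {
    "name": "{{ container_name }}",
    "image": "{{ image }}",
    # Soft limit: what the container normally needs; ECS reserves this much
    # but lets the container burst above it when the host has spare memory.
    "memoryReservation": 1024,
    # Hard cap: exceeding this gets the container killed (exit code 137).
    "memory": 4096,
}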
5) Guard your single EC2 host
- With one 8GB instance running Server + workers + tasks, set a low concurrency so you don’t oversubscribe memory. A common starting point is allowing 1 ECS task at a time for “large” runs, and 2–3 for small/light jobs.
- If a task requests more memory than the EC2 host can provide, placement will fail and the ECS worker will keep retrying the task launch; prefer right-sizing or scaling your instance/cluster capacity for production.
6) Operational tips
- Look for ECS stoppedReason and exit code (137) to confirm OOM vs other causes.
- If you see frequent OOMs even after increasing container memory, check for Python process memory leaks, large in-memory dataframes, or unbounded concurrency in your code.
- For production, consider moving Prefect Server/worker to a separate instance (or use Fargate) to prevent orchestration from competing for memory with workloads.
If you’d like, share:
- Your current ECS work pool cpu/memory defaults
- An example failed run’s ECS exit code/stoppedReason
- The size(s) you want for small/medium/large
I can propose exact job variable presets and a template you can drop into your work pool for EC2. Links again for reference:
- Prefect ECS guide

Marvin
09/25/2025, 1:27 PM

Madushan Pathirana
09/25/2025, 2:38 PM

Marvin
09/25/2025, 2:38 PM

Marvin
09/25/2025, 2:39 PM
1) Override cpu/memory per run with job variables
CLI
prefect deployment run <FLOW_NAME>/<DEPLOYMENT_NAME> \
-jv cpu=2048 \
-jv memory=4096 \
--watch
Python (when deploying)
flow.from_source(...).deploy(
    name="ecs-default",
    work_pool_name="ecs-fargate-pool",
    job_variables={"cpu": 512, "memory": 1024, "launch_type": "FARGATE"},
)
Docs:
- Customize job variables: Customize job variables
- Prefect ECS guide: Prefect ECS guide
- AWS valid Fargate CPU/memory combos: Fargate CPU/memory combinations
2) Create size-tier deployments
- Make small/medium/large variants of the same flow, each with different default job variables:
- small: cpu=512, memory=1024
- medium: cpu=1024, memory=4096
- large: cpu=2048, memory=8192
- Trigger the variant you need per run. Simple, safe, and predictable with Fargate.
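As a reference, one way to keep the tiers in a single place as job-variable presets; each pair below is a valid Fargate CPU/memory combination (cpu=512 allows 1–4 GB, cpu=1024 allows 2–8 GB, cpu=2048 allows 4–16 GB), but verify against the AWS table linked above:
# Tier presets reused as job_variables for a Fargate work pool.
FARGATE_TIERS = {
    "small":  {"cpu": 512,  "memory": 1024, "launch_type": "FARGATE"},
    "medium": {"cpu": 1024, "memory": 4096, "launch_type": "FARGATE"},
    "large":  {"cpu": 2048, "memory": 8192, "launch_type": "FARGATE"},
}

# Example usage when deploying a variant (flow object and pool name are placeholders):
# my_flow.deploy(name="ecs-large", work_pool_name="ecs-fargate-pool",
#                job_variables=FARGATE_TIERS["large"])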
3) “Retry with more memory” automation
- OOMs typically return exit code 137 or messages like “OutOfMemory.”
- Create a Prefect Automation: on flow run failure/crash, re-run the same deployment with bigger job variables (e.g., double memory/CPU).
- This converts sporadic OOMs into a second attempt with more headroom (and keeps your defaults small for the common case).
4) Concurrency guardrails
- Because Fargate gives each task its own capacity, you don’t compete for RAM with other tasks on the same host — but you can still overwhelm the account or hit throttles.
- Set a low work queue concurrency to avoid many large tasks starting at once. This reduces blast radius and “many-ooms-at-once.”
- Concurrency docs: Concurrency limits
5) If disk usage is high, bump ephemeral storage
- Fargate default ephemeral storage is 20 GiB; you can raise it up to 200 GiB. If you see disk-related failures, add ephemeralStorage in your ECS job template (the base job template on the work pool).
- This is separate from OOM, but often correlated with memory-heavy pipelines.
- AWS ephemeral storage: Fargate ephemeral storage
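If disk does turn out to be the constraint, here is a sketch of the ephemeralStorage addition, shown as a Python dict for readability; the actual change goes into the task_definition section of your Fargate work pool’s base job template, whose exact shape depends on your Prefect version:
# Illustrative task_definition fragment for a Fargate base job template.
task_definition_fragment = {
    "cpu": "{{ cpu }}",
    "memory": "{{ memory }}",
    # Fargate ephemeral storage: 20 GiB by default, configurable from 21 to 200 GiB.
    "ephemeralStorage": {"sizeInGiB": 100},  # example value only
}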
Practical starting point for you
- Default deployment: cpu=512, memory=1024
- Medium: cpu=1024, memory=4096
- Large: cpu=2048, memory=8192
- Add an automation that, on failure, re-runs with the next size up.
- Keep work queue concurrency to 1–2 while you learn the workload’s profile.
If you share:
- A sample failed run’s exit code/stopped reason
- The memory ranges you expect (best guess)
I can propose exact job variable presets and a base job template snippet tailored for Fargate (including optional ephemeral storage).