# marvin-ai
@Marvin how do I handle OOM issues with the Prefect ECS worker? My pipelines are on demand and the CPU/RAM requirements are unpredictable.
thought for 344.3 seconds
Great question — with ECS (EC2 launch type), OOMs usually come from containers hitting their hard memory limit or from the EC2 host not having enough free memory to place the task. Here are practical patterns that work well with Prefect 3.x ECS workers when CPU/RAM needs are unpredictable.

Key ideas
- Set small defaults in your ECS work pool, then override per-run with job variables when you need more.
- Add an "escalate on OOM" retry pattern that re-runs the deployment with larger resources.
- Control concurrency so your single EC2 box isn't overcommitted.
- If you use the EC2 launch type, the task must fit on the host. Larger per-run requests will fail to place if the instance is full.

1) Set conservative defaults, override per-run
- In your ECS work pool, keep the default `cpu`/`memory` small (e.g., `cpu=512`, `memory=1024`).
- When you trigger a run that needs more, pass job variables to override:

CLI

```bash
prefect deployment run <FLOW_NAME>/<DEPLOYMENT_NAME> \
  -jv cpu=2048 \
  -jv memory=4096 \
  --watch
```
Python (when deploying)

```python
# Prefect 3.x: deploy from source and set per-deployment job variables
flow.from_source(...).deploy(
    name="ecs-small",
    work_pool_name="ecs-ec2-pool",
    job_variables={"cpu": 512, "memory": 1024, "launch_type": "EC2"},
)
```
You can also set these at run time in the UI (Run deployment -> Job variables). Docs: Customize job variables

2) Create size tiers (small/medium/large)
- Make a few deployments of the same flow with different job variable defaults:
  - ecs-small: `cpu=512, memory=1024`
  - ecs-medium: `cpu=1024, memory=4096`
  - ecs-large: `cpu=2048, memory=8192`
- Trigger the right one based on what you expect for that run. This is simple, safe, and easy to adopt (see the sketch below).
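If you would rather register the tiers in code than by hand, here is a minimal sketch using Prefect 3.x's `flow.from_source(...).deploy(...)`. The repository URL, entrypoint, image, and pool name are placeholders, and the sketch assumes the image is already built and pushed:

```python
from prefect import flow

# Tier presets: ECS cpu units and memory in MiB (illustrative values).
TIERS = {
    "ecs-small": {"cpu": 512, "memory": 1024},
    "ecs-medium": {"cpu": 1024, "memory": 4096},
    "ecs-large": {"cpu": 2048, "memory": 8192},
}

# Placeholder repository and entrypoint for the flow being deployed.
remote_flow = flow.from_source(
    source="https://github.com/your-org/your-repo.git",
    entrypoint="flows/pipeline.py:pipeline",
)

for name, resources in TIERS.items():
    remote_flow.deploy(
        name=name,
        work_pool_name="ecs-ec2-pool",
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/pipeline:latest",
        build=False,   # reuse an existing image rather than building one per tier
        push=False,
        job_variables={**resources, "launch_type": "EC2"},
    )
```

Running this once creates all three deployments against the same flow, so triggering a bigger run is just a matter of picking a different deployment name.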
3) Automatic "retry with more memory" on OOM
- ECS OOMs usually show exit code 137 and a reason like "OutOfMemoryError."
- Create a Prefect Automation: on flow-run crashed/failed and the message includes "OutOfMemory" or "137", then "Run deployment" again with higher `job_variables` (e.g., double `memory` and `cpu`).
- This turns sporadic OOMs into an automatic retry with bigger resources (a code-side sketch of the same idea follows below).
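If you prefer to drive the escalation from code instead of a UI Automation, here is a minimal sketch assuming the `job_variables` argument of `prefect.deployments.run_deployment` (available in recent Prefect releases); the size ladder and the deployment name you would pass in are placeholders:

```python
from typing import Optional

from prefect.deployments import run_deployment

# Illustrative escalation ladder: ECS cpu units / memory in MiB per attempt.
SIZES = [
    {"cpu": 512, "memory": 1024},
    {"cpu": 1024, "memory": 4096},
    {"cpu": 2048, "memory": 8192},
]


def run_with_escalation(deployment: str, parameters: Optional[dict] = None):
    """Re-run a deployment with progressively larger resources until an attempt completes.

    This is a code-side alternative to a UI Automation: rather than matching
    "OutOfMemory"/137 in the crash message, any failed attempt is simply retried
    at the next size up.
    """
    for size in SIZES:
        # run_deployment waits for the run to reach a terminal state by default
        flow_run = run_deployment(name=deployment, parameters=parameters, job_variables=size)
        if flow_run.state and flow_run.state.is_completed():
            return flow_run
    raise RuntimeError(f"{deployment} failed even at the largest size; check the ECS stoppedReason.")
```

You would call something like `run_with_escalation("my-flow/ecs-default")` from a small wrapper script or parent flow; that deployment name is hypothetical.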
4) Consider memoryReservation (soft limit) vs memory (hard limit)
- On the EC2 launch type, you can reduce OOM kills by using a soft limit (`memoryReservation`) plus a hard cap (`memory`) in the ECS task/container.
- To do this, customize your ECS work pool's base job template to include `memoryReservation` for the container definition (a rough fragment is sketched below). Set `memoryReservation` to your normal expected use and `memory` to a max.
- See Prefect's ECS guide for the job template structure and supported variables: Prefect ECS guide
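As a rough illustration only (not a full work pool template): `memory` and `memoryReservation` are plain ECS container definition fields in MiB. The values below are assumptions, and where exactly the fragment lands depends on your template; you can dump the default with `prefect work-pool get-default-base-job-template --type ecs` (with prefect-aws installed) and merge the two fields into the container definition it generates.

```python
# Illustrative ECS container definition fragment (values in MiB, chosen as examples).
container_definition_fragment = {
    "memoryReservation": 2048,  # soft limit: normal expected usage
    "memory": 4096,             # hard cap: the container is OOM-killed above this
}
```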
to a max. - See Prefect’s ECS guide for the job template structure and supported variables: Prefect ECS guide 5) Guard your single EC2 host - With one 8GB instance running Server + workers + tasks, set a low concurrency so you don’t oversubscribe memory. A common starting point is allowing 1 ECS task at a time for “large” runs, and 2–3 for small/light jobs. - If a task requests more memory than the EC2 host can provide, placement will fail and the ECS worker will keep retrying create; prefer right-sizing or scaling your instance/cluster capacity for production. 6) Operational tips - Look for ECS stoppedReason and exit code (137) to confirm OOM vs other causes. - If you see frequent OOMs even after increasing container memory, check for Python process memory leaks, large in-memory dataframes, or unbounded concurrency in your code. - For production, consider moving Prefect Server/worker to a separate instance (or use Fargate) to prevent orchestration from competing for memory with workloads. If you’d like, share: - Your current ECS work pool cpu/memory defaults - An example failed run’s ECS exit code/stoppedReason - The size(s) you want for small/medium/large I can propose exact job variable presets and a template you can drop into your work pool for EC2. Links again for reference: - Prefect ECS guide
@Marvin I'm using ECS Fargate for both the server and the ECS worker.
thought for 63.6 seconds
Got it — since you're on ECS Fargate for both Prefect Server and the ECS worker, here's how to minimize OOMs when resource needs are unpredictable.

What OOM means on Fargate
- Fargate enforces hard memory limits. If your container exceeds the task's memory, it's OOM-killed (often exit code 137).
- You must pick a valid CPU/memory combo per task. If you choose an invalid combo, the task won't start (a small helper sketch for this appears after the deploy example below).
- There's no memoryReservation soft limit on Fargate (that was an EC2-only tip) — so focus on picking the right hard limits.

Recommended patterns

1) Override CPU/memory per run with job variables
- Keep small defaults in your ECS work pool (e.g., cpu=512, memory=1024).
- For heavy runs, override at trigger time:

CLI

```bash
prefect deployment run <FLOW_NAME>/<DEPLOYMENT_NAME> \
  -jv cpu=2048 \
  -jv memory=4096 \
  --watch
```
Python (when deploying)

```python
# Prefect 3.x: deploy from source with Fargate job variables
flow.from_source(...).deploy(
    name="ecs-default",
    work_pool_name="ecs-fargate-pool",
    job_variables={"cpu": 512, "memory": 1024, "launch_type": "FARGATE"},
)
```
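Because Fargate only accepts specific CPU/memory pairings, it can help to snap a requested memory size to a valid combination before passing job variables. A small sketch using the commonly used combinations up to 4 vCPU (the table is from AWS documentation as I recall it; confirm against the current Fargate task sizing page):

```python
# Common Fargate CPU (units) -> allowed memory (MiB) values; verify against AWS docs.
FARGATE_COMBOS = {
    256: [512, 1024, 2048],
    512: [1024, 2048, 3072, 4096],
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def fargate_job_variables(min_memory_mib: int) -> dict:
    """Return the smallest valid Fargate cpu/memory combo with at least min_memory_mib."""
    for cpu, memories in FARGATE_COMBOS.items():
        for memory in memories:
            if memory >= min_memory_mib:
                return {"cpu": cpu, "memory": memory, "launch_type": "FARGATE"}
    raise ValueError(f"No Fargate size in this table fits {min_memory_mib} MiB")


# e.g. fargate_job_variables(3000) -> {"cpu": 512, "memory": 3072, "launch_type": "FARGATE"}
```

The returned dict can be passed straight through as job variables when triggering a run.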
Docs:
- Customize job variables: Customize job variables
- Prefect ECS guide: Prefect ECS guide
- AWS valid Fargate CPU/memory combos: Fargate CPU/memory combinations

2) Create size-tier deployments
- Make small/medium/large variants of the same flow, each with different default job variables:
  - small: cpu=512, memory=1024
  - medium: cpu=1024, memory=4096
  - large: cpu=2048, memory=8192
- Trigger the variant you need per run. Simple, safe, and predictable with Fargate.

3) "Retry with more memory" automation
- OOMs typically return exit code 137 or messages like "OutOfMemory."
- Create a Prefect Automation: on flow run failure/crash, re-run the same deployment with bigger job variables (e.g., double memory/CPU).
- This converts sporadic OOMs into a second attempt with more headroom (and keeps your defaults small for the common case).

4) Concurrency guardrails
- Because Fargate gives each task its own capacity, you don't compete for RAM with other tasks on the same host — but you can still overwhelm the account or hit throttles.
- Set a low work queue concurrency to avoid many large tasks starting at once. This reduces blast radius and "many-OOMs-at-once."
- Concurrency docs: Concurrency limits

5) If disk usage is high, bump ephemeral storage
- Fargate's default ephemeral storage is 20 GiB; you can raise it up to 200 GiB. If you see disk-related failures, add `ephemeralStorage` in your ECS job template (the base job template on the work pool); a rough fragment is sketched below.
- This is separate from OOM, but often correlated with memory-heavy pipelines.
- AWS ephemeral storage: Fargate ephemeral storage
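As a rough sketch only: at the ECS task definition level, ephemeral storage is expressed as a `sizeInGiB` field, so the fragment to merge into the work pool's base job template looks roughly like the dict below. The 100 GiB value is an arbitrary example, and the assumption that your template exposes the task definition directly should be checked by dumping the default template (`prefect work-pool get-default-base-job-template --type ecs`, with prefect-aws installed).

```python
# Illustrative fragment of an ECS task definition with expanded ephemeral storage.
# Fargate's default is 20 GiB; sizes up to 200 GiB are accepted.
task_definition_fragment = {
    "ephemeralStorage": {"sizeInGiB": 100},  # example size only
}
```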
Practical starting point for you
- Default deployment: cpu=512, memory=1024
- Medium: cpu=1024, memory=4096
- Large: cpu=2048, memory=8192
- Add an automation that, on failure, re-runs with the next size up.
- Keep work queue concurrency to 1–2 while you learn the workload's profile.

If you share:
- A sample failed run's exit code/stopped reason
- The memory ranges you expect (best guess)

I can propose exact job variable presets and a base job template snippet tailored for Fargate (including optional ephemeral storage).