Madushan Pathirana (09/05/2025, 2:41 PM)

Marvin (09/05/2025, 2:45 PM)
1) Inspect the flow run's state and logs
prefect flow-run inspect <flow_run_id>
prefect flow-run logs <flow_run_id>
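- The last lines of the run's logs usually show what happened right before the crash; for example:
prefect flow-run logs <flow_run_id> | tail -n 50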
2) Check the worker logs around the crash
- If you start a worker manually, run it with debug logs:
prefect worker start -p <your-pool> --log-level DEBUG
- If running under systemd:
journalctl -u prefect-worker -e --since "2025-09-05 10:00"
- Look for messages like “received SIGTERM”, “shutting down”, or exceptions preceding termination.
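- For example, to narrow the journal to the crash window and common shutdown markers (this assumes a systemd unit named prefect-worker; adjust the unit name and times to your setup):
journalctl -u prefect-worker --since "2025-09-05 14:30" --until "2025-09-05 15:00" | grep -i -E "sigterm|shutting down|traceback|heartbeat"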
Given your setup (single EC2 host running Prefect Server + workers, ~8GB RAM, RDS Postgres):
3) Check for EC2/OS memory pressure (most common on single-node setups)
- Kernel/OS OOM killer:
dmesg -T | grep -i -E "oom|killed process"
sudo journalctl -k -r | grep -i -E "oom|killed process"
- If you see OOM events, reduce concurrency, optimize memory usage, or upgrade the instance.
- Monitor memory/CPU with top/htop or CloudWatch and align worker concurrency accordingly.
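- To see what is consuming memory right now, and to sample usage while you reproduce the crash (the interval and log path are just examples):
free -h
ps aux --sort=-%mem | head -n 10
while true; do date; free -m; sleep 60; done >> /tmp/mem-sample.log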
4) Check for host restarts or system events
- Instance uptime vs. crash time:
uptime -s
last -x | head
- CloudWatch “Status check failed” or “Rebooted” events near the crash time.
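- If the AWS CLI is configured on the host, you can also check instance status and scheduled events around the crash time:
aws ec2 describe-instance-status --instance-ids <instance_id> --include-all-instances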
If you’re running flows in containers (skip this if not applicable):
5) Docker
- Inspect the container that ran the flow:
docker ps -a --no-trunc
docker inspect <container_id> | grep -i -E "OOMKilled|ExitCode"
docker logs <container_id>
- OOMKilled=true or ExitCode=137 indicates the container was killed (137 = 128 + SIGKILL, typically from the OOM killer or a forced stop).
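- To scan several exited containers at once (the format strings are Go templates; adjust the fields as needed):
docker ps -a --filter status=exited --format "{{.ID}} {{.Names}} {{.Status}}"
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}} FinishedAt={{.State.FinishedAt}}' <container_id>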
6) Kubernetes
- Pod events:
kubectl describe pod <pod> -n <ns> | grep -i -E "Reason:|OOMKilled|Evicted"
kubectl logs <pod> -n <ns> --previous
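- Recent namespace events are often the quickest way to spot OOM kills or evictions:
kubectl get events -n <ns> --sort-by=.lastTimestamp | grep -i -E "oomkilled|evicted|killing"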
Check connectivity/heartbeats:
7) Network/API interruptions
- If the API (Prefect Server) was down or unreachable from the worker for a while, the worker logs will show request failures or heartbeat errors. Since your Server and worker run on the same EC2 instance, also check the server's own logs and the OS logs for restarts around the crash time.
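- As a quick check (assuming the default port 4200 and the server running on the same host; if the server runs under systemd, substitute your actual unit name):
curl -sf http://127.0.0.1:4200/api/health && echo "API reachable"
sudo journalctl -u prefect-server -e --since "2025-09-05 14:00"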
Make it easier next time
- Run workers with debug logs:
prefect worker start -p <pool> --log-level DEBUG
- Limit concurrency to fit 8GB RAM:
prefect work-pool set-concurrency-limit <pool> 1
or set a lower PREFECT__WORKER__CONCURRENCY in your worker environment.
- Add host monitoring (CloudWatch) for memory, CPU, and OOM events.
- If you’re using Docker/K8s, set explicit memory limits that match the instance capacity to avoid kernel OOMs affecting the whole box.
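- As a rough sketch for Docker: cap each flow-run container so a runaway flow can't take down the whole host (the 2g value is only an example; for a Docker work pool you'd set the equivalent limit in the pool's base job template):
docker run --memory=2g --memory-swap=2g <your-flow-image>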
Helpful docs
- States overview: Prefect 3.x states
- Worker health checks: Worker healthchecks
- Detecting “zombie”/stalled runs: Detect zombie flows
A couple of quick questions so I can tailor the next steps:
- Which work pool type are you using (process, docker, kubernetes)?
- Did any other services on the EC2 host restart around the same time (e.g., Postgres client errors, Prefect Server logs)?