Madushan Pathirana (09/05/2025, 2:41 PM)

Marvin (09/05/2025, 2:45 PM)
1) Inspect the flow run's state and logs
prefect flow-run inspect <flow_run_id>
prefect flow-run logs <flow_run_id>
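- The last lines of the run's logs usually show what happened right before the crash; for example:
prefect flow-run logs <flow_run_id> | tail -n 50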
2) Check the worker logs around the crash
- If you start a worker manually, run it with debug logs:
prefect worker start -p <your-pool> --log-level DEBUG
- If running under systemd:
journalctl -u prefect-worker -e --since "2025-09-05 10:00"
- Look for messages like “received SIGTERM”, “shutting down”, or exceptions preceding termination.
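- For example, to narrow the journal to the crash window and common shutdown markers (this assumes a systemd unit named prefect-worker; adjust the unit name and times to your setup):
journalctl -u prefect-worker --since "2025-09-05 14:30" --until "2025-09-05 15:00" | grep -i -E "sigterm|shutting down|traceback|heartbeat"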
Given your setup (single EC2 host running Prefect Server + workers, ~8GB RAM, RDS Postgres):
3) Check for EC2/OS memory pressure (most common on single-node setups)
- Kernel/OS OOM killer:
dmesg -T | grep -i -E "oom|killed process"
sudo journalctl -k -r | grep -i -E "oom|killed process"
- If you see OOM events, reduce concurrency, optimize memory usage, or upgrade the instance.
- Monitor memory/CPU with top/htop or CloudWatch and align worker concurrency accordingly.
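- To see what is consuming memory right now, and to sample usage while you reproduce the crash (the interval and log path are just examples):
free -h
ps aux --sort=-%mem | head -n 10
while true; do date; free -m; sleep 60; done >> /tmp/mem-sample.log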
4) Check for host restarts or system events
- Instance uptime vs. crash time:
uptime -s
last -x | head
- CloudWatch “Status check failed” or “Rebooted” events near the crash time.
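- If the AWS CLI is configured on the host, you can also check instance status and scheduled events around the crash time:
aws ec2 describe-instance-status --instance-ids <instance_id> --include-all-instances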
If you’re running flows in containers (skip this if not applicable):
5) Docker
- Inspect the container that ran the flow:
docker ps -a --no-trunc
docker inspect <container_id> | grep -i -E "OOMKilled|ExitCode"
docker logs <container_id>
- OOMKilled=true or ExitCode=137 indicates the container was killed (137 = 128 + SIGKILL, typically from the OOM killer or a forced stop).
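- To scan several exited containers at once (the format strings are Go templates; adjust the fields as needed):
docker ps -a --filter status=exited --format "{{.ID}} {{.Names}} {{.Status}}"
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}} FinishedAt={{.State.FinishedAt}}' <container_id>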
6) Kubernetes
- Pod events:
kubectl describe pod <pod> -n <ns> | grep -i -E "Reason:|OOMKilled|Evicted"
kubectl logs <pod> -n <ns> --previous
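- Recent namespace events are often the quickest way to spot OOM kills or evictions:
kubectl get events -n <ns> --sort-by=.lastTimestamp | grep -i -E "oomkilled|evicted|killing"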
Check connectivity/heartbeats:
7) Network/API interruptions
- If the API (Prefect Server) was down or unreachable from the worker for a while, the worker logs will show request failures or heartbeat errors. Since your Server and worker run on the same EC2 instance, also check the server's own logs and the OS logs for restarts around the crash time.
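- As a quick check (assuming the default port 4200 and the server running on the same host; if the server runs under systemd, substitute your actual unit name):
curl -sf http://127.0.0.1:4200/api/health && echo "API reachable"
sudo journalctl -u prefect-server -e --since "2025-09-05 14:00"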
Make it easier next time
- Run workers with debug logs:
prefect worker start -p <pool> --log-level DEBUG
- Limit concurrency to fit 8GB RAM:
prefect work-pool set-concurrency-limit <pool> 1
or set a lower PREFECT__WORKER__CONCURRENCY in your worker environment.
- Add host monitoring (CloudWatch) for memory, CPU, and OOM events.
- If you’re using Docker/K8s, set explicit memory limits that match the instance capacity to avoid kernel OOMs affecting the whole box.
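- As a rough sketch for Docker: cap each flow-run container so a runaway flow can't take down the whole host (the 2g value is only an example; for a Docker work pool you'd set the equivalent limit in the pool's base job template):
docker run --memory=2g --memory-swap=2g <your-flow-image>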
Helpful docs
- States overview: Prefect 3.x states
- Worker health checks: Worker healthchecks
- Detecting “zombie”/stalled runs: Detect zombie flows
A couple of quick questions so I can tailor the next steps:
- Which work pool type are you using (process, docker, kubernetes)?
- Did any other services on the EC2 host restart around the same time (e.g., Postgres client errors, Prefect Server logs)?