Sophia Ponte
10/13/2025, 10:28 PM

Marvin
10/13/2025, 10:28 PM

Marvin
10/13/2025, 10:35 PM
Could you share:
- The output of kubectl describe pod and kubectl get events
- Roughly how many task runs/nodes are in this flow run?
- What resources/limits are set on your API pod in EKS (CPU/memory)?
- Any stack traces or errors from the API logs when you load the graph?
Immediate mitigations
1) Reduce graph size the server will try to build
- Prefect 3.x exposes server-side caps for the graph:
- PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES (default 10000)
- PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS (default 10000)
- Lower these significantly to protect the API from excessive memory/CPU on huge runs. For example:
# In your API deployment/pod env vars
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000
- These map to ServerFlowRunGraphSettings; see:
- Settings reference
- ServerFlowRunGraphSettings
Note: With caps, the UI graph may be truncated rather than crashing the API — that’s usually a better UX than a server restart.
2) Temporarily avoid loading the Graph tab for that run
- Open the flow run page directly on the Timeline or Tasks tab instead of Graph to avoid triggering the heavy graph call. For example:
- Navigate to the run and change the tab query param to ?tab=timeline or click Timeline first before landing on Graph.
3) Increase API pod resources
- If you consistently need to visualize very large graphs, increase CPU/memory requests/limits for the API pod (e.g., 2–4 vCPU and 4–8Gi memory depending on your flow sizes).
- Verify that the pod is not throttled on CPU.
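If you want to confirm throttling, a couple of quick checks (pod/namespace are placeholders; the cgroup path below assumes cgroup v2, so treat it as a sketch):
# Live usage vs. requests/limits (requires metrics-server)
kubectl top pod <api-pod> -n <namespace>
# Throttle counters from inside the container; a growing nr_throttled means CPU throttling
# (on cgroup v1 nodes the file is /sys/fs/cgroup/cpu/cpu.stat instead)
kubectl exec <api-pod> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat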
4) Upgrade to a newer 3.4.x patch
- You’re on 3.4.5; there have been many performance and stability fixes since then. It’s worth upgrading to the latest 3.4.x available to you. Release notes:
- Prefect releases
Concrete debugging steps
- Turn up server logging to see what the API is doing when the graph loads:
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
- Reproduce the API call without the UI to isolate behavior. From a shell where the API is reachable:
# Replace <RUN_ID> and <API_URL>
curl -v "<API_URL>/api/flow_runs/<RUN_ID>/graph"
# or sometimes /graph_v2 depending on UI; check your server routes
Watch the API logs while running that request. If the pod dies, check:
- kubectl logs <api-pod> -n <namespace> --previous
- kubectl describe pod <api-pod> for OOMKilled or restart reason
- Postgres logs for long-running queries (optional)
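For the optional Postgres check, a query along these lines surfaces long-running statements (connection string is a placeholder; assumes you can reach the Prefect database with psql):
psql "$PREFECT_DB_URL" -c "
  SELECT pid, now() - query_start AS duration, state, left(query, 120) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY duration DESC
  LIMIT 10;"
Run it while the graph request is in flight to see whether time is spent in the database or in the API process.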
Kubernetes/Helm examples
- If you’re using the Helm chart, add env vars to the API container (values might look like):
server:
  api:
    extraEnv:
      - name: PREFECT_SERVER_LOGGING_LEVEL
        value: DEBUG
      - name: PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES
        value: "3000"
      - name: PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS
        value: "1000"
    resources:
      requests:
        cpu: "1000m"
        memory: "2Gi"
      limits:
        cpu: "4000m"
        memory: "8Gi"
Adjust to your environment.
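To roll those values out, assuming you installed with the prefecthq prefect-server chart (release/namespace names below are placeholders; confirm the exact value keys for your chart version first):
# Add/refresh the chart repo if needed
helm repo add prefect https://prefecthq.github.io/prefect-helm
helm repo update
# Inspect the chart's value keys before editing values.yaml
helm show values prefect/prefect-server | less
# Apply your updated values
helm upgrade prefect3-server prefect/prefect-server -n prefect -f values.yaml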
If this is only a single pathological run
- Consider whether that run created an unusually large number of artifacts. Lowering MAX_ARTIFACTS helps a lot if artifacts are the bulk.

Marvin
10/13/2025, 10:35 PM

Sophia Ponte
10/13/2025, 10:40 PM
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 15m (x18 over 161m) kubelet Container image "prefecthq/prefect:3.4.4-python3.11" already present on machine
Normal Created 15m (x18 over 161m) kubelet Created container: prefect-server
Normal Started 15m (x18 over 161m) kubelet Started container prefect-server
Warning BackOff 14m (x47 over 145m) kubelet Back-off restarting failed container prefect-server in pod prefect3-server-7b9cb995dc-nwqr8_prefect(40805481-d673-4a3d-b1b7-543c1125aa09)

Marvin
10/13/2025, 10:40 PM

Marvin
10/13/2025, 10:40 PM
1) Get logs from the previous (crashed) container
# Replace namespace/pod/container names as needed
kubectl logs prefect3-server-7b9cb995dc-nwqr8 -n prefect -c prefect-server --previous
- If that’s empty, try without --previous to see current logs while reproducing:
kubectl logs -f prefect3-server-7b9cb995dc-nwqr8 -n prefect -c prefect-server
- Also check events for probe failures:
kubectl describe pod prefect3-server-7b9cb995dc-nwqr8 -n prefect
Please paste any stack trace you see from those logs.
2) Cap graph size on the server (prevents pathological requests)
Set these env vars on the API container, then restart:
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000
This forces the server to truncate very large graphs instead of attempting to build/serialize an enormous payload that can destabilize the process.
- Docs: ServerFlowRunGraphSettings
- Settings reference: Settings reference
3) Turn up server logging to catch errors
Temporarily enable debug logs and (optionally) SQL echo:
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
PREFECT_SERVER_DATABASE_ECHO=true
Reproduce once, then turn these back down.
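One quick way to toggle these without editing manifests, assuming your Deployment is named after the pod (prefect3-server); note that a later helm upgrade may revert it:
# Enable debug logging + SQL echo (this triggers a rollout)
kubectl set env deployment/prefect3-server -n prefect PREFECT_SERVER_LOGGING_LEVEL=DEBUG PREFECT_SERVER_DATABASE_ECHO=true
# Turn them back off afterwards (a trailing '-' unsets the variable)
kubectl set env deployment/prefect3-server -n prefect PREFECT_SERVER_LOGGING_LEVEL- PREFECT_SERVER_DATABASE_ECHO-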
4) Tune or relax probes while testing
If liveness/readiness probes are too aggressive, they can restart the pod while it’s busy building the graph. Temporarily increase:
- initialDelaySeconds
- timeoutSeconds
- periodSeconds
- failureThreshold
Example (conceptually): failureThreshold 10, timeoutSeconds 5, periodSeconds 10. Reproduce the issue to confirm the server stays up long enough to emit logs.
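A relaxed liveness probe might look roughly like this; the health path and port are assumptions, so mirror whatever your Deployment or Helm values already probe:
livenessProbe:
  httpGet:
    path: /api/health
    port: 4200
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 10
Relax the readiness probe the same way if needed, and revert once you've captured logs.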
5) Avoid triggering the heavy endpoint while you investigate
- Open the flow run on the Timeline or Tasks tab first (avoid Graph). You can hit a direct URL that lands on Timeline (if your UI supports a tab query param) or just be careful not to click Graph for that run.
6) Scale and resource considerations
- Scale the API Deployment to 2+ replicas so a single heavy request doesn’t take down availability for everyone.
- Ensure you’ve given the API adequate CPU/memory and that it’s not CPU-throttled.
7) Version mismatch and upgrade
- You mentioned 3.4.5, but the pod shows image 3.4.4-python3.11. Please align to a single version and consider upgrading to the latest 3.4.x patch — there have been many stability/performance improvements since 3.4.4. Releases: Prefect releases
8) Reproduce the graph call outside the UI (optional, for isolation)
Run the graph endpoint manually and watch server logs:
# Replace <API_URL> and <RUN_ID>
curl -v "<API_URL>/api/flow_runs/<RUN_ID>/graph"
# Some versions may use /graph_v2 — check your server logs/routes
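If you need the placeholder values, both are easy to look up (CLI flags may vary slightly by version):
# API URL from your active profile
prefect config view --show-defaults | grep -i api_url
# Recent flow runs and their IDs (the run ID is also in the UI URL for that run)
prefect flow-run ls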
If you can share:
- The output of kubectl logs ... --previous from the server pod when it crashes
- Your current liveness/readiness probe settings
- API container resources/limits
- Approximate node count and whether the run created lots of artifacts
…I can recommend exact caps and probe/resource values.
In the meantime, adding these env vars and avoiding the Graph tab for that run should stop the crash loop:
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000
PREFECT_SERVER_LOGGING_LEVEL=DEBUG

Sophia Ponte
10/13/2025, 10:44 PM
22:42:27.471 | DEBUG | prefect.server.services.cancellationcleanup - About to run CancellationCleanup...
22:42:27.475 | DEBUG | prefect.server.services.marklateruns - About to run MarkLateRuns...
22:42:27.478 | DEBUG | prefect.server.services.failexpiredpauses - About to run FailExpiredPauses...
22:42:27.480 | DEBUG | prefect.server.services.recentdeploymentsscheduler - About to run RecentDeploymentsScheduler...
22:42:27.485 | DEBUG | prefect.server.services.proactivetriggers - About to run ProactiveTriggers...
22:42:27.486 | DEBUG | prefect.server.services.proactivetriggers - Finished running ProactiveTriggers. Next run at 2025-10-13 22:42:32.485266+00:00
22:42:27.551 | INFO | prefect.server.services.marklateruns - Finished monitoring for late runs.
22:42:27.552 | DEBUG | prefect.server.services.marklateruns - Finished running MarkLateRuns. Next run at 2025-10-13 22:42:32.475929+00:00
22:42:27.553 | INFO | prefect.server.services.failexpiredpauses - Finished monitoring for late runs.
22:42:27.554 | DEBUG | prefect.server.services.failexpiredpauses - Finished running FailExpiredPauses. Next run at 2025-10-13 22:42:32.478881+00:00
22:42:27.554 | INFO | prefect.server.services.recentdeploymentsscheduler - Scheduled 0 runs.
22:42:27.555 | DEBUG | prefect.server.services.recentdeploymentsscheduler - Finished running RecentDeploymentsScheduler. Next run at 2025-10-13 22:42:32.480833+00:00
22:42:28.253 | DEBUG | prefect.server.events.triggers - Running periodic evaluation as of 2025-10-13 22:42:28.253653+00:00 (offset 0.0s)
INFO: 10.70.235.123:53910 - "POST /api/work_pools/default_work_pool/get_scheduled_flow_runs HTTP/1.1" 200 OK
22:42:34.435 | DEBUG | prefect.server.services.marklateruns - About to run MarkLateRuns...
22:42:34.437 | DEBUG | prefect.server.services.failexpiredpauses - About to run FailExpiredPauses...
22:42:34.439 | DEBUG | prefect.server.services.recentdeploymentsscheduler - About to run RecentDeploymentsScheduler...
22:42:34.441 | DEBUG | prefect.server.services.proactivetriggers - About to run ProactiveTriggers...
22:42:34.442 | DEBUG | prefect.server.services.proactivetriggers - Finished running ProactiveTriggers. Next run at 2025-10-13 22:42:39.441709+00:00
22:42:34.443 | DEBUG | prefect.server.events.triggers - Running periodic evaluation as of 2025-10-13 22:42:34.443340+00:00 (offset 0.0s)

Marvin
10/13/2025, 10:44 PM

Marvin
10/13/2025, 10:45 PM
1) Check how the container last terminated
kubectl get pod prefect3-server-7b9cb995dc-nwqr8 -n prefect -o jsonpath='{.status.containerStatuses[0]}'
Look for lastState.terminated.reason, exitCode, signal. Please paste that.
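If the full containerStatuses blob is noisy, narrower jsonpath expressions pull out just those fields:
kubectl get pod prefect3-server-7b9cb995dc-nwqr8 -n prefect \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
kubectl get pod prefect3-server-7b9cb995dc-nwqr8 -n prefect \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'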
2) Check for probe-related restarts
- Describe the pod and look for liveness/readiness failures (timeouts, HTTP 5xx):
kubectl describe pod prefect3-server-7b9cb995dc-nwqr8 -n prefect
If you see probe failures around the time you open the Graph tab, the pod is likely being restarted by probes. Temporarily relax probes while debugging:
- Increase initialDelaySeconds, timeoutSeconds, periodSeconds, and failureThreshold (e.g., timeoutSeconds: 10, periodSeconds: 10, failureThreshold: 6)
- Re-test loading the graph and watch logs
3) Add hard caps to the graph payload (prevents pathological requests)
Set on the API container and restart:
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000
These map to ServerFlowRunGraphSettings and will truncate very large graphs instead of attempting to serialize everything.
- Docs: ServerFlowRunGraphSettings
- Settings ref: Settings reference
4) Turn on more diagnostics
- Enable debug logs and faulthandler:
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
PYTHONFAULTHANDLER=1
- If you can, increase uvicorn log verbosity (env or command flags) to see access/probe logs.
5) Reproduce the graph call directly and watch logs
- Trigger the endpoint without the UI:
curl -v "<API_URL>/api/flow_runs/<RUN_ID>/graph"
# Some versions may use /graph_v2; check your server routes/logs
- Watch server logs in another terminal:
kubectl logs -f prefect3-server-7b9cb995dc-nwqr8 -n prefect -c prefect-server
If it dies again, grab previous logs:
kubectl logs prefect3-server-7b9cb995dc-nwqr8 -n prefect -c prefect-server --previous
6) Verify resources and node pressure
- Check container resource limits and whether it’s CPU-throttled
- If you have metrics-server: kubectl top pod <pod> -n prefect during reproduction
- If you have access to the node, check kernel logs for OOM killer around the crash window (even if K8s doesn’t show OOMKilled):
- EKS: journalctl -k | grep -i kill on the node or view node events via CloudWatch, if configured
7) Run multiple API replicas and avoid the Graph tab for that run
- Scale the API Deployment to 2+ replicas to improve availability
- While investigating, avoid the Graph tab on that specific run; use Timeline or Tasks
8) Align and upgrade your Prefect version
- Your pod shows prefecthq/prefect:3.4.4-python3.11 while you mentioned 3.4.5. Please align the image tag and consider upgrading to the latest 3.4.x patch — there have been stability/perf fixes since 3.4.4
- Releases: Prefect releases
If you can share:
- The containerStatuses.lastState.terminated details
- Any probe configuration from your Deployment/Helm values
- API container resources/limits
- Approximate node count/artifact count for the run
…I can suggest exact values for the graph caps, probe settings, and resource sizing. Setting the two graph caps and relaxing probes usually prevents the crash loop so you can continue operating while we pinpoint the root cause.

Sophia Ponte
10/13/2025, 10:59 PM
containerStatuses.lastState.terminated reason is OOMKilled.
I'm currently using t3.2xlarge nodes on AWS with diskSize=50. My graph has around 3000 nodes. Do I need to update these specs?

Marvin
10/13/2025, 10:59 PM

Marvin
10/13/2025, 11:00 PM
1) Raise the API pod's memory limit
- Check what the container currently has:
kubectl describe pod <api-pod> -n <ns> | grep -A3 -i limits
- If you see something like 512Mi/1Gi, raise it. Good starting point:
- requests: 1–2Gi
- limits: 4–8Gi
Helm example:
server:
  api:
    resources:
      requests:
        cpu: "1000m"
        memory: "2Gi"
      limits:
        cpu: "4000m"
        memory: "6Gi"
YAML snippet for a Deployment:
resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"
  limits:
    cpu: "4000m"
    memory: "6Gi"
Note: You don’t need bigger nodes unless you can’t fit these limits on your current nodes. The OOM is at the pod cgroup limit, not the node total.
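If you'd rather not edit manifests by hand, something like this applies the same sizing in place (deployment/container names are assumptions based on your pod name; Helm may overwrite it on the next upgrade):
kubectl -n prefect set resources deployment/prefect3-server -c prefect-server \
  --requests=cpu=1000m,memory=2Gi --limits=cpu=4000m,memory=6Gi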
2) Add server-side caps to prevent pathological graphs
Even with more memory, guardrails help a lot:
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=100
- If artifacts are heavy for your runs, lower MAX_ARTIFACTS further (e.g., 50). This setting alone can dramatically reduce memory usage.
- Docs:
- ServerFlowRunGraphSettings
- Settings reference
3) Optional but helpful while testing
- Avoid opening the Graph tab for that large run until the above changes are deployed; use Timeline or Tasks.
- Temporarily scale API replicas to 2+ so one heavy request doesn’t impact availability:
kubectl scale deploy <your-api-deploy> -n <ns> --replicas=2
- Turn up logging briefly:
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
PYTHONFAULTHANDLER=1
4) Align and upgrade your Prefect version
Your pod shows prefecthq/prefect:3.4.4-python3.11 while you mentioned 3.4.5. Align the image tag and consider upgrading to the latest 3.4.x patch; there have been stability/performance improvements in server endpoints since 3.4.4:
- Prefect releases
Why this happens
- The Graph endpoint queries all task runs + relationships (+ artifacts) and builds a large JSON. With thousands of nodes or lots of artifacts, Python memory spikes during query aggregation and serialization. If the container limit is low, the kernel OOM killer terminates the process (OOMKilled) without a Python traceback.
If you can share:
- Your current API pod resources/limits
- Whether the run produces lots of artifacts
…I can recommend exact values for limits and the graph caps. In many cases, bumping the API memory limit to 4–6Gi and setting PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS to a small number resolves the OOMs for ~3k-node graphs.

Sophia Ponte
10/14/2025, 12:39 AM

Marvin
10/14/2025, 12:39 AM

Marvin
10/14/2025, 12:40 AM
- Keep the server-side graph caps in place:
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=100
This keeps pathological graphs from spiking memory.
- Settings details: ServerFlowRunGraphSettings
- Settings ref: Settings reference
- Run 2+ API replicas so one heavy request doesn’t impact availability.
- Consider an HPA on memory to autoscale the API when users open big graphs (see the sketch after this list).
- If you create many artifacts per task, lowering MAX_ARTIFACTS helps a lot.
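For the HPA idea above, a minimal sketch might look like this (Deployment name, namespace, replica bounds, and the 75% target are assumptions to adapt):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefect-server
  namespace: prefect
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefect3-server
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
Keep in mind memory-based scaling only reacts after usage climbs, so the graph caps remain the primary guardrail.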
If you want, share your current API limits/requests and I can suggest “safe” defaults for your workload size.