# ask-marvin
s
@Marvin I have a self-hosted Prefect server on EKS (version 3.4.5) and for some reason I am unable to load the graph for a specific flow run. It's a large graph, and whenever I try loading the graph it crashes my Prefect server. I have tried updating the DATABASE timeouts, but that has not seemed to help. What can I do to debug?
m
thought for 412.7 seconds
Thanks for the details — this is a fairly common pain point with very large flow run graphs. What’s likely happening is that the API endpoint that builds the graph payload for the UI is doing a lot of work (pulling all task runs + relationships + artifacts, serializing a huge JSON), and the API pod runs out of memory or CPU and crashes. Increasing DB timeouts alone won’t help much because the bottleneck is often CPU/memory during graph construction and serialization, not the DB.

A few targeted ways to debug and mitigate:

Ask/confirm a few details
- Do you see OOMKilled events on the API pod when you open the Graph tab? Check with `kubectl describe pod` and `kubectl get events`.
- Roughly how many task runs/nodes are in this flow run?
- What resources/limits are set on your API pod in EKS (CPU/memory)?
- Any stack traces or errors from the API logs when you load the graph?

Immediate mitigations

1) Reduce the graph size the server will try to build
- Prefect 3.x exposes server-side caps for the graph:
  - `PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES` (default 10000)
  - `PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS` (default 10000)
- Lower these significantly to protect the API from excessive memory/CPU on huge runs. For example:
Copy code
# In your API deployment/pod env vars
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000
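If you manage the server with plain manifests, one quick way to trial these caps without a Helm change is a sketch like the following; the Deployment name and namespace are placeholders for your install:
Copy code
# Set the graph caps directly on the server Deployment (this triggers a rollout)
kubectl set env deployment/<server-deployment> -n <namespace> \
  PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000 \
  PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000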
- These map to ServerFlowRunGraphSettings; see:
  - Settings reference
  - ServerFlowRunGraphSettings
Note: With caps, the UI graph may be truncated rather than crashing the API — that’s usually a better UX than a server restart.

2) Temporarily avoid loading the Graph tab for that run
- Open the flow run page directly on the Timeline or Tasks tab instead of Graph to avoid triggering the heavy graph call. For example, navigate to the run and change the tab query param to `?tab=timeline`, or click Timeline first before landing on Graph.

3) Increase API pod resources
- If you consistently need to visualize very large graphs, increase CPU/memory requests/limits for the API pod (e.g., 2–4 vCPU and 4–8Gi memory depending on your flow sizes).
- Verify that the pod is not throttled on CPU.

4) Upgrade to a newer 3.4.x patch
- You’re on 3.4.5; there have been many performance and stability fixes since then. It’s worth upgrading to the latest 3.4.x available to you. Release notes: Prefect releases

Concrete debugging steps
- Turn up server logging to see what the API is doing when the graph loads:
Copy code
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
- Reproduce the API call without the UI to isolate behavior. From a shell where the API is reachable:
Copy code
# Replace <RUN_ID> and <API_URL>
curl -v "<API_URL>/api/flow_runs/<RUN_ID>/graph"
# or sometimes /graph_v2 depending on UI; check your server routes
Watch the API logs while running that request. If the pod dies, check:
- `kubectl logs <api-pod> -n <namespace> --previous`
- `kubectl describe pod <api-pod>` for OOMKilled or a restart reason
- Postgres logs for long-running queries (optional)

Kubernetes/Helm examples
- If you’re using the Helm chart, add env vars to the API container (values might look like):
Copy code
server:
  api:
    extraEnv:
      - name: PREFECT_SERVER_LOGGING_LEVEL
        value: DEBUG
      - name: PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES
        value: "3000"
      - name: PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS
        value: "1000"
  resources:
    requests:
      cpu: "1000m"
      memory: "2Gi"
    limits:
      cpu: "4000m"
      memory: "8Gi"
Adjust to your environment.

If this is only a single pathological run
- Consider whether that run created an unusually large number of artifacts. Lowering `MAX_ARTIFACTS` helps a lot if artifacts are the bulk.
- As a workaround for investigation, avoid opening Graph for that run and instead use the Tasks or Timeline tabs to inspect it.

References
- Server flow run graph settings: ServerFlowRunGraphSettings
- Settings system reference: Settings reference
- Self-hosted guidance: Self-hosted overview

If you can share:
- API pod logs around the crash
- The run’s approximate node count
- Your current API pod resource limits
…I can help tailor exact values (caps and resources) and confirm whether we’re hitting an OOM, CPU throttle, or something else.
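Related to the settings references above: to confirm which values the running server has picked up, one option is to print the effective settings from inside the server container. A sketch, assuming the official prefecthq/prefect image (which ships the `prefect` CLI) and placeholder Deployment/namespace names; `--show-defaults` prints default values alongside any overrides in recent 3.x CLIs:
Copy code
# Show the effective flow run graph settings inside the server container
kubectl exec deploy/<server-deployment> -n <namespace> -- \
  prefect config view --show-defaults | grep -i flow_run_graph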
s
@Marvin I don't see any OOM errors. This is what I see when I describe the server pod:
Copy code
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Pulled   15m (x18 over 161m)  kubelet  Container image "prefecthq/prefect:3.4.4-python3.11" already present on machine
  Normal   Created  15m (x18 over 161m)  kubelet  Created container: prefect-server
  Normal   Started  15m (x18 over 161m)  kubelet  Started container prefect-server
  Warning  BackOff  14m (x47 over 145m)  kubelet  Back-off restarting failed container prefect-server in pod prefect3-server-7b9cb995dc-nwqr8_prefect(40805481-d673-4a3d-b1b7-543c1125aa09)
m
thought for 31.8 seconds
Thanks — that BackOff suggests the server container is crashing/restarting, not being OOMKilled. Given it happens when you load the Graph tab for a very large run, the most likely causes are:
- The graph endpoint is doing heavy work and the process exits on an unhandled exception, or
- Health probes are failing while the server is busy building/serializing the graph payload, causing Kubernetes to restart the container.

Here’s a focused plan to debug and mitigate:

1) Get the crash reason from logs
- Grab logs from the previously crashed container instance (this often shows the exception that caused the exit):
Copy code
# Replace namespace/pod/container names as needed
kubectl logs prefect3-server-7b9cb995dc-nwqr8 -n prefect -c prefect-server --previous
- If that’s empty, try without --previous to see current logs while reproducing:
Copy code
kubectl logs -f prefect3-server-7b9cb995dc-nwqr8 -n prefect -c prefect-server
- Also check events for probe failures:
Copy code
kubectl describe pod prefect3-server-7b9cb995dc-nwqr8 -n prefect
Please paste any stack trace you see from those logs.

2) Cap graph size on the server (prevents pathological requests)
Set these env vars on the API container, then restart:
Copy code
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000
This forces the server to truncate very large graphs instead of attempting to build/serialize an enormous payload that can destabilize the process.
- Docs: ServerFlowRunGraphSettings
- Settings reference: Settings reference

3) Turn up server logging to catch errors
Temporarily enable debug logs and (optionally) SQL echo:
Copy code
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
PREFECT_SERVER_DATABASE_ECHO=true
Reproduce once, then turn these back down.

4) Tune or relax probes while testing
If liveness/readiness probes are too aggressive, they can restart the pod while it’s busy building the graph. Temporarily increase:
- initialDelaySeconds
- timeoutSeconds
- periodSeconds
- failureThreshold
Example (conceptually): failureThreshold 10, timeoutSeconds 5, periodSeconds 10. Reproduce the issue to confirm the server stays up long enough to emit logs. (There is a probe sketch at the end of this reply.)

5) Avoid triggering the heavy endpoint while you investigate
- Open the flow run on the Timeline or Tasks tab first (avoid Graph). You can hit a direct URL that lands on Timeline (if your UI supports a tab query param) or just be careful not to click Graph for that run.

6) Scale and resource considerations
- Scale the API Deployment to 2+ replicas so a single heavy request doesn’t take down availability for everyone.
- Ensure you’ve given the API adequate CPU/memory and that it’s not CPU-throttled.

7) Version mismatch and upgrade
- You mentioned 3.4.5, but the pod shows image 3.4.4-python3.11. Please align to a single version and consider upgrading to the latest 3.4.x patch — there have been many stability/performance improvements since 3.4.4. Releases: Prefect releases

8) Reproduce the graph call outside the UI (optional, for isolation)
Run the graph endpoint manually and watch server logs:
Copy code
# Replace <API_URL> and <RUN_ID>
curl -v "<API_URL>/api/flow_runs/<RUN_ID>/graph"
# Some versions may use /graph_v2 — check your server logs/routes
If you can share:
- The output of `kubectl logs ... --previous` from the server pod when it crashes
- Your current liveness/readiness probe settings
- API container resources/limits
- Approximate node count and whether the run created lots of artifacts
…I can recommend exact caps and probe/resource values.

In the meantime, adding these env vars and avoiding the Graph tab for that run should stop the crash loop:
Copy code
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
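For point 4 above, relaxed probes on the server container might look roughly like the sketch below. The values are illustrative, not prescriptive; `/api/health` and port 4200 are the usual Prefect server defaults, but match whatever paths/ports your chart or Deployment already uses:
Copy code
# Illustrative probe settings for the prefect-server container
livenessProbe:
  httpGet:
    path: /api/health   # Prefect server health endpoint (verify against your current spec)
    port: 4200          # default server port (verify against your current spec)
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /api/health
    port: 4200
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 6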
s
@Marvin There are no errors in the server pod. However, the pod just dies whenever I try to load the graph (without any noticeable errors). Here are the last few logs before it crashes:
Copy code
22:42:27.471 | DEBUG   | prefect.server.services.cancellationcleanup - About to run CancellationCleanup...
22:42:27.475 | DEBUG   | prefect.server.services.marklateruns - About to run MarkLateRuns...
22:42:27.478 | DEBUG   | prefect.server.services.failexpiredpauses - About to run FailExpiredPauses...
22:42:27.480 | DEBUG   | prefect.server.services.recentdeploymentsscheduler - About to run RecentDeploymentsScheduler...
22:42:27.485 | DEBUG   | prefect.server.services.proactivetriggers - About to run ProactiveTriggers...
22:42:27.486 | DEBUG   | prefect.server.services.proactivetriggers - Finished running ProactiveTriggers. Next run at 2025-10-13 22:42:32.485266+00:00
22:42:27.551 | INFO    | prefect.server.services.marklateruns - Finished monitoring for late runs.
22:42:27.552 | DEBUG   | prefect.server.services.marklateruns - Finished running MarkLateRuns. Next run at 2025-10-13 22:42:32.475929+00:00
22:42:27.553 | INFO    | prefect.server.services.failexpiredpauses - Finished monitoring for late runs.
22:42:27.554 | DEBUG   | prefect.server.services.failexpiredpauses - Finished running FailExpiredPauses. Next run at 2025-10-13 22:42:32.478881+00:00
22:42:27.554 | INFO    | prefect.server.services.recentdeploymentsscheduler - Scheduled 0 runs.
22:42:27.555 | DEBUG   | prefect.server.services.recentdeploymentsscheduler - Finished running RecentDeploymentsScheduler. Next run at 2025-10-13 22:42:32.480833+00:00
22:42:28.253 | DEBUG   | prefect.server.events.triggers - Running periodic evaluation as of 2025-10-13 22:42:28.253653+00:00 (offset 0.0s)
INFO:     10.70.235.123:53910 - "POST /api/work_pools/default_work_pool/get_scheduled_flow_runs HTTP/1.1" 200 OK
22:42:34.435 | DEBUG   | prefect.server.services.marklateruns - About to run MarkLateRuns...
22:42:34.437 | DEBUG   | prefect.server.services.failexpiredpauses - About to run FailExpiredPauses...
22:42:34.439 | DEBUG   | prefect.server.services.recentdeploymentsscheduler - About to run RecentDeploymentsScheduler...
22:42:34.441 | DEBUG   | prefect.server.services.proactivetriggers - About to run ProactiveTriggers...
22:42:34.442 | DEBUG   | prefect.server.services.proactivetriggers - Finished running ProactiveTriggers. Next run at 2025-10-13 22:42:39.441709+00:00
22:42:34.443 | DEBUG   | prefect.server.events.triggers - Running periodic evaluation as of 2025-10-13 22:42:34.443340+00:00 (offset 0.0s)
m
thought for 56.4 seconds
Thanks — the absence of any Python traceback plus an immediate restart strongly suggests one of these:
- The process is being killed externally (e.g., liveness/readiness probe timeouts or node-level pressure)
- A hard crash (signal 9/11) without Python handling
- The server is doing so much work building the graph response that it stops answering probes and gets restarted

Let’s narrow it down and add guardrails:

1) Confirm why the container exited
- Get the last termination reason/exit code/signal:
Copy code
kubectl get pod prefect3-server-7b9cb995dc-nwqr8 -n prefect -o jsonpath='{.status.containerStatuses[0]}'
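Or, to print just the termination record for the same pod:
Copy code
# Narrower query: only the last termination details (reason, exitCode, signal)
kubectl get pod prefect3-server-7b9cb995dc-nwqr8 -n prefect \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'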
Look for `lastState.terminated.reason`, `exitCode`, and `signal`. Please paste that.

2) Check for probe-related restarts
- Describe the pod and look for liveness/readiness failures (timeouts, HTTP 5xx):
Copy code
kubectl describe pod prefect3-server-7b9cb995dc-nwqr8 -n prefect
If you see probe failures around the time you open the Graph tab, the pod is likely being restarted by probes. Temporarily relax probes while debugging:
- Increase `initialDelaySeconds`, `timeoutSeconds`, `periodSeconds`, and `failureThreshold` (e.g., timeoutSeconds: 10, periodSeconds: 10, failureThreshold: 6)
- Re-test loading the graph and watch logs

3) Add hard caps to the graph payload (prevents pathological requests)
Set on the API container and restart:
Copy code
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=1000
These map to ServerFlowRunGraphSettings and will truncate very large graphs instead of attempting to serialize everything.
- Docs: ServerFlowRunGraphSettings
- Settings ref: Settings reference

4) Turn on more diagnostics
- Enable debug logs and faulthandler:
Copy code
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
PYTHONFAULTHANDLER=1
- If you can, increase uvicorn log verbosity (env or command flags) to see access/probe logs.

5) Reproduce the graph call directly and watch logs
- Trigger the endpoint without the UI:
Copy code
curl -v "<API_URL>/api/flow_runs/<RUN_ID>/graph"
# Some versions may use /graph_v2; check your server routes/logs
- Watch server logs in another terminal:
Copy code
kubectl logs -f prefect3-server-7b9cb995dc-nwqr8 -n prefect -c prefect-server
If it dies again, grab previous logs:
Copy code
kubectl logs prefect3-server-7b9cb995dc-nwqr8 -n prefect -c prefect-server --previous
6) Verify resources and node pressure
- Check container resource limits and whether the pod is CPU-throttled
- If you have metrics-server: run `kubectl top pod <pod> -n prefect` during reproduction
- If you have access to the node, check kernel logs for the OOM killer around the crash window (even if K8s doesn’t show OOMKilled). On EKS: `journalctl -k | grep -i kill` on the node, or view node events via CloudWatch, if configured.

7) Run multiple API replicas and avoid the Graph tab for that run
- Scale the API Deployment to 2+ replicas to improve availability
- While investigating, avoid the Graph tab on that specific run; use Timeline or Tasks

8) Align and upgrade your Prefect version
- Your pod shows `prefecthq/prefect:3.4.4-python3.11` while you mentioned 3.4.5. Please align the image tag and consider upgrading to the latest 3.4.x patch — there have been stability/perf fixes since 3.4.4.
- Releases: Prefect releases

If you can share:
- The `containerStatuses.lastState.terminated` details
- Any probe configuration from your Deployment/Helm values
- API container resources/limits
- Approximate node count/artifact count for the run
…I can suggest exact values for the graph caps, probe settings, and resource sizing. Setting the two graph caps and relaxing probes usually prevents the crash loop so you can continue operating while we pinpoint the root cause.
s
@Marvin the `containerStatuses.lastState.terminated` reason is `OOMKilled`. I'm currently using `t3.2xlarge` nodes on AWS with diskSize=50. My graph has around 3000 nodes. Do I need to update these specs?
m
thought for 20.4 seconds
Thanks — OOMKilled confirms the container ran out of memory when the Graph endpoint was hit. This is almost always the API pod’s memory limit being too low for building/serializing a large graph payload, not the node’s total memory.

Given:
- Node type t3.2xlarge (32 GiB) is fine
- Graph ~3,000 nodes
…it’s very likely your API container’s memory limit is the bottleneck.

What to do next (practical order)

1) Increase the API container memory limit
- Check your current requests/limits:
Copy code
kubectl describe pod <api-pod> -n <ns> | grep -A3 -i limits
- If you see something like 512Mi/1Gi, raise it. Good starting point:
  - requests: 1–2Gi
  - limits: 4–8Gi

Helm example:
Copy code
server:
  api:
    resources:
      requests:
        cpu: "1000m"
        memory: "2Gi"
      limits:
        cpu: "4000m"
        memory: "6Gi"
YAML snippet for a Deployment:
Copy code
resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"
  limits:
    cpu: "4000m"
    memory: "6Gi"
Note: You don’t need bigger nodes unless you can’t fit these limits on your current nodes. The OOM is at the pod cgroup limit, not the node total.

2) Add server-side caps to prevent pathological graphs
Even with more memory, guardrails help a lot:
Copy code
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=100
- If artifacts are heavy for your runs, lower `MAX_ARTIFACTS` further (e.g., 50). This setting alone can dramatically reduce memory usage.
- Docs:
  - ServerFlowRunGraphSettings
  - Settings reference

3) Optional but helpful while testing
- Avoid opening the Graph tab for that large run until the above changes are deployed; use Timeline or Tasks.
- Temporarily scale API replicas to 2+ so one heavy request doesn’t impact availability:
Copy code
kubectl scale deploy <your-api-deploy> -n <ns> --replicas=2
- Turn up logging briefly:
Copy code
PREFECT_SERVER_LOGGING_LEVEL=DEBUG
PYTHONFAULTHANDLER=1
4) Align and upgrade your Prefect version
Your pod shows `prefecthq/prefect:3.4.4-python3.11` while you mentioned 3.4.5. Align the image tag and consider upgrading to the latest 3.4.x patch; there have been stability/performance improvements in server endpoints since 3.4.4:
- Prefect releases

Why this happens
- The Graph endpoint queries all task runs + relationships (+ artifacts) and builds a large JSON. With thousands of nodes or lots of artifacts, Python memory spikes during query aggregation and serialization. If the container limit is low, the kernel OOM killer terminates the process (OOMKilled) without a Python traceback.

If you can share:
- Your current API pod resources/limits
- Whether the run produces lots of artifacts
…I can recommend exact values for limits and the graph caps. In many cases, bumping the API memory limit to 4–6Gi and setting `PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS` to a small number resolves the OOMs for ~3k-node graphs.
s
Increasing the limit and request size worked! thanks @Marvin
m
thought for 11.3 seconds
Awesome — glad that did it! A few quick tips to prevent regressions:
- Add guardrails: set
Copy code
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_NODES=3000
PREFECT_SERVER_FLOW_RUN_GRAPH_MAX_ARTIFACTS=100
This keeps pathological graphs from spiking memory.
- Settings details: ServerFlowRunGraphSettings
- Settings ref: Settings reference
- Run 2+ API replicas so one heavy request doesn’t impact availability.
- Consider an HPA on memory to autoscale the API when users open big graphs (see the sketch below).
- If you create many artifacts per task, lowering `MAX_ARTIFACTS` helps a lot.

If you want, share your current API limits/requests and I can suggest “safe” defaults for your workload size.
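For the HPA idea, a memory-based autoscaler might look roughly like the sketch below. The Deployment name, namespace, replica counts, and target are placeholders to adapt; utilization-based scaling also requires memory requests to be set on the server container:
Copy code
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefect-server          # placeholder: match your server Deployment
  namespace: prefect
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prefect-server        # placeholder: match your server Deployment
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75   # scale out when average memory passes ~75% of requests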