Ishan Anilbhai Koradiya (09/14/2025, 2:09 PM)
Marvin (09/14/2025, 2:09 PM)
Marvin (09/14/2025, 2:12 PM)
- If the process dies abruptly (e.g., os._exit, a segfault in a C-extension, an abrupt multiprocessing exit), Prefect can’t capture the exception to mark Failed — it only sees the heartbeat stop and marks Crashed.
- This often shows up as: some error appears in stdout/stderr, then the worker/pod disappears or restarts.
Quick triage questions
- What worker/infrastructure are you using: Process, Docker, or Kubernetes?
- Do you see a consistent pattern (e.g., after N minutes, on large inputs, at a specific task)?
- For one example run, can you share:
- The run ID
- The worker logs around the time it ends
- Any exit code or termination reason (e.g., 137/OOMKilled)
What to check next (by environment)
- Kubernetes
- kubectl describe pod <pod> and kubectl logs <pod> (and -p for previous logs if it restarted)
- Look for Reason “OOMKilled” or exit code 137/143. If OOM, raise memory limits/requests or reduce memory usage.
- Docker
- docker ps -a and docker inspect <container> to get the exit code; docker logs <container> for container logs.
- Exit code 137 typically means OOMKilled; increase container memory or reduce workload.
- Process worker (bare metal/VM)
- Check system logs (dmesg, syslog) for OOM kills or signals; ensure nothing external is sending SIGKILL/SIGTERM.
- If you’re spawning subprocesses or using libraries that call os._exit, that can cause crashes.
Common causes and fixes
- Out-of-memory (most common): manifests as OOMKilled/exit 137. Fix by increasing memory (K8s/Docker limits) or reducing memory footprint.
- Hard process termination: kill -9, node reboots, preemptible spot instances, container restarts. Ensure infrastructure stability and graceful shutdowns.
- Abrupt exits in code:
- Avoid os._exit(...) and be careful with sys.exit(...) outside the Prefect-managed flow context.
- If using multiprocessing, prefer the spawn start method and ensure child processes don’t hard-exit the parent.
- Native/C-extension segfaults (e.g., some pandas/numpy/TensorFlow ops) will crash the interpreter; upgrade the library or isolate that work in a separate process behind more memory/limits (see the sketch after this list).
- Heartbeat loss: if the worker loses connectivity or the runtime stops heartbeating, Prefect will classify as Crashed. Check worker logs for connectivity/errors.
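One pattern that helps with the last few items: run the crash-prone work in a child process and turn a bad exit code into a normal exception, so the flow run ends Failed rather than Crashed. A minimal sketch (risky_job.py is a hypothetical stand-in for whatever script wraps the segfault-prone or hard-exiting work):
import subprocess
from prefect import flow, get_run_logger

@flow
def isolate_risky_work():
    logger = get_run_logger()
    # Run the crash-prone work in its own process; if it dies
    # (segfault, os._exit, OOM kill), only the child process is lost.
    result = subprocess.run(
        ["python", "risky_job.py"],  # hypothetical script wrapping the risky work
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        logger.error("Child exited with code %s: %s", result.returncode, result.stderr)
        # A normal exception lets Prefect mark the run Failed instead of Crashed
        raise RuntimeError(f"risky_job.py exited with code {result.returncode}")
    return result.stdout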
How to verify behavior locally
- This should end as Failed:
from prefect import flow, get_run_logger

@flow
def will_fail():
    logger = get_run_logger()
    logger.info("About to raise an exception")
    raise RuntimeError("Intentional failure")

will_fail()
- This will simulate a crash (SIGKILL), likely ending as Crashed:
from prefect import flow
import os, signal

@flow
def will_crash():
    os.kill(os.getpid(), signal.SIGKILL)

will_crash()
Getting “Failed” instead of “Crashed”
- Ensure failures are normal Python exceptions inside the flow/task code so Prefect can capture them.
- Avoid hard exits and manage memory to prevent OOM.
- If you must use subprocesses, ensure they report errors back instead of terminating the parent.
Optionally, auto-retry on Crash
Marvin (09/14/2025, 2:12 PM)
Ishan Anilbhai Koradiya (09/14/2025, 2:22 PM)
Marvin (09/14/2025, 2:22 PM)
Marvin (09/14/2025, 2:22 PM)
Ishan Anilbhai Koradiya (09/14/2025, 2:44 PM)
Marvin (09/14/2025, 2:44 PM)
Marvin (09/14/2025, 2:49 PM)
- How many locks are held, by mode
select l.mode, count(*)
from pg_locks l
join pg_database d on d.oid = l.database
where d.datname = current_database()
group by l.mode
order by count desc;
- Which queries are holding AccessShareLock the longest
select a.pid,
a.application_name,
a.usename,
a.state,
a.query_start,
now() - a.query_start as age,
a.query
from pg_locks l
join pg_stat_activity a on l.pid = a.pid
where l.mode = 'AccessShareLock'
and a.datname = current_database()
order by age desc
limit 50;
- See blocked vs blocking (if anything is actually waiting)
select bl.pid as blocked_pid,
ba.query as blocked_query,
now() - ba.query_start as blocked_age,
ka.pid as blocking_pid,
ka.query as blocking_query,
now() - ka.query_start as blocking_age,
kl.mode as blocking_lock_mode
from pg_locks bl
join pg_stat_activity ba on ba.pid = bl.pid
join pg_locks kl on kl.locktype = bl.locktype
and coalesce(kl.database,0) = coalesce(bl.database,0)
and coalesce(kl.relation,0) = coalesce(bl.relation,0)
and coalesce(kl.page,0) = coalesce(bl.page,0)
and coalesce(kl.tuple,0) = coalesce(bl.tuple,0)
and coalesce(kl.virtualxid,'') = coalesce(bl.virtualxid,'')
and coalesce(kl.transactionid,0) = coalesce(bl.transactionid,0)
and coalesce(kl.classid,0) = coalesce(bl.classid,0)
and coalesce(kl.objid,0) = coalesce(bl.objid,0)
and coalesce(kl.objsubid,0) = coalesce(bl.objsubid,0)
and kl.pid != bl.pid
join pg_stat_activity ka on ka.pid = kl.pid
where not bl.granted;
If you can paste a few rows from these, I can help pinpoint which Prefect endpoints/queries are hottest.
Why Prefect might be generating many reads
- Flow/task run listing, details pages, and the UI auto-refresh
- Worker polling for runs and heartbeat updates
- Log ingestion and log reads
- Automations/concurrency checks and schedules
Docs that may help
- Self-hosting Prefect (deployment + DB notes)
- Database configuration API (pool size, timeouts, etc.)
- Prefect blog: Postgres partitioning
Mitigations and tuning
- Verify it’s not DDL: Avoid running migrations/DDL (which need AccessExclusiveLock) during peak traffic — that will queue behind all AccessShareLocks and cause blocking.
- Shorten long reads:
- Ensure autovacuum is healthy to avoid bloat that makes SELECTs slow.
- Keep the DB on sufficient hardware (CPU/IO), and watch for high sequential scans or slow index usage.
- If you have very large logs/events history, consider archiving/partitioning to keep hot tables small.
- Prefect server DB pool limits:
- Prefect’s async Postgres config supports pool and timeout options like sqlalchemy_pool_size, sqlalchemy_max_overflow, and timeouts.
- If you’ve scaled API replicas up, make sure the aggregate connections aren’t overwhelming Postgres; capping pool size or placing PgBouncer (transaction pooling) in front can help smooth spikes.
Marvin (09/14/2025, 2:49 PM)
- Set a statement_timeout for read queries to prevent runaway scans.
- Reduce log write/read pressure if possible (lower log level, batch/forward fewer logs to API) — see logging handlers: Logging to API
- Correlate with “Crashed” timing:
- If API latency/DB waits spike around the times your runs “crash”, that’s likely heartbeats timing out due to backpressure. Reducing contention should reduce Crashed classifications.
A few clarifying questions to tailor the next steps
- Are you self-hosting Prefect Server with Postgres? What Postgres version and instance size?
- Roughly how many API replicas do you run, and do you have PgBouncer in front?
- Do you see blocked queries (not just many AccessShareLocks), or API 5xx/slow responses in server logs?
- About how large are your flow_runs, task_runs, and logs tables?
If you can share the outputs of the SQL above plus your API/DB topology, I’ll suggest concrete tuning (pool sizes, PgBouncer settings, and any Prefect-side configuration) to bring lock pressure down and stabilize your runs.
Ishan Anilbhai Koradiya (09/14/2025, 2:59 PM)
Marvin (09/14/2025, 3:00 PM)
Marvin (09/14/2025, 3:01 PM)
Ishan Anilbhai Koradiya (09/14/2025, 3:07 PM)
Marvin (09/14/2025, 3:07 PM)
Marvin (09/14/2025, 3:09 PM)
- Shard work across queues (one worker per queue). Create the queues:
prefect work-queue create <name>
prefect work-queue ls
- Start workers pointing at distinct queues:
prefect worker start -p <pool> -q <queue-A>
prefect worker start -p <pool> -q <queue-B>
...
Reference: Work pools and queues
- Stagger polling if you must share a queue
- If you can’t shard right now, stagger worker polling to reduce simultaneous row updates. Set slightly different query intervals per worker:
# worker 1
export PREFECT_WORKER_QUERY_SECONDS=7
prefect worker start -p <pool> -q <queue>
# worker 2
export PREFECT_WORKER_QUERY_SECONDS=11
prefect worker start -p <pool> -q <queue>
# worker 3
export PREFECT_WORKER_QUERY_SECONDS=13
prefect worker start -p <pool> -q <queue>
# worker 4
export PREFECT_WORKER_QUERY_SECONDS=17
prefect worker start -p <pool> -q <queue>
Worker settings reference (heartbeat/query/prefetch): Settings reference
- Upgrade Prefect to the latest 3.x
- You’re on 3.1.15; there have been many orchestration and DB-path improvements since. Upgrading can reduce contention and improve error handling.
- Releases: Prefect releases
- DB observability and guardrails
- Enable Postgres log_lock_waits = on and set a low deadlock_timeout to capture detail.
- Ensure autovacuum is keeping indexes/tables tidy so transactions are short.
- Keep an eye on API latency around deadlock times; if heartbeats slow, Crashed rates will rise.
If you can try “1 worker per queue” for 15–30 minutes and confirm deadlocks drop to zero, we can plan a clean rollout: shard queues or stagger poll intervals to scale safely.
Ishan Anilbhai Koradiya (09/14/2025, 3:12 PM)
Marvin (09/14/2025, 3:12 PM)
Marvin (09/14/2025, 3:16 PM)
Option A — keep a single worker on one queue and cap that queue’s concurrency at 4
# create or confirm the queue
prefect work-queue create type-x
# cap concurrency for that queue at 4
prefect work-queue set-concurrency-limit type-x 4
# start a single worker for that queue
prefect worker start -p <your-pool> -q type-x
Docs:
- Work queue CLI
- Work pools and queues
Option B — shard into multiple queues, one worker per queue
- Create 4 queues (type-x-1..4), run one worker per queue.
- Assign deployments across those queues. If you want an exact total of 4 concurrent runs, set each queue’s limit to 1.
- This avoids multiple workers hitting the same deployment rows and eliminates the deadlock you saw.
Commands:
prefect work-queue create type-x-1
prefect work-queue create type-x-2
prefect work-queue create type-x-3
prefect work-queue create type-x-4
prefect work-queue set-concurrency-limit type-x-1 1
prefect work-queue set-concurrency-limit type-x-2 1
prefect work-queue set-concurrency-limit type-x-3 1
prefect work-queue set-concurrency-limit type-x-4 1
# one worker per queue
prefect worker start -p <your-pool> -q type-x-1
prefect worker start -p <your-pool> -q type-x-2
prefect worker start -p <your-pool> -q type-x-3
prefect worker start -p <your-pool> -q type-x-4
Option C — global concurrency limit across many queues
- If “type-X” flows are spread across multiple queues/pools, use a Global Concurrency Limit (GCL) with a shared name/tag for those flows to cap them at 4 in total.
prefect gcl create type-x-cap --limit 4
Docs: Global concurrency limits
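If you also want to enforce the cap from inside the flow code (not just at the queue level), recent 3.x releases expose GCLs through a concurrency context manager. A small sketch, assuming the type-x-cap limit created above and that prefect.concurrency.sync.concurrency is available in your version:
from prefect import flow
from prefect.concurrency.sync import concurrency

def do_the_work():
    ...  # the actual "type-X" workload

@flow
def type_x_flow():
    # Blocks until one of the 4 "type-x-cap" slots is free,
    # then holds that slot for the duration of the block
    with concurrency("type-x-cap", occupy=1):
        do_the_work()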
What to avoid (and why)
- Multiple workers polling the same queue. This can cause concurrent UPDATEs to the same deployment/work-queue rows (e.g., last_polled/status) and lead to Postgres deadlocks. You’ve already hit this.
- If you must temporarily share a queue, stagger poll intervals to reduce collisions:
# different workers with different intervals
export PREFECT_WORKER_QUERY_SECONDS=7; prefect worker start -p <pool> -q <queue>
export PREFECT_WORKER_QUERY_SECONDS=11; prefect worker start -p <pool> -q <queue>
export PREFECT_WORKER_QUERY_SECONDS=13; prefect worker start -p <pool> -q <queue>
export PREFECT_WORKER_QUERY_SECONDS=17; prefect worker start -p <pool> -q <queue>
Settings reference: Settings reference
Recommendation for your setup right now
- Move to Option A: keep a single worker per queue and set a work-queue concurrency limit to 4. This will give you the parallelism you want and eliminate the deadlocks that can cascade into Crashed runs.
- Or use Option B if you prefer multiple workers: shard into multiple queues with one worker per queue.
Also consider upgrading from 3.1.15 to the latest 3.x — there have been stability improvements in orchestration/DB paths. Releases: Prefect releases
If you share how your “type-X” deployments are assigned today, I’m happy to suggest an exact queue layout and limits to get you to 4 concurrent runs safely.
Ishan Anilbhai Koradiya (09/14/2025, 3:23 PM)
Marvin (09/14/2025, 3:23 PM)
Marvin (09/14/2025, 3:24 PM)
- PREFECT_WORKER_QUERY_SECONDS, PREFECT_WORKER_PREFETCH_SECONDS, and PREFECT_WORKER_HEARTBEAT_SECONDS control how often the worker fetches work and reports health. Defaults are fine for most, but you can increase PREFECT_WORKER_PREFETCH_SECONDS slightly if you want it to grab more work ahead of time.
Good defaults
- Use one worker per queue with a queue concurrency limit of 4. That worker will steadily launch up to 4 flow runs and maintain them even if they each run for hours.
- Ensure your infra has capacity: for process workers, enough CPU/RAM; for Docker/K8s, check container/pod quotas and per-run resource requests/limits.
When to use more than one worker
- High availability across nodes or AZs
- Separate infra types (e.g., GPU vs CPU)
- Very high scheduling throughput (many short runs per minute)
But avoid multiple workers on the same queue unless you shard queues or stagger polling; otherwise you can reintroduce the deadlocks we saw.
If you share which worker type you’re using (process/docker/k8s) and typical run runtime/resource needs, I can suggest concrete resource settings to comfortably sustain 4 long-running runs.
Ishan Anilbhai Koradiya (09/15/2025, 4:16 AM)
Marvin (09/15/2025, 4:16 AM)
Marvin (09/15/2025, 4:16 AM)
Ishan Anilbhai Koradiya (09/15/2025, 4:19 AM)
Marvin (09/15/2025, 4:19 AM)
Marvin (09/15/2025, 4:25 AM)
1) Mitigate the API timeouts
- Increase the client timeout and retries:
# give calls more time (e.g. 180–300s)
export PREFECT_API_REQUEST_TIMEOUT=180
# add more resilience to transient issues
export PREFECT_CLIENT_MAX_RETRIES=8
# optionally add extra retryable status codes
export PREFECT_CLIENT_RETRY_EXTRA_CODES=429,502,503,504
- Optional: enable HTTP/2 to multiplex requests if your network allows it:
export PREFECT_API_ENABLE_HTTP2=true
- Reduce API chatter from logs to rule out log uploads as the cause:
# temporarily disable log shipping to the API
export PREFECT_LOGGING_TO_API_ENABLED=false
# or increase log batch interval/size to send fewer, larger requests
export PREFECT_LOGGING_TO_API_BATCH_INTERVAL=5
- Add retries on your flow/subflow:
from prefect import flow

@flow
def child():
    ...  # subflow body

@flow(retries=2, retry_delay_seconds=30)
def parent():
    # For a subflow, you can also apply options at call time:
    result = child.with_options(retries=2, retry_delay_seconds=15)()
- If you’re fanning out many subflows at once, consider staggering or batching them to lower API pressure.
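For example, a rough batching sketch (child and items are placeholders for your actual subflow and inputs; the batch size and pause are arbitrary starting points):
import time
from prefect import flow

@flow
def child(item):
    ...  # per-item work

@flow
def parent(items: list, batch_size: int = 5, pause_seconds: int = 10):
    # Launch subflows in small batches instead of all at once,
    # pausing between batches so API calls don't spike together
    for start in range(0, len(items), batch_size):
        for item in items[start:start + batch_size]:
            child(item)
        time.sleep(pause_seconds)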
2) Why the run is reported as Crashed
“Crashed” means the backend stopped receiving heartbeats from the process running your flow (or the process terminated unexpectedly). Common causes:
- The process/pod/container was killed (OOM, eviction, preemption, host restart).
- The runner stopped sending heartbeats due to a long blocking call or heavy CPU/GIL contention.
- A long network partition prevented heartbeats and state updates.
- Severe logging or network backpressure stalled the event loop long enough to miss heartbeats.
What you can tune/check
- Ensure heartbeats are frequent enough (min allowed is 30s):
export PREFECT_RUNNER_HEARTBEAT_FREQUENCY=30
- Check for infra-level terminations or OOM in your worker logs, container logs, or Kubernetes events.
- If you have very CPU-bound sections, consider yielding control periodically or moving the heavy work out of the flow function to tasks/subprocesses so the runner can keep sending heartbeats.
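For instance, a sketch of pushing a CPU-heavy section into a separate process from a task, so the runner stays free to send heartbeats (heavy_computation is a placeholder for your real work):
from concurrent.futures import ProcessPoolExecutor
from prefect import flow, task

def heavy_computation(n: int) -> int:
    # placeholder for a long, CPU-bound section
    return sum(i * i for i in range(n))

@task
def run_heavy(n: int) -> int:
    # Run the CPU-bound work in a separate process so the flow's
    # runner stays responsive and keeps sending heartbeats
    with ProcessPoolExecutor(max_workers=1) as pool:
        return pool.submit(heavy_computation, n).result()

@flow
def heavy_flow(n: int = 50_000_000) -> int:
    return run_heavy(n)

if __name__ == "__main__":
    heavy_flow()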
Diagnostics to help pinpoint
- Share a short snippet of the exact httpx timeout error text (connect vs read vs write vs pool timeout is very telling).
- Are you on Prefect Cloud or self-hosted Prefect Server? If Server, is there a reverse proxy with aggressive timeouts (e.g., 60s idle/read timeout)? That can cause mid-run timeouts.
- Which worker and infrastructure type are you using (process, Docker, Kubernetes)? Any signs of OOM/eviction?
- Roughly how many subflows are launched concurrently when this happens?
- Check the run timeline for “missed heartbeat” windows: Detect zombie flows (crash detection)
- If on Cloud, this guide can help spot environment issues: Troubleshoot Prefect Cloud
- Confirm effective settings with:
prefect profile inspect
If you can share:
- A snippet of the httpx stack trace (connect/read/write/total timeout),
- Where in the subflow fan-out it occurs,
- Your environment (Cloud vs Server, worker type, k8s/pod details, any proxy),
…I can recommend more targeted changes.
Ishan Anilbhai Koradiya (09/15/2025, 6:43 AM)
Marvin (09/15/2025, 6:43 AM)
Marvin (09/15/2025, 6:44 AM)
httpx.ReadTimeout specifically means the request connected and started waiting for the response body, but the read didn’t complete within the client’s read timeout. In Prefect 3.x the default API request timeout is 60s, so long-running/slow API responses (or a proxy/load-balancer with ~60s read/idle timeouts) can trigger this “in the middle” of a run, especially with many subflows.
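For context on “connect vs read vs write vs pool”: httpx tracks these four phases separately, which is why the exception class itself is a useful clue. A standalone illustration (not Prefect’s internal client; the values are arbitrary):
import httpx

# The four httpx timeout phases:
#   connect - establishing the TCP/TLS connection
#   read    - waiting for response bytes (this is what raises ReadTimeout)
#   write   - sending the request body
#   pool    - waiting for a free connection from the pool
timeout = httpx.Timeout(connect=10.0, read=180.0, write=30.0, pool=10.0)

with httpx.Client(timeout=timeout) as client:
    response = client.get("https://example.com/")
    response.raise_for_status()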
Why it may be reported as Crashed
- Prefect marks a run as Crashed when heartbeats stop (e.g., the process is blocked/killed or can’t reach the API long enough). A long, blocking network call that repeatedly times out can cause missed heartbeats, so the backend assumes the process died. See Detect zombie flows.
Quick mitigations to try
- Increase API request timeout (try 180–300s) and client retries:
export PREFECT_API_REQUEST_TIMEOUT=180
export PREFECT_CLIENT_MAX_RETRIES=8
export PREFECT_CLIENT_RETRY_EXTRA_CODES=429,502,503,504
- Optional: enable HTTP/2 if your network supports it (can improve multiplexing under load):
export PREFECT_API_ENABLE_HTTP2=true
- Reduce API chatter from log uploads (to rule out log shipping timeouts):
export PREFECT_LOGGING_TO_API_ENABLED=false
# or, send fewer batches
export PREFECT_LOGGING_TO_API_BATCH_INTERVAL=5
- Add retries to the parent/subflows for resilience against transient failures:
from prefect import flow

@flow(retries=2, retry_delay_seconds=30)
def parent():
    # `child` is a subflow defined elsewhere, as in the earlier example
    child_res = child.with_options(retries=2, retry_delay_seconds=15)()
- Ensure heartbeats aren’t too sparse (min allowed is 30s):
export PREFECT_RUNNER_HEARTBEAT_FREQUENCY=30
Diagnostics to pinpoint the exact call timing out
- Turn on debug logging and (temporarily) keep logs local:
export PREFECT_LOGGING_LEVEL=DEBUG
export PREFECT_LOGGING_TO_API_ENABLED=false
- Also capture httpx/httpcore debug logs in your run entrypoint to see which endpoint is stalling:
import logging
logging.getLogger("httpx").setLevel("DEBUG")
logging.getLogger("httpcore").setLevel("DEBUG")
- Confirm the active settings:
prefect profile inspect
- If you’re self-hosting Prefect Server behind a reverse proxy (NGINX/ALB/etc.), check and raise proxy_read_timeout/idle timeouts beyond 60s. 60s defaults are a very common cause of mid-run ReadTimeouts.
A few clarifying questions so I can give more targeted guidance:
- Are you on Prefect Cloud or self-hosted Prefect Server? If Server, is there a proxy/ingress in front of it?
- Which worker/infrastructure are you using (process, Docker, Kubernetes)? Any OOM/eviction or node restarts around the same time?
- Roughly how many subflows are launched concurrently when the timeout occurs?
- Does the timeout happen close to the 60s mark? If yes, that strongly suggests a client or proxy timeout.
Related docs:
- Crash detection and “zombie” runs: Detect zombie flows
- Worker/runner healthchecks: Worker healthchecks
- Troubleshooting Cloud: Troubleshoot Prefect Cloud
If you can share a slightly larger slice of the stack (ideally a few DEBUG lines showing the URL/endpoint right before the timeout) and your environment details, I’ll help zero in on the exact bottleneck and the best setting(s) to adjust.
Ishan Anilbhai Koradiya (09/15/2025, 3:54 PM)
Marvin (09/15/2025, 3:54 PM)
Marvin (09/15/2025, 3:56 PM)
- If you can, pause work pools before the restart so no new runs start, then resume after:
prefect work-pool pause <pool-name>
# ... wait for runs to finish ...
prefect work-pool resume <pool-name>
- You can also temporarily stop workers to ensure no new pulls occur.
- If you must restart while jobs are running:
- Keep the outage short (ideally < 1–2 heartbeat intervals) to reduce “Crashed” markings.
- Consider temporarily increasing client timeouts/retries on workers to ride through the blip:
export PREFECT_API_REQUEST_TIMEOUT=180
export PREFECT_CLIENT_MAX_RETRIES=8
- Optionally bump heartbeat spacing so fewer are missed during a brief restart:
export PREFECT_RUNNER_HEARTBEAT_FREQUENCY=60
- Be aware you may lose some logs/artifacts during the downtime.
Related docs
- Crash detection and heartbeat behavior: Detect zombie flows
- Worker healthchecks overview: Worker healthchecks
If you share how long your restart typically takes and whether your flows create subflows mid-run, I can suggest the least disruptive sequence (e.g., quick pause/resume vs. draining runs).
Ishan Anilbhai Koradiya (09/15/2025, 4:05 PM)
Marvin (09/15/2025, 4:05 PM)
Marvin (09/15/2025, 4:08 PM)
Ishan Anilbhai Koradiya (09/15/2025, 4:19 PM)
Marvin (09/15/2025, 4:19 PM)
Marvin (09/15/2025, 4:21 PM)
- Give API calls more time and add client retries:
export PREFECT_API_REQUEST_TIMEOUT=180
export PREFECT_CLIENT_MAX_RETRIES=8
export PREFECT_CLIENT_RETRY_EXTRA_CODES=429,502,503,504
- Optional: enable HTTP/2 to improve multiplexing if your network supports it:
export PREFECT_API_ENABLE_HTTP2=true
- Reduce API chatter from logs to rule out log upload stalls:
export PREFECT_LOGGING_TO_API_ENABLED=false
# or send fewer, larger batches
export PREFECT_LOGGING_TO_API_BATCH_INTERVAL=5
- If you’re self-hosting behind NGINX/ALB/etc., raise proxy_read_timeout/idle timeouts above 60s.
If you can share:
- Roughly how many subflows you spin up and whether they’re sequential or concurrent
- Cloud vs self-hosted Server (and whether there’s a proxy/ingress)
- Whether the timeout happens near the 60s mark
…I can suggest a concrete throttling pattern and exact settings for your setup. For background on crash/heartbeat behavior: Detect zombie flows.
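In the meantime, one possible throttling pattern to adapt (a sketch assuming async subflows; child, items, and the concurrency cap are placeholders, not your actual code):
import asyncio
from prefect import flow

@flow
async def child(item):
    ...  # per-item work

@flow
async def parent(items: list, max_concurrent: int = 4):
    # Cap how many subflows run at once so state updates, logs,
    # and heartbeats don't all hit the API at the same moment
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(item):
        async with semaphore:
            return await child(item)

    return await asyncio.gather(*(run_one(item) for item in items))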