Great question. Assuming you’re using Prefect 3.x with the Kubernetes worker, here’s how crashes and recoveries are handled:
What happens when a pod crashes
- State detection
- Heartbeats: The flow process sends heartbeats to Prefect. If heartbeats stop for long enough, Prefect marks the flow run as “Crashed” (zombie detection). See Detect zombie flows.
- Kubernetes observer (optional): Prefect’s Kubernetes observer component watches your Jobs/Pods and reconciles Prefect states with Kubernetes status (e.g., a Job failing results in a “Crashed/Failed” state in the UI). See the Kubernetes worker guide.
- If Kubernetes restarts the pod (Job-level restart)
- This is the same flow-run attempt. It does not consume a Prefect retry.
- Your flow process will start again in the restarted container. Any work that wasn’t durably persisted may re-run, so make tasks idempotent and/or persist/cache results if you want resumability (see the sketch at the end of this list).
- If the Job ultimately fails (backoffLimit exhausted)
- Prefect will mark the flow run as Crashed/Failed.
- If you configured flow-level retries, Prefect will reschedule a new attempt and the worker will create a fresh Kubernetes Job. Task-level retries also apply within a single attempt.
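- For the idempotency point above, a minimal sketch (the output path and naming are hypothetical): writes are keyed by a deterministic identifier, so repeating the task after a pod restart overwrites the same output instead of duplicating work.
```python
from pathlib import Path
from prefect import task

@task(retries=3, retry_delay_seconds=10)
def write_report(day: str, payload: str) -> Path:
    # Key the output by a deterministic identifier (the date), so a re-run
    # after a pod restart overwrites the same file rather than duplicating it.
    out = Path("reports") / f"{day}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(payload)  # overwrite-in-place is safe to repeat
    return out
```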
How Prefect retries work
- Flow-level retries: Re-run the entire flow run after a failure/crash.
- Task-level retries: Re-run just the failing task within the same flow run attempt.
- Neither depends on the Kubernetes Job backoff; Prefect orchestrates these retries itself once the run is marked Failed/Crashed.
- Docs: Zombie detection and Kubernetes worker.
Interplay with Kubernetes Job restartPolicy/backoffLimit
- Kubernetes restartPolicy/backoffLimit control container/pod restarts for the same Job attempt. These restarts do not consume Prefect retries.
- Best practice:
- Keep Job backoffLimit small (often 0–1) so Kubernetes doesn’t re-run a lot of work that Prefect is better at retrying.
- Allow a small backoff if you want Kubernetes to handle transient container startup issues (e.g., image pull failures or node preemption).
- If you need strong “resume” behavior after pod restarts, persist results and enable task caching so restarted attempts can skip completed work.
If the worker crashes
- Existing Jobs keep running in Kubernetes; they do not depend on the worker process.
- New flow runs won’t be picked up until a worker is available again.
- If a flow run was active and the pod died while the worker was down, the observer and/or zombie detection will reconcile the state when they’re active again.
How to tune behavior
- Configure retries
- Flow:
```python
from prefect import flow

@flow(retries=2, retry_delay_seconds=30, persist_result=True)
def my_flow():
    ...
```
- Task:
```python
from prefect import task

@task(retries=3, retry_delay_seconds=[10, 30, 60], persist_result=True)
def my_task():
    ...
```
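- If you’d rather not enumerate delays by hand, Prefect also ships an exponential_backoff helper for task retries (a sketch; confirm the helper against your Prefect version):
```python
from prefect import task
from prefect.tasks import exponential_backoff

@task(
    retries=4,
    retry_delay_seconds=exponential_backoff(backoff_factor=10),  # ~10s, 20s, 40s, 80s
    retry_jitter_factor=0.5,  # add jitter so simultaneous retries don't stampede
)
def flaky_call():
    ...
```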
- Persist/cache results for resumability
- Persisting results and using task caching allows restarted pods or new attempts to skip already-computed work. See storage/result configuration in the Kubernetes worker guide.
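- A minimal caching sketch, assuming Prefect 3.x cache policies (adjust the policy and result storage to your setup):
```python
from datetime import timedelta
from prefect import task
from prefect.cache_policies import INPUTS

@task(
    persist_result=True,                  # store the result durably
    cache_policy=INPUTS,                  # reuse results when inputs match
    cache_expiration=timedelta(hours=6),  # optional: don't reuse stale results
)
def expensive_step(batch_id: str) -> str:
    ...
```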
- Keep Kubernetes Job backoff small
- Set a low backoffLimit and an appropriate restartPolicy in your work pool’s base job template so Prefect handles most retry logic.
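- Illustratively, the relevant fields in the Job manifest portion of the base job template look like this (shown as a Python dict for shape only; the field names are standard Kubernetes Job spec fields, and your actual template contains much more):
```python
# Shape-only sketch of the Job manifest fields discussed above.
job_manifest_fragment = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "backoffLimit": 0,  # let Prefect own retries instead of Kubernetes
        "template": {
            "spec": {
                "restartPolicy": "Never",  # don't restart containers in place
            }
        },
    },
}
```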
- Use timeouts where appropriate
- Add timeout_seconds on tasks/flows to avoid indefinite hangs that would only be cleaned up by zombie detection.
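- A minimal sketch (the durations are arbitrary; pick values that fit your workloads):
```python
from prefect import flow, task

@task(timeout_seconds=600)    # fail this task if it runs longer than 10 minutes
def long_call():
    ...

@flow(timeout_seconds=3600)   # fail the whole flow run after 1 hour
def my_flow():
    long_call()
```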
- Consider running the Kubernetes observer so Job/Pod failures are reflected promptly in Prefect flow run states, rather than waiting on heartbeat-based zombie detection alone.