Great question. Assuming you’re using Prefect 3.x with the Kubernetes worker, here’s how crashes and recoveries are handled:
What happens when a pod crashes
- State detection
- Heartbeats: The flow process sends heartbeats to Prefect. If heartbeats stop for long enough, Prefect marks the flow run as “Crashed” (zombie detection). See Detect zombie flows.
- Kubernetes observer (optional): Prefect’s Kubernetes observer component watches your Jobs/Pods and reconciles Prefect states with Kubernetes status (e.g., a Job failing results in a “Crashed/Failed” state in the UI). See the Kubernetes worker guide.
- If Kubernetes restarts the pod (Job-level restart)
- This is the same flow-run attempt. It does not consume a Prefect retry.
- Your flow process will start again in the restarted container. Any work that wasn’t durably persisted may re-run, so make tasks idempotent and/or persist/cache results if you want resumability (see the sketch at the end of this list).
- If the Job ultimately fails (backoffLimit exhausted)
- Prefect will mark the flow run as Crashed/Failed.
- If you configured flow-level retries, Prefect will reschedule a new attempt and the worker will create a fresh Kubernetes Job. Task-level retries also apply within a single attempt.
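- For the idempotency point above, a minimal sketch (the output path and naming are hypothetical): writes are keyed by a deterministic identifier, so repeating the task after a pod restart overwrites the same output instead of duplicating work.
```python
from pathlib import Path
from prefect import task

@task(retries=3, retry_delay_seconds=10)
def write_report(day: str, payload: str) -> Path:
    # Key the output by a deterministic identifier (the date), so a re-run
    # after a pod restart overwrites the same file rather than duplicating it.
    out = Path("reports") / f"{day}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(payload)  # overwrite-in-place is safe to repeat
    return out
```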
How Prefect retries work
- Flow-level retries: Re-run the entire flow run after a failure/crash.
- Task-level retries: Re-run just the failing task within the same flow run attempt.
- Neither depends on the Kubernetes Job backoff; Prefect orchestrates these retries itself once the run is marked Failed/Crashed.
- Docs: Zombie detection and Kubernetes worker.
Interplay with Kubernetes Job restartPolicy/backoffLimit
- Kubernetes restartPolicy/backoffLimit control container/pod restarts for the same Job attempt. These restarts do not consume Prefect retries.
- Best practice:
- Keep Job backoffLimit small (often 0–1) so Kubernetes doesn’t re-run a lot of work that Prefect is better at retrying.
- Allow a small backoff if you want Kubernetes to handle transient container startup issues (e.g., image pull failures or node preemption).
- If you need strong “resume” behavior after pod restarts, persist results and enable task caching so restarted attempts can skip completed work.
If the worker crashes
- Existing Jobs keep running in Kubernetes; they do not depend on the worker process.
- New flow runs won’t be picked up until a worker is available again.
- If a flow run was active and the pod died while the worker was down, the observer and/or zombie detection will reconcile the state when they’re active again.
How to tune behavior
- Configure retries
- Flow:
```python
from prefect import flow

@flow(retries=2, retry_delay_seconds=30, persist_result=True)
def my_flow():
    ...
```
- Task:
```python
from prefect import task

@task(retries=3, retry_delay_seconds=[10, 30, 60], persist_result=True)
def my_task():
    ...
```
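- If you’d rather not enumerate delays by hand, Prefect also ships an exponential_backoff helper for task retries (a sketch; confirm the helper against your Prefect version):
```python
from prefect import task
from prefect.tasks import exponential_backoff

@task(
    retries=4,
    retry_delay_seconds=exponential_backoff(backoff_factor=10),  # ~10s, 20s, 40s, 80s
    retry_jitter_factor=0.5,  # add jitter so simultaneous retries don't stampede
)
def flaky_call():
    ...
```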
- Persist/cache results for resumability
- Persisting results and using task caching allows restarted pods or new attempts to skip already-computed work. See storage/result configuration in the Kubernetes worker guide.
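- A minimal caching sketch, assuming Prefect 3.x cache policies (adjust the policy and result storage to your setup):
```python
from datetime import timedelta
from prefect import task
from prefect.cache_policies import INPUTS

@task(
    persist_result=True,                  # store the result durably
    cache_policy=INPUTS,                  # reuse results when inputs match
    cache_expiration=timedelta(hours=6),  # optional: don't reuse stale results
)
def expensive_step(batch_id: str) -> str:
    ...
```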
- Keep Kubernetes Job backoff small
- Set a low backoffLimit and an appropriate restartPolicy in your work pool’s base job template so Prefect handles most retry logic.
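- Illustratively, the relevant fields in the Job manifest portion of the base job template look like this (shown as a Python dict for shape only; the field names are standard Kubernetes Job spec fields, and your actual template contains much more):
```python
# Shape-only sketch of the Job manifest fields discussed above.
job_manifest_fragment = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "backoffLimit": 0,  # let Prefect own retries instead of Kubernetes
        "template": {
            "spec": {
                "restartPolicy": "Never",  # don't restart containers in place
            }
        },
    },
}
```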
- Use timeouts where appropriate
- Add timeout_seconds on tasks/flows to avoid indefinite hangs that would only be cleaned up by zombie detection.
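- A minimal sketch (the durations are arbitrary; pick values that fit your workloads):
```python
from prefect import flow, task

@task(timeout_seconds=600)    # fail this task if it runs longer than 10 minutes
def long_call():
    ...

@flow(timeout_seconds=3600)   # fail the whole flow run after 1 hour
def my_flow():
    long_call()
```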
- Consider running the Kubernetes observer so Job/Pod failures are reflected promptly in Prefect flow run states, rather than waiting on heartbeat-based zombie detection alone.