vineeth
08/19/2025, 10:33 AM

Marvin
08/19/2025, 10:36 AM
Add activeDeadlineSeconds to your work pool’s base Job template. Once that deadline elapses (time spent Pending counts toward it), Kubernetes marks the Job as failed and Prefect will reflect the failure on the flow run.
- Example (Job spec level):
  spec:
    activeDeadlineSeconds: 1800  # 30 minutes
    backoffLimit: 0              # no Kubernetes-level pod retries; let Prefect retries re-submit
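  For placement, here’s a hedged sketch of how that nests inside a Kubernetes work pool’s base job template — the real template is JSON (export it with prefect work-pool get-default-base-job-template --type kubernetes, then re-apply with prefect work-pool update "my-pool" --base-job-template template.json; pool and file names are placeholders), shown as YAML for readability:
  job_configuration:
    job_manifest:
      apiVersion: batch/v1
      kind: Job
      spec:
        activeDeadlineSeconds: 1800  # hard wall-clock cap; Pending time counts
        backoffLimit: 0              # fail fast instead of re-running the pod
        template:
          spec:
            ...  # keep your existing containers, restartPolicy, etc.
  Verify the field names against your exported template before applying; only spec-level keys shown here are standard Kubernetes Job fields.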
- Use Prefect Automations to detect “hasn’t started after N minutes” and cancel/notify
- Pair this with flow retries so the run tries again later when capacity frees up:
  from prefect import flow

  @flow(retries=3, retry_delay_seconds=600)  # up to 3 retries, 10 minutes apart
  def my_flow():
      ...
- Set a concurrency limit on your work pool so the worker doesn’t launch more Jobs than your cluster can schedule (e.g., prefect work-pool set-concurrency-limit "my-pool" 10 — pool name and limit are placeholders).
- Fix it at the cluster level: raise max pods per node, add nodes or enable the cluster autoscaler, or use priority classes/preemption so these pods aren’t starved (see the PriorityClass sketch below).
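  If you go the priority-class route, a minimal sketch — the name prefect-flow-runs and the value are placeholders to adapt to your cluster:
  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: prefect-flow-runs  # placeholder name
  value: 1000000             # higher value = scheduled first, may preempt lower-priority pods
  globalDefault: false
  description: "Keep Prefect flow-run pods from being starved"
  Then reference it from the Job manifest in your base job template via spec.template.spec.priorityClassName: prefect-flow-runs.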
Helpful docs:
- Workers and work pools: https://docs-3.prefect.io/v3/concepts/workers and https://docs-3.prefect.io/v3/concepts/work-pools
- Automations: https://docs-3.prefect.io/v3/concepts/automations
- Kubernetes Job termination (activeDeadlineSeconds): https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
If you can share:
- Prefect version (assuming 3.x)
- Worker type (Kubernetes work pool?)
- Exact run state transitions and any worker logs
…I can be more precise, especially about the “crashing” you mentioned.