Denys Y
09/29/2025, 10:47 AM
Marvin
09/29/2025, 10:47 AM
Marvin
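09/29/2025, 10:52 AM
- Set activeDeadlineSeconds on the Job so Kubernetes marks it Failed once the Job has been active for N seconds, which covers a pod stuck in ImagePullBackOff. Prefect will then mark the flow run as failed/crashed. The deadline counts from the Job's start time and keeps applying while the flow runs, so N must exceed the flow's expected runtime.
- Optionally, set a small backoffLimit as well.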
Example when deploying a flow (Prefect 3.x), passing job variables to the Kubernetes worker:
from prefect import flow


@flow
def my_flow():
    ...


deployment = (
    my_flow
    .from_source(
        source="https://github.com/owner/repo",
        entrypoint="path/to/flow.py:my_flow",
    )
    .deploy(
        name="k8s-deploy",
        work_pool_name="my-k8s-pool",
        job_variables={
            "image": "myrepo/myimage:tag",
            # Fail the Job once it has been active for 5 minutes.
            # This also caps the flow's runtime, so raise it for longer flows.
            "job": {
                "spec": {
                    "activeDeadlineSeconds": 300,
                    "backoffLimit": 0
                }
            },
            # Example: ensure your registry creds are set so you don’t hit ImagePullBackOff
            "pod": {
                "spec": {
                    "imagePullSecrets": [{"name": "my-regcred"}]
                }
            }
        },
    )
)
Notes:
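- The exact shape of job_variables depends on how your work pool's base job template maps variables into the Job/Pod spec in your cluster; keys such as job and pod only take effect if the template references them. The goal is to surface spec.activeDeadlineSeconds (and optionally backoffLimit) on the Job Kubernetes creates.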
- Docs: Kubernetes deployments and Workers
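Because the deadline covers the Job's whole active time, one way to pick it is to budget for startup plus runtime. A minimal sketch: build_job_variables and its parameters are hypothetical names, and the job/pod keys mirror the example above on the assumption that your base job template exposes them.

```python
def build_job_variables(image, startup_allowance_s, expected_runtime_s, pull_secret=None):
    """Build a job_variables dict with a Job-level deadline.

    Hypothetical helper: the "job" and "pod" keys follow the example in this
    thread and only matter if the work pool's base job template uses them.
    """
    if startup_allowance_s <= 0 or expected_runtime_s <= 0:
        raise ValueError("durations must be positive")

    variables = {
        "image": image,
        "job": {
            "spec": {
                # Deadline = time allowed to start + time allowed to run.
                "activeDeadlineSeconds": startup_allowance_s + expected_runtime_s,
                # Do not recreate the pod after a failure.
                "backoffLimit": 0,
            }
        },
    }
    if pull_secret:
        variables["pod"] = {"spec": {"imagePullSecrets": [{"name": pull_secret}]}}
    return variables


# 5 minutes to start, 1 hour to run.
job_variables = build_job_variables(
    "myrepo/myimage:tag", 300, 3600, pull_secret="my-regcred"
)
print(job_variables["job"]["spec"]["activeDeadlineSeconds"])
```

The resulting dict can be passed as job_variables in the deploy call. If you only want to bound startup time and not runtime, the Automation option is likely a better fit.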
2) Fail runs with a Prefect Automation if they’re stuck in Submitted/Pending
- In Prefect Cloud, create an Automation that triggers when a flow run has been in Submitted (or Pending) for more than N minutes and then sets the flow run state to Failed (or cancels the run).
- This is a good safety net that doesn’t rely on Kubernetes marking the Job failed.
- Docs: Automations
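For reference, a "stuck in state" Automation is a proactive trigger: it fires when an expected follow-up event does not arrive in time. The sketch below only builds the trigger definition as a plain dict. stuck_flow_run_trigger is a hypothetical helper, and the field and event names reflect my understanding of Prefect's event-trigger schema, so please verify them against the Automations docs before relying on them.

```python
import json


def stuck_flow_run_trigger(stuck_state="Pending", within_seconds=600):
    """Describe a proactive trigger for flow runs stuck in `stuck_state`.

    It fires when a flow run enters `stuck_state` and none of the follow-up
    state events is seen within `within_seconds`. Field names are assumed to
    match Prefect's event-trigger schema; check the Automations docs.
    """
    follow_ups = ["Running", "Completed", "Failed", "Crashed", "Cancelled"]
    return {
        "type": "event",
        "posture": "Proactive",
        # Evaluate every flow run separately.
        "match": {"prefect.resource.id": "prefect.flow-run.*"},
        "for_each": ["prefect.resource.id"],
        # Start the clock when the run enters the stuck state...
        "after": [f"prefect.flow-run.{stuck_state}"],
        # ...and fire if none of these events follows in time.
        "expect": [f"prefect.flow-run.{state}" for state in follow_ups],
        "threshold": 1,
        "within": within_seconds,
    }


trigger = stuck_flow_run_trigger("Pending", within_seconds=10 * 60)
print(json.dumps(trigger, indent=2))
```

You would pair a trigger like this with an action that changes the flow run's state to Failed or cancels the run.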
Practical mitigations to reduce ImagePullBackOff
- Ensure correct imagePullSecrets on the Pod spec (private registries).
- Use IfNotPresent pull policy to avoid unnecessary pulls when the image is cached on nodes.
- Pre-pull your base images on nodes (DaemonSet or node init strategy).
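As an illustration of the pre-pull idea, here is a minimal sketch that generates a DaemonSet manifest. prepull_daemonset, the resource name, and the pause image tag are assumptions for illustration, and it assumes your flow image ships a sh binary. The init container pulls the image and exits; the pause container just keeps the pod alive.

```python
import json


def prepull_daemonset(image, name="prepull-flow-image", pull_secret=None):
    """Build a DaemonSet manifest that pulls `image` once on every node.

    The init container pulls the image and exits immediately; the pause
    container then keeps the pod running so the DaemonSet stays healthy.
    Assumes the image contains `sh`.
    """
    pod_spec = {
        "initContainers": [
            {
                "name": "prepull",
                "image": image,
                "imagePullPolicy": "IfNotPresent",
                "command": ["sh", "-c", "true"],
            }
        ],
        "containers": [{"name": "pause", "image": "registry.k8s.io/pause:3.9"}],
    }
    if pull_secret:
        pod_spec["imagePullSecrets"] = [{"name": pull_secret}]

    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }


manifest = prepull_daemonset("myrepo/myimage:tag", pull_secret="my-regcred")
print(json.dumps(manifest, indent=2))
```

kubectl apply -f accepts JSON as well as YAML, so the printed manifest can be saved to a file and applied directly.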
Questions to make sure we land the config correctly:
- Are you on Prefect 3.x with the Kubernetes worker?
- Can you share how you’re deploying (Python deploy call vs CLI) and your current job_variables, if any?
- Do you prefer failing at the Kubernetes Job level (via activeDeadlineSeconds) or via a Prefect Automation?