Prefect Community #ask-marvin

Scott Pedersen

07/23/2025, 12:48 AM

@Marvin if a pod takes a minute to start after Prefect has created the Kubernetes job, Prefect will consider this a crash. Is there a parameter that can be tweaked to increase the time, if a pod takes longer than normal to start?

Marvin

07/23/2025, 12:48 AM

✅ thought for 37.2 seconds

Marvin

07/23/2025, 12:49 AM

Based on my research, I can help you with the pod startup timeout issue. Yes, there is a specific parameter you can adjust to increase the time Prefect waits for a pod to start before considering it a crash. In Prefect 3.x, you can configure the Pod Watch Timeout Seconds (`pod_watch_timeout_seconds`) parameter. This setting determines how long Prefect will wait for a pod to start before considering it failed. To configure it, set the value in the base job template when creating your work pool. For example:

json
{
    "job_configuration": {
        "pod_watch_timeout_seconds": 300
    }
}

The default value is typically around 60 seconds (1 minute), which is why you're seeing the timeout. Setting it to a higher value like 300 seconds (5 minutes) gives your pods more time to start up, which is especially useful when:
- Using serverless node pools that need time to spin up
- Working with large container images that take longer to pull
- Dealing with resource constraints that might delay pod scheduling

This setting is particularly important when:
- Your cluster is using auto-scaling node groups
- You're running on serverless Kubernetes offerings
- Your pods require significant initialization time

You can find more details about Kubernetes configuration in the Prefect documentation. Would you like me to provide more specific guidance on how to update this setting for your particular setup?

Scott Pedersen

07/23/2025, 12:49 AM

@Marvin is `pod_watch_timeout_seconds` available in Prefect 2.20.16?

Marvin

07/23/2025, 12:49 AM

✅ thought for 31.3 seconds

Marvin

07/23/2025, 12:50 AM

Yes, I can confirm that `pod_watch_timeout_seconds` is available in Prefect 2.20.16. You can configure it in two ways:

1. In your Kubernetes work pool's base job template:

json
{
    "job_configuration": {
        "pod_watch_timeout_seconds": 300
    }
}

2. Or when creating a KubernetesJob directly in code:

python
from prefect.infrastructure import KubernetesJob

# Wait up to 300 seconds (5 minutes) for the pod to start
kubernetes_job = KubernetesJob(
    pod_watch_timeout_seconds=300,
)

The default value is 60 seconds, so increasing it to 300 seconds (5 minutes) or more should help with pods that take longer to start up. This setting has been available in Prefect 2.x for quite a while and is definitely present in version 2.20.16. It serves the same purpose as in Prefect 3.x - it determines how long Prefect will wait for a pod to start before considering it failed. Would you like me to provide more specific guidance on how to implement this in your particular setup?
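To make the meaning of the timeout concrete, here is a minimal sketch of a "wait for the pod to start, or give up" loop. It is not Prefect's implementation; `wait_for_pod_start`, `PodStartTimeout`, and the `get_phase` callback are hypothetical names used only for illustration, and the sketch assumes the pod reports the phase `Pending` until it has started.

```python
import time
from typing import Callable


class PodStartTimeout(Exception):
    """Raised when the pod has not left the Pending phase in time."""


def wait_for_pod_start(
    get_phase: Callable[[], str],
    timeout_seconds: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_phase() until the pod leaves 'Pending' or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        phase = get_phase()
        if phase != "Pending":
            # The pod started (or failed outright); stop waiting.
            return phase
        if clock() >= deadline:
            raise PodStartTimeout(
                f"pod still Pending after {timeout_seconds} seconds"
            )
        sleep(poll_interval)
```

Under these assumptions, a pod that needs about 90 seconds to leave `Pending` would raise `PodStartTimeout` with `timeout_seconds=60` and return normally with `timeout_seconds=300`, which is the effect of raising the setting.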

Scott Pedersen

07/23/2025, 1:28 AM

@Marvin what does the default base job template look like?

Marvin

07/23/2025, 1:28 AM

✅ thought for 163.6 seconds

Marvin

07/23/2025, 1:31 AM

Based on my research, here's the default base job template for a Kubernetes work pool in Prefect:

json
{
    "variables": {
        "type": "object",
        "properties": {
            "image": {
                "type": "string",
                "title": "Image",
                "description": "The image to use for the job"
            },
            "namespace": {
                "type": "string",
                "title": "Namespace",
                "default": "default",
                "description": "The Kubernetes namespace to create jobs in"
            }
        },
        "required": ["image"]
    },
    "job_configuration": {
        "job_manifest": {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {
                "namespace": "{{ namespace }}"
            },
            "spec": {
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "prefect-job",
                                "image": "{{ image }}",
                                "resources": {
                                    "requests": {
                                        "cpu": "100m",
                                        "memory": "256Mi"
                                    },
                                    "limits": {
                                        "memory": "512Mi"
                                    }
                                }
                            }
                        ],
                        "restartPolicy": "Never"
                    }
                },
                "backoffLimit": 0
            }
        },
        "pod_watch_timeout_seconds": 60,
        "job_watch_timeout_seconds": 3600,
        "stream_output": true
    }
}

Key points about the default template:

1. Default timeouts:
   - `pod_watch_timeout_seconds`: 60 seconds (1 minute)
   - `job_watch_timeout_seconds`: 3600 seconds (1 hour)
2. Default resource requests:
   - CPU: 100m (0.1 CPU cores)
   - Memory request: 256Mi
   - Memory limit: 512Mi
3. Important default settings:
   - `restartPolicy`: "Never"
   - `backoffLimit`: 0 (job will not be retried if it fails)
   - `stream_output`: true (logs will be streamed to Prefect)

You can customize this template when creating a work pool either through the UI or using the CLI with a JSON file. For example, to increase the pod watch timeout, you would modify the `pod_watch_timeout_seconds` value in the `job_configuration` section. Would you like me to explain how to modify any specific part of this template for your use case?
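As a rough illustration of how the `variables` section relates to the `{{ image }}` and `{{ namespace }}` placeholders in `job_configuration`, the sketch below merges schema defaults with supplied values and substitutes them into the template. It is a simplified stand-in, not Prefect's templating code; `resolve_variables`, `render`, and the image name `example/flow:latest` are hypothetical, and the template is a trimmed copy of the one shown in this thread.

```python
import re
from typing import Any, Optional

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_variables(template: dict, overrides: Optional[dict] = None) -> dict:
    """Merge schema defaults with overrides and check required variables."""
    schema = template["variables"]
    values = {
        name: spec["default"]
        for name, spec in schema.get("properties", {}).items()
        if "default" in spec
    }
    values.update(overrides or {})
    missing = [n for n in schema.get("required", []) if n not in values]
    if missing:
        raise ValueError(f"missing required variables: {missing}")
    return values


def render(node: Any, values: dict) -> Any:
    """Recursively replace {{ name }} placeholders inside string values."""
    if isinstance(node, dict):
        return {key: render(value, values) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, values) for item in node]
    if isinstance(node, str):
        # An unknown placeholder name raises KeyError here.
        return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), node)
    return node


# Trimmed copy of the template from this thread.
template = {
    "variables": {
        "properties": {
            "image": {"type": "string"},
            "namespace": {"type": "string", "default": "default"},
        },
        "required": ["image"],
    },
    "job_configuration": {
        "job_manifest": {
            "metadata": {"namespace": "{{ namespace }}"},
            "spec": {
                "template": {"spec": {"containers": [{"image": "{{ image }}"}]}}
            },
        },
        "pod_watch_timeout_seconds": 60,
    },
}

values = resolve_variables(template, {"image": "example/flow:latest"})
job_configuration = render(template["job_configuration"], values)
```

Non-string values such as the timeout pass through unchanged, so only the placeholder strings are affected by the substitution.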

Scott Pedersen

07/23/2025, 1:54 AM

@Marvin, what does the YAML version of the job manifest look like?

Marvin

07/23/2025, 1:54 AM

✅ thought for 8.9 seconds

Marvin

07/23/2025, 1:55 AM

Let me convert the job manifest portion of the default template to YAML format for better readability. The default Kubernetes job manifest in YAML format looks like this:

yaml
apiVersion: batch/v1
kind: Job
metadata:
  namespace: "{{ namespace }}"
spec:
  template:
    spec:
      containers:
        - name: prefect-job
          image: "{{ image }}"
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
            limits:
              memory: "512Mi"
      restartPolicy: Never
  backoffLimit: 0

This is the core job manifest that Prefect uses by default. You can customize it by adding additional fields that Kubernetes jobs support, such as:

yaml
apiVersion: batch/v1
kind: Job
metadata:
  namespace: "{{ namespace }}"
  labels:
    app: prefect-job
spec:
  template:
    metadata:
      labels:
        app: prefect-job
    spec:
      containers:
        - name: prefect-job
          image: "{{ image }}"
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
            limits:
              memory: "512Mi"
          env:
            - name: PREFECT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: prefect-api-key
                  key: key
          volumeMounts:
            - name: my-volume
              mountPath: /data
      volumes:
        - name: my-volume
          emptyDir: {}
      imagePullSecrets:
        - name: my-registry-secret
      restartPolicy: Never
      serviceAccountName: prefect-service-account
  backoffLimit: 0

Common customizations include:
- Adding environment variables
- Mounting volumes
- Setting image pull secrets
- Configuring service accounts
- Adding node selectors or tolerations
- Setting security contexts
- Adding labels and annotations

Would you like me to elaborate on any specific customization aspect?
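Node selectors and tolerations are listed above but not shown in the YAML, so here is a small sketch of adding them to a manifest held as a Python dictionary. `nodeSelector` and `tolerations` are standard Kubernetes pod spec fields; the helper name `with_scheduling` and the label and taint values (`workload: prefect`, `dedicated=prefect`) are hypothetical examples, not values from this thread.

```python
import copy


def with_scheduling(manifest: dict, node_selector: dict, tolerations: list) -> dict:
    """Return a copy of the job manifest with nodeSelector and tolerations set."""
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    pod_spec = updated["spec"]["template"]["spec"]
    pod_spec["nodeSelector"] = dict(node_selector)
    pod_spec["tolerations"] = copy.deepcopy(tolerations)
    return updated


# Trimmed copy of the default manifest shown in this thread.
base_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"namespace": "{{ namespace }}"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "prefect-job", "image": "{{ image }}"}],
                "restartPolicy": "Never",
            }
        },
        "backoffLimit": 0,
    },
}

custom_manifest = with_scheduling(
    base_manifest,
    node_selector={"workload": "prefect"},  # hypothetical node label
    tolerations=[
        {
            "key": "dedicated",  # hypothetical taint key
            "operator": "Equal",
            "value": "prefect",
            "effect": "NoSchedule",
        }
    ],
)
```

Working on a deep copy keeps the original manifest reusable for other work pools that should not carry the scheduling constraints.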

Scott Pedersen

07/23/2025, 1:56 AM

@Marvin where do I put `pod_watch_timeout_seconds` and `job_watch_timeout_seconds`?

Marvin

07/23/2025, 1:56 AM

✅ thought for 9.3 seconds

Marvin

07/23/2025, 1:56 AM

The `pod_watch_timeout_seconds` and `job_watch_timeout_seconds` settings are not part of the Kubernetes job manifest itself. They are Prefect-specific configurations that go in the outer `job_configuration` section of the work pool's base job template. Here's the complete structure showing where these timeout settings belong:

yaml
variables:
  type: object
  properties:
    image:
      type: string
      title: Image
      description: The image to use for the job
    namespace:
      type: string
      title: Namespace
      default: default
      description: The Kubernetes namespace to create jobs in
  required:
    - image

job_configuration:
  # Prefect-specific settings go here at this level
  pod_watch_timeout_seconds: 300  # Example: increased to 5 minutes
  job_watch_timeout_seconds: 3600 # Default: 1 hour
  stream_output: true

  # The Kubernetes job manifest goes under job_manifest
  job_manifest:
    apiVersion: batch/v1
    kind: Job
    metadata:
      namespace: "{{ namespace }}"
    spec:
      template:
        spec:
          containers:
            - name: prefect-job
              image: "{{ image }}"
              resources:
                requests:
                  cpu: "100m"
                  memory: "256Mi"
                limits:
                  memory: "512Mi"
          restartPolicy: Never
      backoffLimit: 0

Key points:
1. These timeout settings are Prefect configurations, not Kubernetes configurations
2. They should be at the same level as `job_manifest` within the `job_configuration` section
3. They are not part of the Kubernetes YAML specification

When creating or updating a work pool, you would save this entire configuration (including both the Prefect-specific settings and the Kubernetes job manifest) as a single YAML file.
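One way to catch a misplaced setting before saving the template is a quick check that none of the Prefect-level keys ended up inside `job_manifest`. The sketch below assumes the template has already been loaded into a Python dictionary; `find_misplaced_keys` is a hypothetical helper, and the key list is taken from the settings named in this thread.

```python
PREFECT_LEVEL_KEYS = (
    "pod_watch_timeout_seconds",
    "job_watch_timeout_seconds",
    "stream_output",
)


def find_misplaced_keys(node, path="job_manifest"):
    """Return dotted paths of Prefect-level keys found inside the manifest."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in PREFECT_LEVEL_KEYS:
                found.append(child)
            found.extend(find_misplaced_keys(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_misplaced_keys(item, f"{path}[{index}]"))
    return found


# Example with the timeout wrongly nested under the Kubernetes spec.
job_configuration = {
    "pod_watch_timeout_seconds": 300,
    "job_manifest": {
        "spec": {"pod_watch_timeout_seconds": 300, "backoffLimit": 0},
    },
}

misplaced = find_misplaced_keys(job_configuration["job_manifest"])
```

An empty result means the timeouts sit only at the `job_configuration` level, as siblings of `job_manifest`.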
