vineeth
08/19/2025, 10:33 AM

Marvin
08/19/2025, 10:36 AM
Add activeDeadlineSeconds to your work pool’s base Job template. Once that deadline elapses (time spent Pending counts toward it), Kubernetes marks the Job as failed and Prefect will reflect the failure on the flow run.
- Example (Job spec level):
  spec:
    activeDeadlineSeconds: 1800  # 30 minutes
    backoffLimit: 0              # no Kubernetes-level pod retries; let Prefect retries re-submit
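  For placement, here’s a hedged sketch of how that nests inside a Kubernetes work pool’s base job template — the real template is JSON (export it with prefect work-pool get-default-base-job-template --type kubernetes, then re-apply with prefect work-pool update "my-pool" --base-job-template template.json; pool and file names are placeholders), shown as YAML for readability:
  job_configuration:
    job_manifest:
      apiVersion: batch/v1
      kind: Job
      spec:
        activeDeadlineSeconds: 1800  # hard wall-clock cap; Pending time counts
        backoffLimit: 0              # fail fast instead of re-running the pod
        template:
          spec:
            ...  # keep your existing containers, restartPolicy, etc.
  Verify the field names against your exported template before applying; only spec-level keys shown here are standard Kubernetes Job fields.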
- Use Prefect Automations to detect “hasn’t started after N minutes” and cancel/notify
- Pair this with flow retries so the run tries again later when capacity frees up:
  from prefect import flow

  @flow(retries=3, retry_delay_seconds=600)  # up to 3 retries, 10 minutes apart
  def my_flow():
      ...
- Set a concurrency limit on your work pool so the worker doesn’t launch more Jobs than your cluster can schedule (e.g., prefect work-pool set-concurrency-limit "my-pool" 10 — pool name and limit are placeholders).
- Fix it at the cluster level: raise max pods per node, add nodes or enable the cluster autoscaler, or use priority classes/preemption so these pods aren’t starved (see the PriorityClass sketch below).
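  If you go the priority-class route, a minimal sketch — the name prefect-flow-runs and the value are placeholders to adapt to your cluster:
  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: prefect-flow-runs  # placeholder name
  value: 1000000             # higher value = scheduled first, may preempt lower-priority pods
  globalDefault: false
  description: "Keep Prefect flow-run pods from being starved"
  Then reference it from the Job manifest in your base job template via spec.template.spec.priorityClassName: prefect-flow-runs.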
Helpful docs:
- Workers and work pools: https://docs-3.prefect.io/v3/concepts/workers and https://docs-3.prefect.io/v3/concepts/work-pools
- Automations: https://docs-3.prefect.io/v3/concepts/automations
- Kubernetes Job termination (activeDeadlineSeconds): https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
If you can share:
- Prefect version (assuming 3.x)
- Worker type (Kubernetes work pool?)
- Exact run state transitions and any worker logs
…I can be more precise, especially about the “crashing” you mentioned.