
Tom Klein

08/27/2023, 7:47 PM
Looking for some help here regarding OOMKilled pods of flow runs on k8s. So, putting aside for a moment the reason why it's OOM to begin with: what happens is that the flow almost completely succeeds (or completely succeeds), but somehow still finds itself being run over and over and over to no end. There are no retries defined on the flow itself, and the kube manifest yaml is set to restartPolicy = never.
Yet still, we get this: in the agent log we see a 1.5 hour gap between flow submission and the "code 137" crash, despite the flow itself only lasting 15 minutes, which would seem to indicate that it's k8s that's restarting it and not reporting it as failed to the agent (?). Why is the job seemingly in a backoff loop if its infra is set to restartPolicy = never?
{"timestamp":1693165310104,"log":"19:41:50.103 | INFO    | prefect.agent - Reported flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0' as crashed: Flow run infrastructure exited with non-zero status code 137.","stream":"stderr","time":"2023-08-27T19:41:50.104218118Z"}

{"timestamp":1693165309975,"log":"19:41:49.974 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'process-first-response-chunk-10-j4jgq': Job reached backoff limit.","stream":"stderr","time":"2023-08-27T19:41:49.975381483Z"}

{"timestamp":1693160108130,"log":"18:15:08.129 | INFO    | prefect.agent - Completed submission of flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0'","stream":"stderr","time":"2023-08-27T18:15:08.130082579Z"}


{"timestamp":1693160107718,"log":"18:15:07.718 | INFO    | prefect.agent - Submitting flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0'","stream":"stderr","time":"2023-08-27T18:15:07.718922666Z"}
After reading a bit more, it looks like it might not be related to Prefect at all but to one of k8s's quirks: restartPolicy = never applies to the Pod but not to the Job entity, so k8s keeps creating new Pods instead of restarting the containers in the existing pod (if I understand correctly). Does anyone have any idea what we need to do to make sure that OOMKilled doesn't lead to restarts? https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy
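(For reference, a minimal sketch of where the two settings live in a batch/v1 Job manifest; the names, image, and memory limit are illustrative, not taken from the actual deployment. restartPolicy is a Pod-level field, while Pod recreation is governed by the Job-level backoffLimit, which defaults to 6.)

apiVersion: batch/v1
kind: Job
metadata:
  name: example-flow-run             # illustrative name
spec:
  backoffLimit: 6                    # Job-level: how many failed Pods the Job controller creates before giving up (default 6)
  template:
    spec:
      restartPolicy: Never           # Pod-level: the kubelet won't restart containers in place,
                                     # but the Job controller still replaces the whole Pod on failure
      containers:
      - name: flow                   # illustrative container name
        image: my-flow-image:latest  # illustrative image
        resources:
          limits:
            memory: 2Gi              # exceeding this limit is what produces OOMKilled (exit code 137)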

Oliver Mannion

09/02/2023, 4:28 AM
spec.backoffLimit is the number of retries before the Job fails and stops, so you could try setting this to 0 https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy
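(A sketch of that suggestion in a plain batch/v1 Job spec; whether and how this gets set through Prefect's infrastructure configuration depends on your setup.)

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 0                    # the first failed Pod marks the whole Job as Failed; no replacement Pod is created
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: flow                   # illustrative container name
        image: my-flow-image:latest  # illustrative image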

Tom Klein

09/02/2023, 10:41 AM
@Oliver Mannion but we don't want to limit retries in case of REAL transient issues (e.g. evictions, or other infra stuff), we just want OOMKilled to not cause a retry (since it's pretty obvious that either there isn't enough memory for the task, or there's some problem with it). In other words, memory issues are (as far as we're concerned) typically not transient. We want more fine-grained control over which type of crash/error causes a k8s retry and which doesn't.

Oliver Mannion

09/02/2023, 11:58 AM
Ah, in that case you could try using a pod failure policy that fails the Job on exit code 137 (i.e. OOMKilled):
podFailurePolicy:
  rules:
  - action: FailJob              # fail the Job (no further Pod retries) when the rule matches
    onExitCodes:
      containerName: main        # optional; omit to match any container in the Pod
      operator: In               # one of: In, NotIn
      values: [137]              # 137 = killed by SIGKILL, which is what OOMKilled produces
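(Building on that, a sketch that also covers the earlier point about genuinely transient infra issues: rules are evaluated in order and the first match wins, so OOMKilled fails the Job immediately while evictions and other node disruptions are ignored and don't count against backoffLimit. This assumes the pod template uses restartPolicy: Never, which podFailurePolicy requires, and a cluster recent enough to support pod failure policies (Kubernetes 1.26+). The container name is an assumption; match it to the one in your job manifest.)

spec:
  backoffLimit: 6                    # other failures still get retried up to this limit
  podFailurePolicy:
    rules:
    - action: FailJob                # OOMKilled: fail the Job (and hence the flow run) right away
      onExitCodes:
        containerName: main          # assumed name; must match the container in your manifest
        operator: In
        values: [137]
    - action: Ignore                 # evictions / node disruptions: recreate the Pod without
      onPodConditions:               # counting the failure against backoffLimit
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never           # required for podFailurePolicy to be accepted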

Tom Klein

09/02/2023, 4:12 PM
Ya, ok. So I wasn't sure if that would do what I intended it to (or affect other types of failures/crashes)… So, what this effectively does is fail the JOB (and therefore the Prefect flow?) immediately, regardless of the backoff defined?

Oliver Mannion

09/02/2023, 11:52 PM
Yeh it looks like it (I haven’t tried it myself)

Tom Klein

09/03/2023, 7:01 AM
thanks, we’ll try it then