
Tom Klein

08/27/2023, 7:47 PM
Looking for some help here regarding OOMKilled pods of flow runs on k8s. So, putting aside for a moment the reason why it's OOM to begin with: what happens is that the flow almost completely succeeds (or completely succeeds), but somehow still finds itself being run over and over and over to no end. There are no retries defined on the flow itself, and the kube manifest yaml is set to restartPolicy = never.
Yet still, we get this: in the agent log we see a 1.5 hour gap between flow submission and the "code 137" crash, despite the flow itself only lasting 15 minutes, which would seem to indicate that it's k8s that's restarting it and not reporting it as failed to the agent (?). Why is the job seemingly in a backoff loop if its infra is set to restartPolicy = never?
{"timestamp":1693165310104,"log":"19:41:50.103 | INFO    | prefect.agent - Reported flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0' as crashed: Flow run infrastructure exited with non-zero status code 137.","stream":"stderr","time":"2023-08-27T19:41:50.104218118Z"}

{"timestamp":1693165309975,"log":"19:41:49.974 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'process-first-response-chunk-10-j4jgq': Job reached backoff limit.","stream":"stderr","time":"2023-08-27T19:41:49.975381483Z"}

{"timestamp":1693160108130,"log":"18:15:08.129 | INFO    | prefect.agent - Completed submission of flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0'","stream":"stderr","time":"2023-08-27T18:15:08.130082579Z"}


{"timestamp":1693160107718,"log":"18:15:07.718 | INFO    | prefect.agent - Submitting flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0'","stream":"stderr","time":"2023-08-27T18:15:07.718922666Z"}
After reading a bit more, it looks like it might not be related to Prefect at all but to one of k8s's quirks: restartPolicy = never applies to the Pod but not to the Job entity, so k8s keeps creating new Pods instead of restarting the containers in the existing pod (if I understand correctly). Does anyone have any idea what we need to do to make sure that OOMKilled doesn't lead to restarts? https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy
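(For reference, a minimal sketch of where the two settings live in a batch/v1 Job manifest; the names, image, and memory limit are illustrative, not taken from the actual deployment. restartPolicy is a Pod-level field, while Pod recreation is governed by the Job-level backoffLimit, which defaults to 6.)

apiVersion: batch/v1
kind: Job
metadata:
  name: example-flow-run             # illustrative name
spec:
  backoffLimit: 6                    # Job-level: how many failed Pods the Job controller creates before giving up (default 6)
  template:
    spec:
      restartPolicy: Never           # Pod-level: the kubelet won't restart containers in place,
                                     # but the Job controller still replaces the whole Pod on failure
      containers:
      - name: flow                   # illustrative container name
        image: my-flow-image:latest  # illustrative image
        resources:
          limits:
            memory: 2Gi              # exceeding this limit is what produces OOMKilled (exit code 137)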

Oliver Mannion

09/02/2023, 4:28 AM
spec.backoffLimit is the number of retries before the Job fails and stops, so you could try setting this to 0 https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy
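(A sketch of that suggestion in a plain batch/v1 Job spec; whether and how this gets set through Prefect's infrastructure configuration depends on your setup.)

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 0                    # the first failed Pod marks the whole Job as Failed; no replacement Pod is created
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: flow                   # illustrative container name
        image: my-flow-image:latest  # illustrative image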

Tom Klein

09/02/2023, 10:41 AM
@Oliver Mannion but we don't want to limit retries in case of REAL transient issues (e.g. evictions, or other infra stuff), we just want OOMKilled to not cause a retry (since it's pretty obvious that either there isn't enough memory for the task, or there's some problem with it). In other words, memory issues are (as far as we're concerned) typically not transient. We want more fine-grained control over which type of crash/error causes a k8s retry and which doesn't.

Oliver Mannion

09/02/2023, 11:58 AM
Ah, in that case you could try using a pod failure policy that fails the Job on exit code 137 (i.e. OOMKilled):
podFailurePolicy:
  rules:
  - action: FailJob              # fail the Job (no further Pod retries) when the rule matches
    onExitCodes:
      containerName: main        # optional; omit to match any container in the Pod
      operator: In               # one of: In, NotIn
      values: [137]              # 137 = killed by SIGKILL, which is what OOMKilled produces
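(Building on that, a sketch that also covers the earlier point about genuinely transient infra issues: rules are evaluated in order and the first match wins, so OOMKilled fails the Job immediately while evictions and other node disruptions are ignored and don't count against backoffLimit. This assumes the pod template uses restartPolicy: Never, which podFailurePolicy requires, and a cluster recent enough to support pod failure policies (Kubernetes 1.26+). The container name is an assumption; match it to the one in your job manifest.)

spec:
  backoffLimit: 6                    # other failures still get retried up to this limit
  podFailurePolicy:
    rules:
    - action: FailJob                # OOMKilled: fail the Job (and hence the flow run) right away
      onExitCodes:
        containerName: main          # assumed name; must match the container in your manifest
        operator: In
        values: [137]
    - action: Ignore                 # evictions / node disruptions: recreate the Pod without
      onPodConditions:               # counting the failure against backoffLimit
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never           # required for podFailurePolicy to be accepted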

Tom Klein

09/02/2023, 4:12 PM
Ya, ok. So I wasn't sure if that would do what I intended it to (or affect other types of failures/crashes)… So, what this effectively does is fail the JOB (and therefore the Prefect flow?) immediately, regardless of the backoff defined?

Oliver Mannion

09/02/2023, 11:52 PM
Yeh it looks like it (I haven’t tried it myself)

Tom Klein

09/03/2023, 7:01 AM
thanks, we’ll try it then