Tom Klein
08/27/2023, 7:47 PM
OOMKilled pods of flow runs on k8s.
So, putting aside for a moment the reason why it's OOM to begin with: what happens is that the flow almost completely succeeds (or completely succeeds), but somehow still gets run over and over with no end.
There are no retries defined on the flow itself, and the kube manifest YAML sets restartPolicy: Never.
Yet still, we get this:
{"timestamp":1693165310104,"log":"19:41:50.103 | INFO | prefect.agent - Reported flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0' as crashed: Flow run infrastructure exited with non-zero status code 137.","stream":"stderr","time":"2023-08-27T19:41:50.104218118Z"}
{"timestamp":1693165309975,"log":"19:41:49.974 | ERROR | prefect.infrastructure.kubernetes-job - Job 'process-first-response-chunk-10-j4jgq': Job reached backoff limit.","stream":"stderr","time":"2023-08-27T19:41:49.975381483Z"}
{"timestamp":1693160108130,"log":"18:15:08.129 | INFO | prefect.agent - Completed submission of flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0'","stream":"stderr","time":"2023-08-27T18:15:08.130082579Z"}
{"timestamp":1693160107718,"log":"18:15:07.718 | INFO | prefect.agent - Submitting flow run '7ac9b4e9-87d5-42ab-8598-996a024dc7c0'","stream":"stderr","time":"2023-08-27T18:15:07.718922666Z"}
Could it be that restartPolicy: Never applies to the pod but not the Job entity, so k8s keeps creating new Pods instead of restarting the containers in the pod (if I understand correctly)?
Does anyone have any idea what we need to do to make sure that OOMKilled doesn't lead to restarts?
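The pod-versus-Job distinction raised above can be sketched as a manifest. In Kubernetes, restartPolicy on the pod template only controls whether a failed container is restarted in place, while the Job's own backoffLimit (default 6) controls how many times the Job controller creates a replacement Pod. That is consistent with the "Job reached backoff limit" line in the logs. A minimal sketch that disables Job-level retries entirely follows; the Job name, container name, and image are placeholders, and how this reaches the manifest Prefect generates is not covered in the thread.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-flow-run            # hypothetical name
spec:
  backoffLimit: 0                   # Job level: create no replacement Pods after a failure (default is 6)
  template:
    spec:
      restartPolicy: Never          # Pod level: do not restart the failed container in place
      containers:
      - name: flow-run              # placeholder container name
        image: example/flow:latest  # placeholder image
```

This is the blunt option: with backoffLimit: 0, every kind of failure ends the Job, not only OOM kills.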
https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy
Oliver Mannion
09/02/2023, 4:28 AM
Tom Klein
09/02/2023, 10:41 AM
We want OOMKilled to not cause a retry (since it's pretty obvious that either there isn't enough memory for the task, or there's some problem with it).
In other words, memory issues are (as far as we're concerned) typically not transient.
We want more fine-tuned control over which type of crash/error causes a k8s retry and which doesn't.
Oliver Mannion
09/02/2023, 11:58 AM
podFailurePolicy:
  rules:
  - action: FailJob
    onExitCodes:
      containerName: main   # optional
      operator: In          # one of: In, NotIn
      values: [137]
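For context, here is a sketch of where that fragment sits in a full Job spec, assuming a cluster version in which pod failure policy is available. The Job name, container name, and image are placeholders, and the second rule is an illustrative addition taken from the pattern in the Kubernetes docs, not something proposed in the thread. Note that exit code 137 means the container received SIGKILL (128 + 9), which covers OOM kills but can also have other causes.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-flow-run            # hypothetical name
spec:
  backoffLimit: 6                   # failures not matched by a rule below are still retried
  podFailurePolicy:
    rules:
    - action: FailJob               # exit code 137 (SIGKILL, e.g. OOMKilled): fail the Job immediately
      onExitCodes:
        containerName: flow-run     # optional; must match a container name below
        operator: In                # one of: In, NotIn
        values: [137]
    - action: Ignore                # disruptions (e.g. node drain) do not count toward backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never          # required when podFailurePolicy is set
      containers:
      - name: flow-run              # placeholder container name
        image: example/flow:latest  # placeholder image
```

Rules are evaluated in order and the first match wins, so this keeps retries for other failures while treating a SIGKILL exit as terminal.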
Tom Klein
09/02/2023, 4:12 PM
Oliver Mannion
09/02/2023, 11:52 PM
Tom Klein
09/03/2023, 7:01 AM