Mike O'Connor

12/15/2022, 11:56 AM
Hi 👋. I’m having trouble with Prefect Agents not reporting failed KubernetesJobs back to Prefect Cloud. I’m specifically trying to ensure pods do not consume too much memory, so I have the following test.
expected behaviour:
• Given a job with pod configuration to limit memory to 1GB.
• When I run a flow which allocates 2GB of memory
• The flow run fails.
actual behaviour:
• Kubernetes correctly kills the pod, with OOMKilled status.
• The agent detects the job failed.
• But it does not report it as failed to Prefect Cloud, it stays as ‘running’ forever.
agent log snippet:
11:32:09.162 | INFO    | prefect.infrastructure.kubernetes-job - Job 'prefect-standard-kubernetes-jobdmmq2': Pod has status 'Pending'.
11:32:11.003 | INFO    | prefect.infrastructure.kubernetes-job - Job 'prefect-standard-kubernetes-jobdmmq2': Pod has status 'Running'.
11:32:21.064 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'prefect-standard-kubernetes-jobdmmq2': Job did not complete.
pod snippet:
prefect-standard-kubernetes-jobdmmq2-cpfz4           0/1     OOMKilled   0          23m
Is there some configuration I need to do here, or is this unsupported, or a bug?
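
For anyone debugging the same symptom, here is a minimal sketch of how the OOMKilled reason shown in the pod snippet could be read from a pod object. It assumes the pod has already been fetched as a plain dictionary (for example from `kubectl get pod -o json`). The helper name `oom_killed_containers`, the pod name, and the container name are hypothetical, and this is not how the Prefect agent itself detects failures.

```python
def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose current or last state is
    'terminated' with reason 'OOMKilled'."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # Kubernetes reports the reason under state.terminated, or under
        # lastState.terminated if the container has since been restarted.
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names


# Hypothetical, trimmed-down pod object used only for illustration.
example_pod = {
    "metadata": {"name": "example-pod"},
    "status": {
        "phase": "Failed",
        "containerStatuses": [
            {
                "name": "example-container",
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ],
    },
}

print(oom_killed_containers(example_pod))
```

A check like this only confirms what Kubernetes recorded; whether the flow run state is updated in Prefect Cloud is a separate matter.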


12/15/2022, 1:12 PM
Does Kubernetes also restart the failed pod for you as it does for me? (And then Prefect complains that it can't move run from running state to running state, so nothing really happens.) I added
{"op": "add", "path": "/spec/backoffLimit", "value": 0},
to the
to stop this restart, and only then did Prefect Cloud correctly identify the job as failed rather than running.
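
To illustrate what that patch entry does, here is a rough sketch that applies a JSON Patch style "add" operation to a Job manifest held as a dictionary. The manifest below is a simplified, hypothetical example, and `apply_add_ops` is a hand-written helper that only handles "add" on object keys; it is not a full RFC 6902 implementation and not Prefect's own code.

```python
import copy


def apply_add_ops(manifest: dict, patch: list) -> dict:
    """Apply JSON Patch 'add' operations whose path segments are object keys."""
    result = copy.deepcopy(manifest)  # leave the original manifest untouched
    for op in patch:
        if op["op"] != "add":
            raise ValueError("this sketch only handles 'add' operations")
        *parents, leaf = op["path"].lstrip("/").split("/")
        target = result
        for key in parents:
            target = target[key]  # parent objects must already exist
        target[leaf] = op["value"]
    return result


# Simplified, hypothetical Job manifest used only for illustration.
job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {"restartPolicy": "Never", "containers": [{"name": "example"}]}
        }
    },
}

patch = [{"op": "add", "path": "/spec/backoffLimit", "value": 0}]
patched = apply_add_ops(job_manifest, patch)
print(patched["spec"]["backoffLimit"])
```

With `backoffLimit` set to 0, Kubernetes marks the Job as failed after the first pod failure instead of creating replacement pods (the Kubernetes default is 6 retries).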

Mike O'Connor

12/15/2022, 2:32 PM
yeah i had the same thing!
that’s a great tip, i’ll try that, thanks
hmm, sadly that didn’t work. It stopped it spawning a second pod, but the flow run is still left hanging as “running”
Wanted to come back here and say kubernetes jobs now gracefully die if they run out of memory and report a failure on prefect cloud as of prefect 2.7.10, thanks for fixing!
👍 1