Hi 👋 I’m having trouble with Prefect Agents not reporting... | Prefect Community #ask-community


Mike O'Connor

12/15/2022, 11:56 AM

Hi 👋. I’m having trouble with Prefect Agents not reporting failed KubernetesJobs back to Prefect Cloud. I’m specifically trying to ensure pods do not consume too much memory, so have the following test.

expected behaviour:
• Given a job with pod configuration to limit memory to 1GB.
• When I run a flow which creates 2GB of memory
• The flow run fails.

actual behaviour:
• Kubernetes correctly kills the pod, with OOMKilled status.
• The agent detects the job failed.
• But it does not report it as failed to Prefect Cloud, it stays as ‘running’ forever.

agent log snippet:


11:32:09.162 | INFO    | prefect.infrastructure.kubernetes-job - Job 'prefect-standard-kubernetes-jobdmmq2': Pod has status 'Pending'.
11:32:11.003 | INFO    | prefect.infrastructure.kubernetes-job - Job 'prefect-standard-kubernetes-jobdmmq2': Pod has status 'Running'.
11:32:21.064 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'prefect-standard-kubernetes-jobdmmq2': Job did not complete.

pod snippet:


prefect-standard-kubernetes-jobdmmq2-cpfz4           0/1     OOMKilled   0          23m

Is there some configuration I need to do to here, or is this unsupported, or a bug?
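For readers who want to reproduce the test described above, here is a minimal sketch of the two pieces involved: a customization that caps the container's memory, and a function that allocates more than the cap. The names `MEMORY_LIMIT_PATCH`, `allocate`, and `build_job` are hypothetical and do not come from the thread. The sketch assumes the Prefect 2.x `KubernetesJob` block from `prefect.infrastructure` and assumes the flow container is the first container in the job template.

```python
# Hypothetical reproduction sketch; none of these names appear in the thread.

# JSON Patch entry that sets a memory limit on the first container of the
# job's pod template (assumption: the flow container is at index 0).
MEMORY_LIMIT_PATCH = [
    {
        "op": "add",
        "path": "/spec/template/spec/containers/0/resources",
        "value": {"limits": {"memory": "1Gi"}},
    }
]


def allocate(n_bytes: int) -> bytes:
    """Build and return a bytes object of n_bytes so the memory is really used."""
    return b"x" * n_bytes


def build_job():
    """Create the infrastructure block; needs Prefect 2.x installed."""
    # Imported lazily so the rest of the sketch runs without Prefect.
    from prefect.infrastructure import KubernetesJob

    return KubernetesJob(customizations=MEMORY_LIMIT_PATCH)


# Inside the flow under test, calling allocate(2 * 1024**3) would request
# about 2 GiB, which exceeds the 1Gi limit above.
```

In the actual test, `allocate` would be called from a function decorated with Prefect's `@flow` and deployed with the block returned by `build_job()`.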

quassy

12/15/2022, 1:12 PM

Does Kubernetes also restart the failed pod for you as it does for me? (And then Prefect complains that it can't move run from running state to running state, so nothing really happens.) I added

{"op": "add", "path": "/spec/backoffLimit", "value": 0},

to the

KubernetesJob(customizations=...)

to stop this restart, and only then did Prefect Cloud correctly identify the job as failed rather than running.
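Put together, the suggestion above might look like the sketch below. Kubernetes defaults a Job's `backoffLimit` to 6, which is why a failed pod gets replaced unless the limit is lowered. The helper `apply_add` is a hypothetical, simplified illustration of what a JSON Patch `add` operation does to a manifest; it is not how Prefect applies customizations internally. `NO_RETRY_PATCH` and `build_job_without_retries` are also illustrative names, and the import assumes Prefect 2.x.

```python
import copy

# The customization from the message above, as a one-entry JSON Patch list.
NO_RETRY_PATCH = [{"op": "add", "path": "/spec/backoffLimit", "value": 0}]


def apply_add(manifest: dict, patch: dict) -> dict:
    """Illustration only: apply one 'add' op whose path walks dict keys."""
    result = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    *parents, leaf = patch["path"].lstrip("/").split("/")
    node = result
    for key in parents:
        node = node[key]
    node[leaf] = patch["value"]  # 'add' creates the key or replaces its value
    return result


def build_job_without_retries():
    """Create the infrastructure block; needs Prefect 2.x installed."""
    from prefect.infrastructure import KubernetesJob

    return KubernetesJob(customizations=NO_RETRY_PATCH)


# Effect on a stripped-down Job manifest (hypothetical example data):
patched = apply_add({"spec": {"template": {}}}, NO_RETRY_PATCH[0])
print(patched["spec"]["backoffLimit"])
```

As the follow-up messages note, this stops Kubernetes from spawning a replacement pod, but on its own it did not fix the flow run being stuck as "running" for the original poster.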

Mike O'Connor

12/15/2022, 2:32 PM

yeah i had the same thing!

Mike O'Connor

12/15/2022, 2:33 PM

that’s a great tip, i’ll try that, thanks

Mike O'Connor

12/15/2022, 3:43 PM

hmm, sadly that didn’t work. It stopped it spawning a second pod, but the flow run is still left hanging as “running”

Mike O'Connor

02/06/2023, 5:21 PM

Wanted to come back here and say Kubernetes jobs now die gracefully if they run out of memory and report a failure on Prefect Cloud as of Prefect 2.7.10. Thanks for fixing!

👍 1
