Eric
03/01/2024, 9:16 PM
Job 'piquant-stingray-49qqg': Job reached backoff limit.
what might have caused this?
Marvin
03/01/2024, 9:16 PM
The KubernetesJob infrastructure was waiting for event["object"].status.completion_time, which would never be set because the job had exceeded its backoff limit due to OOMKills.
This issue was addressed in a pull request (#10125) where the KubernetesJob block was updated to compare the number of job failures to its backoff limit. If the number of job failures exceeds the backoff limit, the agent stops watching the job and returns the status of the most recently started job pod, instead of waiting indefinitely for a completion time.
Moreover, a pull request (#10311) was made to explicitly return status code -1 if a Kubernetes job backoff limit has been reached.
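As a rough illustration of the check described above, here is a minimal sketch in plain Python. It is not Prefect's actual code: the function name, the constant name, and passing the failure count in as a plain integer are all assumptions made for this example.

```python
from typing import Optional

# Status code the thread mentions for a job that has reached its backoff limit.
BACKOFF_LIMIT_EXIT_CODE = -1


def backoff_limit_status(failed: Optional[int], backoff_limit: int) -> Optional[int]:
    """Hypothetical helper: decide whether to stop watching a job.

    Returns -1 once the number of failures exceeds the backoff limit,
    otherwise None, meaning the caller should keep watching.
    """
    # The failure count may be missing (None) before any pod has failed.
    failures = failed or 0
    if failures > backoff_limit:
        return BACKOFF_LIMIT_EXIT_CODE
    return None
```

A watcher loop could call this on every job event and return as soon as the result is not None, rather than waiting for a completion time that never arrives.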
To avoid this in future flow runs, make sure the flow stays within its resource allocation, or set memory requests and limits that match the flow's actual requirements. Upgrading to a recent version of Prefect that contains these fixes may also prevent the flow run from staying in the "Running" state indefinitely when such errors occur.
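For the memory settings, a sketch of what that could look like on the Kubernetes side is shown here. The helper name, the container name, the image, and the 1Gi/2Gi values are placeholders chosen for this example; how the manifest is passed to your infrastructure block depends on your Prefect version and is not shown.

```python
import copy


def set_memory(job_manifest: dict, request: str, limit: str) -> dict:
    """Hypothetical helper: return a copy of a Job manifest with a memory
    request and limit set on its first container."""
    patched = copy.deepcopy(job_manifest)  # leave the caller's manifest untouched
    container = patched["spec"]["template"]["spec"]["containers"][0]
    resources = container.setdefault("resources", {})
    resources.setdefault("requests", {})["memory"] = request
    resources.setdefault("limits", {})["memory"] = limit
    return patched


# Minimal Job manifest; name and image are placeholders.
base_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "backoffLimit": 0,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "flow", "image": "example/image:tag"}],
            }
        },
    },
}

patched_manifest = set_memory(base_manifest, request="1Gi", limit="2Gi")
```

If the limit is lower than the flow's peak memory use, the container is still likely to be OOMKilled, so it helps to measure actual usage before picking values.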