<@ULVA73B9P> I have a simple flow that users can c...
# ask-marvin
e
@Marvin I have a simple flow that users can call. However, when a user tried to create many flow runs, some of the flow runs failed with this error:
Job 'piquant-stingray-49qqg': Job reached backoff limit.
what might have caused this?
m
The error message "Job reached backoff limit" usually means a Kubernetes job failed to complete successfully after multiple attempts. According to issues and pull requests in Prefect's GitHub repository, this can happen for various reasons, such as a container getting OOMKilled (Out Of Memory) when the flow code exceeds its allocated memory, which leads the Kubernetes Job controller to eventually fail the job with the status "BackoffLimitExceeded".

GitHub issue #9647 describes a flow run that got stuck in the "Running" state because Prefect's `KubernetesJob` infrastructure was waiting for `event["object"].status.completion_time`, which would never be set since the job had exceeded its backoff limit due to OOMKills. Pull request #10125 addressed this by updating the `KubernetesJob` block to compare the number of job failures to its backoff limit. If the number of job failures exceeded the backoff limit, the agent would stop watching the job and return the status of the most recently started job pod, preventing the loop. A further pull request (#10311) made the status code `-1` be returned explicitly when a Kubernetes job backoff limit has been reached.

To resolve this for future flow runs, ensure the flow does not exceed its resource allocations, or configure memory requests and limits to match the flow's requirements more accurately. Updating to a recent version of Prefect that contains these fixes may also prevent the indefinite "Running" state if such errors occur.