Anyone have a recommendation on backoffLimit for job pods? (Prefect Community, #prefect-kubernetes)


05/22/2024, 6:07 PM

Anyone have a recommendation on backoffLimit for job pods? Looks like the default setting is 0, so if the pod crashes k8s won't attempt to create it again. I imagine flow retries would kick in here? Wondering how flow retries and backoffLimit relate

Max Eggers

05/22/2024, 9:57 PM

Flow retries don't apply to crashes unfortunately

Max Eggers

05/22/2024, 9:59 PM

From my perspective my org doesn't have a great path forward until we can use Pod Failure Policies: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

👀 1
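
For readers unfamiliar with the linked feature, here is a rough sketch of a plain Kubernetes Job spec that uses a pod failure policy, built as a Python dict. The function name, the container name `flow`, and the default `backoff_limit` of 3 are made up for illustration. This is not the manifest the Prefect worker generates, and whether the worker copes with a non-zero `backoffLimit` is exactly the open question in this thread.

```python
import json


def job_manifest_with_failure_policy(name: str, image: str, backoff_limit: int = 3) -> dict:
    """Build an illustrative batch/v1 Job dict (hypothetical helper, not a Prefect API)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Retries allowed for failures that are not covered by a rule below.
            "backoffLimit": backoff_limit,
            "podFailurePolicy": {
                "rules": [
                    {
                        # Pods killed by a disruption (eviction, preemption) are
                        # replaced without counting against backoffLimit.
                        "action": "Ignore",
                        "onPodConditions": [{"type": "DisruptionTarget"}],
                    }
                ]
            },
            "template": {
                "spec": {
                    # podFailurePolicy requires restartPolicy: Never.
                    "restartPolicy": "Never",
                    "containers": [{"name": "flow", "image": image}],
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(job_manifest_with_failure_policy("example-job", "example/image:tag"), indent=2))
```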

06/03/2024, 2:36 PM

Interesting @Max Eggers. I'm seeing roughly 0.5% of flow runs crash because of "infra issues". Curious if you're experiencing a similar %? Does anyone have recommendations for setting backoffLimit? Ideally if a pod crashes (for any reason), I want the pod to be recreated at least once and the flow executed.

Kevin Grismore

06/03/2024, 2:43 PM

hm, this is a tough one. we want backoff limit at 0 because of how flow runs and their identifiers are represented through our API, and because of how the k8s worker watches running jobs for terminal states. The "retry" button on the flow run page pretty much just puts that flow run id back in a scheduled state, letting the worker pick it back up and create a new k8s job whose flow run id will be the same as the previous attempt

06/03/2024, 2:45 PM

So for those 0.5% of cases you recommend clicking retry? How can I automate that 🤔

Kevin Grismore

06/03/2024, 2:45 PM

you could use an automation with the "Change flow run's state" action and set the state to "Scheduled"

Kevin Grismore

06/03/2024, 2:46 PM

but it feels a little funny that there isn't just an option called "retry flow run" alongside the "run a deployment" option. maybe there's a good reason for that but I'm not entirely sure

06/03/2024, 2:49 PM

I guess there had to be some way to separate flow-run retries from a re-submit flow aka set to scheduled

Kevin Grismore

06/03/2024, 2:49 PM

since task result persistence is on a per-flow-run basis, retrying is the thing we actually want. "run a deployment" will start a whole new flow run with a new id, so then caching would need to be enabled, which imo shouldn't be necessary to get the desired behavior

Max Eggers

06/03/2024, 2:52 PM

Nice I didn't realize that https://github.com/PrefectHQ/prefect/issues/11285 had been implemented. I think there is technically a possibility of an infinite retry though?

Kevin Grismore

06/03/2024, 2:52 PM

yeah that occurred to me too

Max Eggers

06/03/2024, 2:52 PM

We are looking at retrying from another flow that gets kicked off based on state changes of the first flow. So then on crashed we could examine retries and conditionally retry
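
The "examine retries and conditionally retry" idea could look roughly like the check below, which also caps the infinite-retry risk mentioned above. It is only a sketch: `FlowRunSnapshot`, `should_reschedule`, and `max_attempts` are hypothetical names, and it assumes the watcher flow can read the crashed run's state name and an attempt counter. The call that actually sets the run back to Scheduled is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRunSnapshot:
    """Hypothetical minimal view of a flow run; not a Prefect type."""

    state_name: str  # e.g. "Crashed", "Failed", "Completed"
    run_count: int   # assumed: number of attempts made so far


def should_reschedule(run: FlowRunSnapshot, max_attempts: int = 2) -> bool:
    """Return True only for crashed runs that still have attempts left."""
    return run.state_name == "Crashed" and run.run_count < max_attempts


if __name__ == "__main__":
    first_crash = FlowRunSnapshot(state_name="Crashed", run_count=1)
    second_crash = FlowRunSnapshot(state_name="Crashed", run_count=2)
    print(should_reschedule(first_crash))   # attempt budget not yet used up
    print(should_reschedule(second_crash))  # budget exhausted, stop retrying
```

Bounding on an attempt counter rather than retrying unconditionally is what keeps a run that always crashes from looping forever.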

Max Eggers

06/03/2024, 2:52 PM

But ideally

Max Eggers

06/03/2024, 2:52 PM

I would love

Max Eggers

06/03/2024, 2:52 PM

https://github.com/PrefectHQ/prefect/issues/12988

Max Eggers

06/03/2024, 2:53 PM

cc @George Coyne 😁

Max Eggers

06/03/2024, 2:53 PM

Basically it'd be great if evictions were retried automatically at the infra layer

🙌 1

George Coyne

06/03/2024, 2:53 PM

Working through tests of our asynchronous implementation right now!

🎉 2

Max Eggers

06/03/2024, 2:53 PM

🎉 🎉 🎉 🎉 🎉
