Anyone have a recommendation on backoffLimit for j...
# prefect-kubernetes
k
Anyone have a recommendation on backoffLimit for job pods? Looks like the default setting is 0, so if the pod crashes k8s won't attempt to create it again. I imagine flow retries would kick in here? Wondering how flow retries and backoffLimit relate
m
Flow retries don't apply to crashes unfortunately
From my perspective my org doesn't have a great path forward until we can use Pod Failure Policies: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy
👀 1
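For readers who have not used Pod Failure Policies, here is a rough sketch of the Job spec fields involved, written as a plain Python dict. The helper name `build_job_spec`, the `backoff_limit` value of 2, and the container name and image are assumptions for illustration only. Nothing in this thread confirms that the Prefect worker supports a non-zero backoffLimit, so treat this as a picture of the Kubernetes feature, not a recommended Prefect configuration.

```python
from typing import Any, Dict


def build_job_spec(backoff_limit: int = 2) -> Dict[str, Any]:
    """Return a Job `spec` fragment using a pod failure policy (illustrative only)."""
    return {
        # Assumed value; the thread says the current default is 0.
        "backoffLimit": backoff_limit,
        "podFailurePolicy": {
            # Rules are evaluated in order; the first matching rule applies.
            "rules": [
                {
                    # Pod disruptions (e.g. evictions, preemption) do not
                    # count against backoffLimit; a replacement pod is created.
                    "action": "Ignore",
                    "onPodConditions": [{"type": "DisruptionTarget"}],
                },
                {
                    # Any non-zero container exit code fails the Job
                    # immediately instead of retrying.
                    "action": "FailJob",
                    "onExitCodes": {"operator": "NotIn", "values": [0]},
                },
            ]
        },
        "template": {
            "spec": {
                # Pod failure policies require restartPolicy: Never.
                "restartPolicy": "Never",
                # Hypothetical container; name and image are placeholders.
                "containers": [{"name": "flow", "image": "example/flow:latest"}],
            }
        },
    }


job_spec = build_job_spec()
```

The intent of this ordering is that evictions are retried at the infra layer while failures in the flow's own process still surface as a failed Job.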
k
Interesting @Max Eggers. I'm seeing roughly 0.5% of flow runs crash because of "infra issues". Curious if you're experiencing a similar rate? Does anyone have recommendations for setting backoffLimit? Ideally, if a pod crashes (for any reason), I want the pod to be recreated at least once and the flow executed.
k
hm, this is a tough one. we want backoffLimit at 0 because of how flow runs and their identifiers are represented through our API, and because of how the k8s worker watches running jobs for terminal states. The "retry" button on the flow run page pretty much just puts that flow run id back in a Scheduled state, letting the worker pick it back up and create a new k8s job whose flow run id will be the same as the previous attempt
k
So for those 0.5% of cases you recommend clicking retry? How can I automate that 🤔
k
you could use an automation with the "Change flow run's state" action and set the state to "Scheduled", but it feels a little funny that there isn't just an option called "retry flow run" alongside the "run a deployment" option. maybe there's a good reason for that but I'm not entirely sure
k
I guess there had to be some way to separate flow-run retries from re-submitting a flow run, i.e. setting it back to Scheduled
k
since task result persistence is on a per-flow-run basis, retrying is the thing we actually want. "run a deployment" will start a whole new flow run with a new id, so caching would need to be enabled, which imo shouldn't be necessary to get the desired behavior
m
Nice, I didn't realize that https://github.com/PrefectHQ/prefect/issues/11285 had been implemented. I think there is technically a possibility of an infinite retry loop though?
k
yeah that occurred to me too
m
We are looking at retrying from another flow that gets kicked off based on state changes of the first flow. Then, when the first flow is Crashed, we could check how many retries it has already had and conditionally retry
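A minimal sketch of that conditional check, which also caps the infinite-retry risk mentioned above. Everything here is hypothetical: `FlowRunInfo`, `should_reschedule`, and `max_crash_retries` are illustrative names, not Prefect APIs, and how the crash-retry count gets tracked (tags, a counter stored elsewhere, etc.) is left open.

```python
from dataclasses import dataclass


@dataclass
class FlowRunInfo:
    """Hypothetical, minimal view of a flow run as seen by the watcher flow."""

    state_name: str     # e.g. "Crashed", "Failed", "Completed"
    crash_retries: int  # how many times this run was already rescheduled after a crash


def should_reschedule(run: FlowRunInfo, max_crash_retries: int = 1) -> bool:
    """Return True only for crashed runs that still have crash-retry budget left."""
    if run.state_name != "Crashed":
        # Ordinary failures are left to normal flow retries.
        return False
    # The cap is what prevents rescheduling the same run forever.
    return run.crash_retries < max_crash_retries


first_crash = FlowRunInfo(state_name="Crashed", crash_retries=0)
second_crash = FlowRunInfo(state_name="Crashed", crash_retries=1)
print(should_reschedule(first_crash), should_reschedule(second_crash))
```

With the assumed default of one crash retry, the first crash is rescheduled and a repeat crash is left alone for a human to look at.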
But ideally
I would love
cc @George Coyne 😁
Basically it'd be great if evictions were retried automatically at the infra layer
🙌 1
g
Working through tests of our asynchronous implementation right now!
🎉 2
m
🎉 🎉 🎉 🎉 🎉