# prefect-kubernetes
m
Is there a way to add retries around k8s worker job submission? I sometimes get transient errors during job submission when the worker can't talk to k8s. I don't see any worker or work pool configuration for this, but maybe I am missing it. I guess I could add an automation that responds to Crashed flows?
n
yea, this is probably what I would reach for first
add an automation that responds to Crashed flows
i would just be careful about systematic failures in submission, so you don't enter some bad loop. perhaps have some threshold of crashed events, and a notification after some number of failed submissions? or something
m
Yeah, that is what I'm worried about! Automations seem a little heavy-handed for a transient HTTPS issue... I was looking to see if the Python k8s client supported retry configuration natively but came up empty-handed
n
hmmm - yeah, that SSL error is a little opaque to me 🧐
🕶️ 1
m
Ideally I'd like auto-resolution vs. notification / investigation / manual resolution, but point taken
👍 1
n
hmm, tomorrow i can ask our platform folks / k8s wizards how they typically do this, they might have a k8s-oriented strategy that's less heavy-handed - feel free to prod here if i forget
m
Sounds good, thank you!!
gentle bump if anyone has any thoughts!
n
im not sure we have a structured recommendation on how to handle that in general (outside of a careful reactive automation to re-submit). one idea i just had though: for flows where this submission is flaky, say you have some deployment `foo`, you could have a wrapping `dispatcher` flow (on long-lived infra perhaps? otherwise this one might have the same problem 🙂) that runs whenever `foo` was supposed to, and all it does is call `run_deployment`, check if it got submitted, and implement logic to handle it when it doesn't. depending on the scale of your submission problem, this might be overkill, but it would be an explicit way to have full control over submission to infrastructure / retries of that
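something like this very rough sketch - untested, and `my-flow/foo` plus the attempt count are placeholders for whatever your actual deployment and tolerance look like:
```python
# rough sketch of the dispatcher idea -- not tested, adjust names/limits for your setup
from prefect import flow, get_run_logger
from prefect.deployments import run_deployment


@flow
def dispatcher(max_attempts: int = 3):
    logger = get_run_logger()
    for attempt in range(1, max_attempts + 1):
        # timeout=None blocks until the child run reaches a terminal state
        child_run = run_deployment(name="my-flow/foo", timeout=None)
        state = child_run.state
        if state is not None and not state.is_crashed():
            # submission worked; whether the run then succeeded or failed
            # is the flow's own business, not a submission problem
            return child_run
        logger.warning("submission attempt %s crashed, retrying", attempt)
    raise RuntimeError("deployment foo was never submitted successfully")
```
the dispatcher itself still needs somewhere reliable to run, which is why i'd put it on long-lived infra if you can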
m
Gotcha, that makes sense, thank you! Would Prefect be open to a PR from me adding retries in the k8s worker code?
n
Would Prefect be open to a PR from me adding retries in the k8s worker code?
we love to see contributions! feel free to open a PR and implementation details can be discussed there 👍
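as a conversation starter, the shape of it could be something like wrapping the job-creation call in a retry decorator - purely a sketch of the idea, not how the worker code is actually laid out:
```python
# sketch only: retrying transient failures around k8s job creation with tenacity
from kubernetes import client
from kubernetes.client.rest import ApiException
from urllib3.exceptions import HTTPError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    # in practice you'd want to filter ApiException by status code so only
    # genuinely transient errors (timeouts, 5xx, connection resets) retry
    retry=retry_if_exception_type((ApiException, HTTPError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
    reraise=True,
)
def create_job_with_retries(job_manifest: dict, namespace: str):
    # assumes kubernetes.config.load_incluster_config() / load_kube_config() already ran
    batch = client.BatchV1Api()
    return batch.create_namespaced_job(namespace=namespace, body=job_manifest)
```
whether it belongs at the client config level or around the worker's submission call is exactly the kind of thing to hash out on the PR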
m
I'm realizing there isn't an automation action to retry a flow run? https://docs.prefect.io/latest/concepts/automations/#actions
n
correct, retries are a client-side thing generally for us (as in, if job submission fails, there's nothing to retry). (loop problem aside) you'd use the `Run a deployment` action
m
The flow-side retry isn't working with the Crashed state that my flow falls into 😞 it'd be ideal if the retry annotation on my flow would handle this and also solve the loop problem.
That said, I did find a prefect.kubernetes.pod.evicted event that I could trigger from, which makes me feel better about the loop problem
👍 1
but I don't want to repeat work the original flow has completed, and I have external consumers monitoring the original flow run ID, so I want a retry vs. a new run from run_deployment
Seems like a feature request
n
that pod-evicted event sounds like a good option off the top. yeah, generally flow / deployment have a separation of concerns in prefect 2 (much different than prefect 1 in that respect). to reiterate my edit above:
if job submission fails, there's nothing to retry
since flow retries are just talking about the flow's process basically, which never started if job submission failed
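and if you do go the pod-evicted route, the trigger side of that automation would look roughly like this - just a sketch of the reactive trigger fields, with threshold / within as the knobs for making sure an eviction storm doesn't turn into the loop you're worried about:
```python
# rough sketch of an automation's reactive trigger on pod evictions;
# tune threshold/within to guard against systemic retry loops
pod_evicted_trigger = {
    "posture": "Reactive",
    "expect": ["prefect.kubernetes.pod.evicted"],
    "threshold": 1,
    "within": 0,
}
```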
m
facepalm, sorry, I am conflating issues and forgot what this thread was about
I am trying to retry pod evictions now, not submission failures 😁
my b, sorry for confusing the thread
n
haha, it's all good