# prefect-kubernetes
m
Is there a way to add retries around k8s worker job submission? I sometimes get transient errors during job submission when the worker can't talk to k8s. I don't see any worker or work pool configuration for this, but maybe I am missing it. I guess I could add an automation that responds to Crashed flows?
n
yea, this is probably what I would reach for first
add an automation that responds to Crashed flows
i would just be careful about systematic failures in submission, so you don't enter some bad loop. perhaps have some threshold of crashed events, and a notification after some number of failed submissions? or something
m
Yeah, that is what I'm worried about! Automations seem a little heavy-handed for a transient HTTPS issue... I was looking to see if the Python k8s client supported retry configuration natively but came up empty-handed
n
hmmm - yeah, that SSL error is a little opaque to me 🧐
🕶️ 1
m
Ideally I'd like auto-resolution vs. notification / investigation / manual resolution, but point taken
👍 1
n
hmm, tomorrow i can ask our platform folks / k8s wizards how they typically do this, they might have a k8s-oriented strategy that's less heavy-handed - feel free to prod here if i forget
m
Sounds good, thank you!!
gentle bump if anyone has any thoughts!
n
im not sure we have a structured recommendation on how to handle that in general (outside of a careful reactive automation to re-submit). one idea i just had though: for flows where this submission is flaky, say you have some deployment `foo`, you could have a wrapping `dispatcher` flow (on long-lived infra perhaps? otherwise this one might have the same problem 🙂) that runs whenever `foo` was supposed to, and all it does is call `run_deployment`, check if it got submitted, and implement logic to handle it when it doesn't. depending on the scale of your submission problem, this might be overkill, but it would be an explicit way to have full control over submission to infrastructure / retries of that
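something like this very rough sketch - untested, and `my-flow/foo` plus the attempt count are placeholders for whatever your actual deployment and tolerance look like:
```python
# rough sketch of the dispatcher idea -- not tested, adjust names/limits for your setup
from prefect import flow, get_run_logger
from prefect.deployments import run_deployment


@flow
def dispatcher(max_attempts: int = 3):
    logger = get_run_logger()
    for attempt in range(1, max_attempts + 1):
        # timeout=None blocks until the child run reaches a terminal state
        child_run = run_deployment(name="my-flow/foo", timeout=None)
        state = child_run.state
        if state is not None and not state.is_crashed():
            # submission worked; whether the run then succeeded or failed
            # is the flow's own business, not a submission problem
            return child_run
        logger.warning("submission attempt %s crashed, retrying", attempt)
    raise RuntimeError("deployment foo was never submitted successfully")
```
the dispatcher itself still needs somewhere reliable to run, which is why i'd put it on long-lived infra if you can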
m
Gotcha, that makes sense, thank you! Would Prefect be open to a PR from me adding retries in the k8s worker code?
n
Would Prefect be open to a PR from me adding retries in the k8s worker code?
we love to see contributions! feel free to open a PR and implementation details can be discussed there 👍
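as a conversation starter, the shape of it could be something like wrapping the job-creation call in a retry decorator - purely a sketch of the idea, not how the worker code is actually laid out:
```python
# sketch only: retrying transient failures around k8s job creation with tenacity
from kubernetes import client
from kubernetes.client.rest import ApiException
from urllib3.exceptions import HTTPError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    # in practice you'd want to filter ApiException by status code so only
    # genuinely transient errors (timeouts, 5xx, connection resets) retry
    retry=retry_if_exception_type((ApiException, HTTPError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
    reraise=True,
)
def create_job_with_retries(job_manifest: dict, namespace: str):
    # assumes kubernetes.config.load_incluster_config() / load_kube_config() already ran
    batch = client.BatchV1Api()
    return batch.create_namespaced_job(namespace=namespace, body=job_manifest)
```
whether it belongs at the client config level or around the worker's submission call is exactly the kind of thing to hash out on the PR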
m
I'm realizing there isn't an automation action to retry a flow run? https://docs.prefect.io/latest/concepts/automations/#actions
n
correct, retries are a client-side thing generally for us (as in, if job submission fails, there's nothing to retry). (loop problem aside) you'd use the `Run a deployment` action
m
The flow-side retry isn't working with the Crashed state that my flow falls into 😞 it'd be ideal if the retry annotation on my flow would handle this and also solve the loop problem.
That said, I did find a prefect.kubernetes.pod.evicted event that I could trigger from, which makes me feel better about the loop problem
👍 1
but I don't want to repeat work the original flow has completed, and I have external consumers monitoring the original flow run ID, so I want a retry vs. a new run from run_deployment
Seems like a feature request
n
that pod-evicted event sounds like a good option off the top. yeah, generally flow / deployment have a separation of concerns in prefect 2 (much different than prefect 1 in that respect). to reiterate my edit above:
if job submission fails, there's nothing to retry
since flow retries are just talking about the flow's process basically, which never started if job submission failed
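and if you do go the pod-evicted route, the trigger side of that automation would look roughly like this - just a sketch of the reactive trigger fields, with threshold / within as the knobs for making sure an eviction storm doesn't turn into the loop you're worried about:
```python
# rough sketch of an automation's reactive trigger on pod evictions;
# tune threshold/within to guard against systemic retry loops
pod_evicted_trigger = {
    "posture": "Reactive",
    "expect": ["prefect.kubernetes.pod.evicted"],
    "threshold": 1,
    "within": 0,
}
```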
m
facepalm, sorry, I am conflating issues and forgot what this thread was about
I am trying to retry pod evictions now, not submission failures 😁
my b, sorry for confusing the thread
n
haha, it's all good