Hi,
I have an issue with the stability of the Prefect Server running on K8s. Sometimes the server gets killed and is restarted within a few seconds. After the restart it works perfectly fine, but all flows running at that time crash with an httpcore.ConnectError exception.
Do you know if there is a way to tell the executor to retry after a few seconds to prevent the flow from crashing? I've already tried PREFECT_CLIENT_MAX_RETRIES, but it looks like it does not work for the httpcore.ConnectError exception.
Thanks :)
Tom Jordahl
11/20/2024, 9:23 PM
Not sure if this is directly related to what you are seeing, but you need to make sure your pods are not evicted/restarted while flows are running. Prefect flows are not microservices, which is what K8s is designed for. I had to set some labels on the software my clusters run.
Maciej Kluczny
11/21/2024, 9:27 AM
Yeah, I've already done that, but there are still cases where the server goes down for a very short time and the workers/executors are not able to survive. The flow runs end up in a CRASHED state, or sometimes even in PENDING or RUNNING.
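
For reference, the "retry after a few seconds" behaviour asked about above can be sketched as a generic retry-with-backoff wrapper. This is only a minimal illustration, not a Prefect feature: `call_with_retry` and its parameters are hypothetical names, and it only covers calls you make yourself inside a flow. It does not hook into Prefect's internal API client, so it may not prevent the crash described in this thread. The default catches Python's built-in `ConnectionError`; to catch `httpcore.ConnectError` you would pass that class in `retry_on`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying with exponential backoff on the given exceptions.

    Hypothetical helper for illustration. Pass e.g.
    retry_on=(httpcore.ConnectError,) to cover the error from this thread.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With the defaults this waits 1, 2, 4 and 8 seconds between the five attempts, which should be enough to ride out a restart that takes a few seconds.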