Hi, I have an issue with the stability of the Pre...
# ask-community
m
Hi, I have an issue with the stability of the Prefect Server running on K8s. Sometimes the server gets killed and restarts within a few seconds. After the restart it works perfectly fine, but all flows that were running at that time crash with an httpcore.ConnectError exception. Do you know if there is a way to tell the executor to retry after a few seconds so the flows don't crash? I've already tried PREFECT_CLIENT_MAX_RETRIES, but it does not seem to apply to the httpcore.ConnectError exception. Thanks :)
t
Not sure if this is directly related to what you are seeing, but you need to make sure your pods are not evicted/restarted while flows are running. K8s is designed for microservices, and Prefect flows are not microservices. I had to set some labels for the software my clusters run.
m
Yeah, I did that already, but there are still cases where the server goes down for a very short time and the workers/executors do not survive it. The flow runs end up in the CRASHED state, or sometimes even stay in PENDING or RUNNING.
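
For reference, the "retry after a few seconds" behaviour asked about above can be sketched as a small generic wrapper. This is only an illustration under stated assumptions: `retry_on` and `call_server` are hypothetical names, not Prefect APIs, and the built-in `ConnectionError` stands in for `httpcore.ConnectError` so the snippet has no dependencies. It only helps for calls made from your own code; it does not change how Prefect's internal client handles a server restart, which this thread does not resolve.

```python
import time
from functools import wraps


def retry_on(exc_types, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry the wrapped call when it raises one of exc_types.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last error once all attempts are used up.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except exc_types:
                    # Out of attempts: let the original error propagate.
                    if attempt == attempts - 1:
                        raise
                    sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


# Hypothetical stand-in for a call that fails while the server restarts.
calls = {"n": 0}


@retry_on(ConnectionError, attempts=4, base_delay=0.01)
def call_server():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server restarting")
    return "ok"
```

In real use you would pass the exception class you actually see in the traceback (for example `httpcore.ConnectError`) instead of `ConnectionError`, and pick `attempts` and `base_delay` so the total wait covers the few seconds the server needs to come back.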