Prefect Community

Hi,

I have an issue with the stability of the Prefect Server running on K8s. Sometimes the server gets killed and is restarted within a few seconds. After that it works perfectly fine, but all flows running at that time crash with an httpcore.ConnectError exception.
Do you know if there is a way to tell the executor to retry after a few seconds to prevent the flow crash? I've already tried PREFECT_CLIENT_MAX_RETRIES, but it does not seem to apply to the httpcore.ConnectError exception.

Thanks :)
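
To make the "retry after a few seconds" idea concrete, here is a minimal sketch of the generic pattern in plain Python. It is not a Prefect setting and it does not change how Prefect's own client talks to the server. The helper name `call_with_retry` and all of its parameters are hypothetical, and the built-in `ConnectionError` only stands in for whatever exception your own code actually sees.

```python
import time


def call_with_retry(func, *, retry_on=(ConnectionError,), attempts=5,
                    base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff.

    Hypothetical helper for illustration only; not part of Prefect.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            # Out of attempts: let the original exception propagate.
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay, capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)
```

This only helps for calls your own code makes, for example by wrapping them as `call_with_retry(my_call, retry_on=(httpcore.ConnectError,))`, assuming that exception reaches your code unwrapped. Whether it keeps a flow run alive through a server restart depends on where the error is raised.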

Not sure if this is directly related to what you are seeing, but you need to make sure your pods are not evicted/restarted while flows are running. Prefect flows are not microservices, which is what K8s is designed for. I had to set some labels for the software my clusters run

Yeah, I already did that, but there are still cases when the server goes down for a very short time and the workers/executors do not survive. They end up in the CRASHED state, or sometimes even in PENDING or RUNNING.