# ask-community
c
Anyone else deal with "No heartbeat" errors? Just curious about possible solutions and causes other than lack of compute resources. I've been load testing overnight and two pipelines just failed suddenly after 600+ runs.
j
Lack of compute resources or a network issue would be my first guess. Usually this is due to the flow run process being killed by the backing platform (e.g. k8s).
c
In my run history it failed 3 times in a row (the first failures in over 600 pipeline runs) and then succeeded on its fourth attempt, so if it is something to do with our cluster then I need to investigate hmm...
c
We experienced a lot of these errors as well (on a GKE cluster), especially since bumping from 0.13 to 0.14, without any clear sign of origin. What seemed to help was increasing all kinds of limits, but an uneasy feeling remains …
c
@Clemens Thanks for the insight, that's really good to know! We're on EKS and I figured autoscaling would sniff out a resource shortage, but I'll report back to the Slack if I can isolate and identify any abnormalities.
@Clemens, I found an open issue on github referring to this issue: https://github.com/PrefectHQ/prefect/issues/3058#issuecomment-770978409
@Jim Crist-Harif, hope it's okay to poke you to inquire about whether this has been discussed further internally since the ticket hasn't been closed.
j
Nope, our recommendation for users using the cluster autoscaler is to add the appropriate annotation to k8s to prevent eviction:
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
You can do that by adding it to your default job template on the agent, or by adding it to your KubernetesRun run configs.
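For reference, here is a minimal sketch of what such a job template could look like in Python. The field names follow the standard Kubernetes Job spec; the `KubernetesRun(job_template=...)` usage shown in the comment is an assumption based on how Prefect 0.14-era run configs were typically wired up, so check it against your Prefect version.

```python
# Sketch of a Kubernetes job template carrying the safe-to-evict
# annotation, set on both the Job and the pod template so the
# cluster autoscaler won't evict the flow-run pod mid-run.
ANNOTATION = "cluster-autoscaler.kubernetes.io/safe-to-evict"

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "annotations": {ANNOTATION: "false"},
    },
    "spec": {
        "template": {
            # Pod-level annotations are what the autoscaler inspects.
            "metadata": {"annotations": {ANNOTATION: "false"}},
        },
    },
}

# Hypothetical Prefect usage (requires prefect installed, not shown here):
#   from prefect.run_configs import KubernetesRun
#   flow.run_config = KubernetesRun(job_template=job_template)

pod_annotations = job_template["spec"]["template"]["metadata"]["annotations"]
print(pod_annotations[ANNOTATION])
```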
c
Thank you! Will give it a go; I was hesitant since the ticket was still open!