# ask-community
c
Anyone else deal with "No heartbeat" errors? Just curious about possible solutions and causes other than lack of compute resources. I've been load testing overnight and two pipelines just failed suddenly after 600+ runs.
j
Lack of compute resources or a network issue would be my first guess. Usually this is due to the flow run process being killed by the backing platform (e.g. k8s).
c
In my run history it failed 3 times in a row (the first failures in over 600 pipeline runs) and then succeeded on its fourth attempt, so if it is something to do with our cluster then I need to investigate hmm...
c
We experienced a lot of these errors as well (on a GKE cluster), especially since bumping from 0.13 to 0.14, without any clear sign of origin. What seemed to help was increasing all kinds of limits, but an uneasy feeling remains …
c
@Clemens Thanks for the insight, that's really good to know! We're on EKS and I figured autoscaling would sniff out a resource shortage, but I'll report back to the Slack if I can isolate and identify any abnormalities.
@Clemens, I found an open issue on github referring to this issue: https://github.com/PrefectHQ/prefect/issues/3058#issuecomment-770978409
@Jim Crist-Harif, hope it's okay to poke you to inquire about whether this has been discussed further internally since the ticket hasn't been closed.
j
Nope, our recommendation for users using the cluster autoscaler is to add the appropriate annotation to k8s to prevent eviction:
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
You can do that by adding it to your default job template on the agent, or by adding it to your KubernetesRun run configs.
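For reference, here is a minimal sketch of what such a job template could look like in Python. The field names follow the standard Kubernetes Job spec; the `KubernetesRun(job_template=...)` usage shown in the comment is an assumption based on how Prefect 0.14-era run configs were typically wired up, so check it against your Prefect version.

```python
# Sketch of a Kubernetes job template carrying the safe-to-evict
# annotation, set on both the Job and the pod template so the
# cluster autoscaler won't evict the flow-run pod mid-run.
ANNOTATION = "cluster-autoscaler.kubernetes.io/safe-to-evict"

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "annotations": {ANNOTATION: "false"},
    },
    "spec": {
        "template": {
            # Pod-level annotations are what the autoscaler inspects.
            "metadata": {"annotations": {ANNOTATION: "false"}},
        },
    },
}

# Hypothetical Prefect usage (requires prefect installed, not shown here):
#   from prefect.run_configs import KubernetesRun
#   flow.run_config = KubernetesRun(job_template=job_template)

pod_annotations = job_template["spec"]["template"]["metadata"]["annotations"]
print(pod_annotations[ANNOTATION])
```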
c
Thank you! Will give it a go; I was hesitant since the ticket was still open!