Hello, we are intermittently getting “No heartbeat detected from the remote task, marking the run as failed.” Searching the Slack channel, we found two suggested solutions:
1. Adding the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false".
2. Disabling the Lazarus toggle in the Prefect UI for our Plan Data Loading Flow -> Done
We applied both but are still getting the same error. What is your recommendation?
05/05/2021, 2:24 AM
Hi Ismail - are you running your flows in an environment that allows for pre-emption or eviction of the jobs?
05/05/2021, 3:02 AM
Yes, we do have eviction. Can you please share the Kubernetes command to add the annotations? Maybe we did it wrong.
05/05/2021, 5:21 AM
Sure thing - you’ll need to use a custom job template file for this particular annotation: https://docs.prefect.io/orchestration/agents/kubernetes.html#custom-job-template
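For illustration, here is a minimal sketch of what such a job template might contain, built as a Python dict and printed as JSON. It assumes the annotation belongs on the pod template metadata (the cluster autoscaler reads it from the pod, not from the Job) and that the flow container is named "flow" as in Prefect's default template. The helper name build_job_template is hypothetical, and how you hand the template to the agent or run config should be checked against the docs linked above.

```python
import json

# Annotation read by the Kubernetes cluster autoscaler.
ANNOTATION_KEY = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def build_job_template():
    """Return a minimal Job manifest with the safe-to-evict annotation.

    Hypothetical helper for illustration only. The annotation sits under
    spec.template.metadata so that it ends up on the pod itself.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "spec": {
            "template": {
                "metadata": {
                    # Annotation values must be strings, so "false", not False.
                    "annotations": {ANNOTATION_KEY: "false"},
                },
                "spec": {
                    # Assumption: container name "flow" matches the default template.
                    "containers": [{"name": "flow"}],
                },
            }
        },
    }


job_template = build_job_template()
print(json.dumps(job_template, indent=2))
```

Once a run is submitted, something like `kubectl get pod <pod-name> -o yaml` should show whether the annotation actually reached the pod.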
I’d like to clarify that this isn’t a Prefect error but rather Prefect alerting you to an event that affected how your Flow is running (pod eviction); when the process dies and your run stops sending heartbeats, Prefect will mark all runs which claim to be running as “Failed” and then Lazarus will resurrect the Flow to continue its run from that point forward. Alternatively, you can configure all of your tasks to have retry settings that Prefect will respect (instead of marking the runs as “Failed”) so that your tasks have a chance to rerun.
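As a rough sketch of the retry option, the settings can be kept in one place and applied to each task. The values (3 retries, 1 minute delay) and the task name load_plan_data are made up for illustration. The commented usage assumes the Prefect 0.14/1.x task decorator, which takes max_retries and retry_delay and requires retry_delay whenever max_retries is greater than zero.

```python
from datetime import timedelta

# Illustrative values only; tune them to how long an evicted pod takes to come back.
RETRY_SETTINGS = {
    "max_retries": 3,
    "retry_delay": timedelta(minutes=1),
}

# With Prefect 0.14/1.x installed, the settings would be applied like this
# (load_plan_data is a hypothetical task name):
#
#   from prefect import task
#
#   @task(**RETRY_SETTINGS)
#   def load_plan_data():
#       ...

print(RETRY_SETTINGS)
```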
The most important thing I’m trying to highlight is that if you turn off heartbeats or turn off Lazarus, you aren’t fixing anything but rather you are choosing to ignore the eviction event and prevent Prefect from helping you recover this run — instead, if you do turn these settings off, the outstanding tasks that were running will remain in a “Running” state indefinitely until you manually intervene.