# ask-community
j
Hi, can someone explain how exactly heartbeats work? I have a flow that reads a ton of data from Snowflake and writes it to a series of TSV files, and about 20 minutes into the run it gets marked as failed by the Zombie killer, even though I know it's still running. It may be helpful to mention that it's running through a nested for loop. Digging a little deeper, I noticed that the code was printing logs about once every minute, and since the Zombie killer only acts after 2 minutes without a signal, I find it odd that it kicks in. Also, the code ran successfully when heartbeats were turned off; I'm just wondering if there is a better alternative in case this happens in the future.
k
Hey @Justin Liu, heartbeats work by spawning a lightweight subprocess on the flow runner that polls the API to report that the task is still running. It is a separate subprocess from the one executing the task. Normally a failed heartbeat indicates that something happened to the Flow, so it signals a failure; otherwise Flows that lose connection would be marked as running forever. In this case, though, it seems that the subprocess may be exiting early for some reason. We have seen this in some cases where there is a long query to some API or DB. The workaround for now would be to split off the query as its own Flow, turn off heartbeats there, and then call the subflow from the main flow. If you are on 0.15.2, running the flow again on that version may help us diagnose this; the team has added some changes that allow the exceptions to be recorded. Do you see anything else in the logs?
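To make the description above concrete, here is a rough sketch of the idea in plain Python. It is not Prefect's actual implementation: touching a file stands in for the API call, and `ZOMBIE_TIMEOUT`, `HEARTBEAT_LOOP`, `looks_like_zombie`, and `demo` are hypothetical names and values chosen for illustration. It shows that the monitor only looks at heartbeats, so a task that is still working gets flagged once its heartbeat process is gone.

```python
import os
import subprocess
import sys
import tempfile
import time

# Hypothetical value for illustration; not Prefect's real setting.
ZOMBIE_TIMEOUT = 1.0  # seconds without a heartbeat before the run counts as dead

# Child process: touch a file every 0.1 s (a stand-in for polling the API).
HEARTBEAT_LOOP = (
    "import pathlib, sys, time\n"
    "while True:\n"
    "    pathlib.Path(sys.argv[1]).touch()\n"
    "    time.sleep(0.1)\n"
)


def looks_like_zombie(heartbeat_file, timeout=ZOMBIE_TIMEOUT):
    """Return True when the last heartbeat is older than `timeout` seconds."""
    return time.time() - os.path.getmtime(heartbeat_file) > timeout


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    # The heartbeat runs in its own process, separate from the "task" below.
    heartbeat = subprocess.Popen([sys.executable, "-c", HEARTBEAT_LOOP, path])
    try:
        time.sleep(0.5)  # the task is busy; heartbeats keep arriving
        alive_while_beating = not looks_like_zombie(path)

        heartbeat.kill()  # the heartbeat process dies, the task does not
        heartbeat.wait()
        time.sleep(ZOMBIE_TIMEOUT + 0.5)  # the task keeps working in silence
        flagged_after_kill = looks_like_zombie(path)
    finally:
        if heartbeat.poll() is None:
            heartbeat.kill()
        os.remove(path)
    return alive_while_beating, flagged_after_kill


if __name__ == "__main__":
    print(demo())
```

In this sketch the second value becomes true even though the "task" never stopped, which is the same shape as a healthy run being marked failed after its heartbeat process exits early.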
j
How do you call the subflow from the main flow? When I ran it before, after it got revived by the Lazarus process, it kept logging
Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">
Other than that, whether heartbeats are turned on or off, the logs always end with
Flow run SUCCESS: all reference tasks succeeded
Heartbeat process died with exit code -9
I'll try running it again with heartbeats on now that I'm on 0.15.2.
k
You can use the StartFlowRun task to start that flow.
j
I just did a quick run, but it finished with the same messages as before.
k
Thanks for the info! Will ask the team about this.
z
It's possible that the heartbeat is being killed before it can send its logs out. An exit code of -9 means the process was killed by signal 9 (SIGKILL), which cannot be caught and so does not let it clean up before exiting. Perhaps it's being killed due to memory constraints?
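For reference, here is a small sketch of where a "-9" like the one in the log can come from. It assumes a POSIX system (Linux or macOS) and uses only the standard library; the sleeping child is a stand-in for the heartbeat process, not Prefect code. Python's `subprocess` reports a child that was terminated by a signal as the negative signal number.

```python
import signal
import subprocess
import sys

# Start a child that would sleep for a minute (stand-in for a heartbeat process).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Kill it with SIGKILL, the same signal an out-of-memory killer would send.
child.send_signal(signal.SIGKILL)
child.wait()

# On POSIX the return code is the negative signal number.
print(child.returncode)
```

Because SIGKILL ends the process immediately, the child gets no chance to flush logs or report an exception first.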
j
I tried running the task with more allocated memory, and no errors came up. It is strange to me, though, that the task could still run successfully if it was getting marked as failed due to memory constraints.
z
The memory reaper is not always easy to reason about; it looks like it's killing the heartbeat process but not your task process.