Matt Allen

10/01/2020, 3:57 PM
Is there somewhere I can find more info about task health checks? I've got some flows failing because of them and it's not clear what might actually cause that. Are these healthchecks run in a thread from the flow process or something like that?

Chris White

10/01/2020, 4:00 PM
Hi Matt - unfortunately it is not documented very well, but the heartbeats are run in a subprocess. Code is here: https://github.com/PrefectHQ/prefect/blob/master/src/prefect/utilities/executors.py#L40-L73 The command that is run is a prefect CLI command
prefect heartbeat task-run -i TASK_RUN_ID
similarly with the flow runner process, except it will use the CLI command
prefect heartbeat flow-run -i FLOW_RUN_ID
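For readers following along, here is a minimal sketch of what launching such a heartbeat in a subprocess could look like. It is not the code from the linked `executors.py`; the helper names `heartbeat_command` and `start_heartbeat` are hypothetical, and only the two CLI commands quoted above come from this thread.

```python
import subprocess
from typing import List


def heartbeat_command(kind: str, run_id: str) -> List[str]:
    """Build the CLI invocation quoted in the thread.

    `kind` is "task-run" or "flow-run"; this helper is hypothetical.
    """
    if kind not in ("task-run", "flow-run"):
        raise ValueError(f"unknown heartbeat kind: {kind!r}")
    return ["prefect", "heartbeat", kind, "-i", run_id]


def start_heartbeat(kind: str, run_id: str) -> subprocess.Popen:
    """Launch the heartbeat as a separate process (hypothetical helper).

    Because it is its own process, work done in the task itself does not
    block it directly; it stops when the container or pod goes away.
    """
    return subprocess.Popen(
        heartbeat_command(kind, run_id),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Calling `start_heartbeat("task-run", "TASK_RUN_ID")` would require the `prefect` CLI to be on the PATH; the caller is responsible for terminating the returned process when the run finishes.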

Matt Allen

10/01/2020, 4:34 PM
Hmm, ok. So for something to block the heartbeat process in theory it would need to be eating the entire CPU budget for the job, which seems unlikely for what I'm doing
How many heartbeats are allowed to fail before the agent will kill the task?

Chris White

10/01/2020, 4:36 PM
So interestingly the agent never kills the task - these heartbeats are sent directly to the backend API, and the “zombie killer” will mark them as failed if they aren’t updated after 2 minutes (which I believe corresponds to 4 heartbeat misses). How consistently is it happening?
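The staleness rule described above can be sketched roughly as follows. This is an illustration only, not the backend's actual implementation: the function `is_zombie` and the constants are hypothetical, and the 30-second interval is inferred from "2 minutes corresponds to 4 misses" rather than stated in the thread.

```python
from datetime import datetime, timedelta

# From the thread: runs are marked failed if not updated for 2 minutes.
ZOMBIE_TIMEOUT = timedelta(minutes=2)
# From the thread ("I believe"): that corresponds to 4 missed heartbeats.
ASSUMED_MISSES = 4
# Inferred, not stated: the heartbeat interval this arithmetic implies.
IMPLIED_INTERVAL = ZOMBIE_TIMEOUT / ASSUMED_MISSES


def is_zombie(last_heartbeat: datetime, now: datetime,
              timeout: timedelta = ZOMBIE_TIMEOUT) -> bool:
    """Return True if the last heartbeat is older than the timeout."""
    return now - last_heartbeat > timeout
```

Under this reading, a pod that dies silently would keep showing as running for up to about two minutes after its final heartbeat before being marked failed.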

Matt Allen

10/01/2020, 4:45 PM
On one hourly flow I have 3 in the last day
We're having some general consistency issues around this flow so it's possible this is a symptom of another problem (like the database getting locked during the run), but if it's running in a subprocess I wouldn't think that would cause missed heartbeats
Unless the job is getting killed by kube and not reporting its state change to the server...

Chris White

10/01/2020, 4:51 PM
yea, I’m open to adding a DEBUG level log to that CLI call to see if we can capture anything. Also a k8s eviction of some type could definitely be happening - evictions are hard to capture, although we are trying to increase visibility into those sorts of events

Matt Allen

10/01/2020, 5:28 PM
On further investigation I see the task just stops logging in Datadog at the same time the log ends in Prefect. Looks like this is an issue where the pod died for some reason and it just took a while for Prefect to notice. Thanks for the help

Chris White

10/01/2020, 5:45 PM
Anytime! Let us know if you see a way for the Agent to detect your situation and create an ERROR / CRITICAL log for the flow run