# prefect-community
d
Trying to understand why some flows die due to the Zombie Killer when other, similar flows don't. Sometimes I see tasks succeed after 5 minutes or more without being killed, while others are killed after 3 minutes or so. With the Zombie Killer enabled, does every @task-wrapped function need to complete in < 2 minutes or risk being terminated? For example, if a task includes a subprocess.run() that does some long-running work, is the heartbeat blocked until subprocess.run() completes?
👀 1
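For illustration, a minimal sketch of the kind of task the question describes; the task name and the command are made up for this example, not taken from the thread:
Copy code
import subprocess

from prefect import task


@task
def long_export_step():
    # Long-running external work: the question is whether a blocking call
    # like this prevents Prefect Core from sending its regular heartbeats.
    subprocess.run(["python", "export_job.py"], check=True)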
d
Hi @Dan DiPasquo! Heartbeats are sent by Prefect Core every 30 seconds and are used to confirm the flow run and its associated task runs are healthy; runs missing four heartbeats in a row (roughly two minutes, which is where the window you mentioned comes from) are marked as Failed by the Zombie Killer. You can read more about heartbeats here: https://docs.prefect.io/orchestration/concepts/flows.html#flow-settings You can turn off heartbeats on the settings page for your flow.
I’m not sure about `subprocess.run()`; let me check with my team and get back to you
@Dan DiPasquo Heartbeats run in their own subprocess and are not blocking for calls to `subprocess.run()`
d
Hi @Dylan Thank you. Yes, I read that section of the docs; I suppose my question is, then, what sort of things DO prevent Core from sending a keep-alive heartbeat, such that it might narrow down my search for the underlying cause of my flows being killed?
d
We once experienced some problems with heartbeats as they relate to version locking and Dask
If you turn them off, you can set timeouts on your tasks so they don’t run forever
But most often, tasks die because they run out of memory or other resources on the node they’re running on
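As a sketch of the timeout suggestion above (the task name and the 10-minute value are illustrative), Prefect Core tasks accept a timeout in seconds:
Copy code
from prefect import task


@task(timeout=600)  # fail the task run if it takes longer than 10 minutes
def long_running_step():
    ...  # the actual work would go here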
d
I have found some log errors concurrent with tasks that have been killed by the Zombie Killer, e.g.
Copy code
Failed to set task state with error: ClientError([{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'errors': [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [], 'path': ['set_task_run_states']}]}}}])
T
Copy code
2020-05-14T22:31:52.702149+00:00  ERROR    Heartbeat failed for Task 'delete_local_file'
                                           Traceback (most recent call last):
                                             File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 64, in _heartbeat
                                               self.client.update_task_run_heartbeat(task_run_id)
                                             File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 910, in update_task_run_heartbeat
                                               self.graphql(mutation, raise_on_error=True)
                                             File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 228, in graphql
                                               raise ClientError(result["errors"])
                                           prefect.utilities.exceptions.ClientError: [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [{'line': 2, 'column': 5}], 'path': ['update_task_run_heartbeat'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'errors': [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [], 'path': ['update_task_run_heartbeat']}]}}}]
c
Hey Dan, at first glance that looks like a networking issue on our side 🧐
d
These are Flows that seem to be successfully sending status changes both before and after (by wall clock / log time) the point where these errors occur in the logs.
c
is this consistently happening?
d
I'd say it happens regularly, but not consistently. As we've recently scaled up, ~10% of our flow runs end with the Zombie Killer
c
Hmmmm that sounds suspicious — could you describe the runtime environment for these flows? K8s / Dask / Docker / Local / etc.?
d
Reruns of the same flows usually succeed.
k8s via GKE
It's usually one of several longer-running tasks (makes sense, more heartbeats), but it's also been observed in a task that should be very short-lived.
There's a good run of these ClientErrors in logs @ 2020-05-14T22:31:52.485472+00:00
give or take 100ms
🙂
c
Yea — those client errors appear to correlate with some increased error rates on our side; apologies for that, we’re looking into what might have triggered it. In the meantime, it sounds like you have a zombie issue regardless — I recommend turning off heartbeats for now. If you’re able to share the docker image you’re running on that might be useful here — my suspicion is that there is some resource limit we’re hitting with respect to the number of subprocesses that can run smoothly
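For reference, a hedged sketch of turning heartbeats off programmatically rather than via the toggle on the flow's settings page; the disable_flow_heartbeat mutation name and the placeholder flow ID are assumptions, not something confirmed in this thread:
Copy code
from prefect import Client

client = Client()

# Assumed GraphQL mutation name; "<flow-id>" is a placeholder for the ID
# of the flow whose heartbeats should be turned off.
client.graphql(
    """
    mutation {
      disable_flow_heartbeat(input: { flow_id: "<flow-id>" }) {
        success
      }
    }
    """
)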
d
Yes I don't find these ClientErrors in the logs of every flow killed by Zombie Killer so there may be more than one thing going on.
I don't know about sharing the docker image; I will check with colleagues. I think it would be very hard for you to run the flow as it reaches into GCS for specific resources, etc. In the meantime I may disable the Zombie Killer as suggested and see if it illuminates anything.
Thanks for your help!
c
Anytime! Keep us in the loop if you see anything else that seems relevant here; we actually have a deep dive scheduled on Monday with another user who sounds like they’re running into a similar issue so we can report back if we learn anything from that
d
Thanks, if I figure anything out or even find any more clues I'll let you know.
👍 1
By way of another clue, I ran another set of similar Flow runs, this time with the Zombie Killer disabled: 96 out of 98 succeeded, with 2 running indefinitely. When I dug into those two, I found that a task that is a typical victim of the Zombie Killer when it's enabled this time never completed successfully, but never Failed either. In fact, logs show that task.run() was called a second time (not seen in the 96 successful flow runs), and bizarrely, the second run took place in a different k8s pod! Log:
Copy code
2020-05-15T01:14:47.611286+00:00  INFO     Task '<****>': Starting task run...
2020-05-15T01:14:50.682754+00:00  DEBUG    Task '<****>': 751 candidate cached states were found
2020-05-15T01:14:50.683424+00:00  DEBUG    Task '<****>': can't use cache because no candidate Cached states were valid
..... repeat log 749 times...
2020-05-15T01:14:50.68382+00:00   DEBUG    Task '<****>': can't use cache because no candidate Cached states were valid
2020-05-15T01:14:50.803267+00:00  DEBUG    Task '<****>': Handling state change from Pending to Running
2020-05-15T01:14:51.055126+00:00  DEBUG    Task '<****>': Calling task.run() method...
2020-05-15T01:19:16.827716+00:00  INFO     Beginning Flow run for '<****>>-production'
2020-05-15T01:19:17.488745+00:00  INFO     Task '<****>': Starting task run...
2020-05-15T01:19:17.489122+00:00  DEBUG    Task '<****>': task is already running.
2020-05-15T01:19:17.497588+00:00  INFO     Task '<****>': finished task run for task with final state: 'Running'
The first bit of that logging traced to one k8s pod, ending suddenly at Calling task.run() method... and the second part traced to a different pod that seemingly did a second flow.run() on the same Flow and immediately started with a task that typically has upstream dependencies... so I think this is a clue, but I'm at a loss from here right now.
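For context on the "candidate cached states" lines in that log, this is the general shape of a task with output caching configured in Prefect Core; the task name and duration are illustrative, and whether it matches the actual flow's configuration is an assumption:
Copy code
import datetime

from prefect import task


@task(cache_for=datetime.timedelta(hours=24))
def cached_step():
    # Prior Cached states within the cache window become the "candidate
    # cached states" that Prefect checks for validity before re-running.
    ...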