# prefect-community
d
Trying to understand why some flows die due to the Zombie Killer when other, similar flows don't. Sometimes I see tasks succeed after 5 minutes or more without being killed, while others are killed after 3 minutes or so. With the Zombie Killer enabled, does every @task-wrapped function need to complete in < 2 minutes or risk being terminated? For example, if a task includes a subprocess.run() that does some long-running work, is the heartbeat blocked until subprocess.run() completes?
👀 1
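For illustration, a minimal sketch of the kind of task the question describes; the task name and the command are made up for this example, not taken from the thread:
Copy code
import subprocess

from prefect import task


@task
def long_export_step():
    # Long-running external work: the question is whether a blocking call
    # like this prevents Prefect Core from sending its regular heartbeats.
    subprocess.run(["python", "export_job.py"], check=True)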
d
Hi @Dan DiPasquo! Heartbeats are sent by Prefect Core every 30 seconds and are used to confirm the flow run and its associated task runs are healthy; runs missing four heartbeats in a row (roughly two minutes, which is where the window you mentioned comes from) are marked as Failed by the Zombie Killer. You can read more about heartbeats here: https://docs.prefect.io/orchestration/concepts/flows.html#flow-settings You can turn off heartbeats on the settings page for your flow.
I’m not sure about `subprocess.run()`; let me check with my team and get back to you
@Dan DiPasquo Heartbeats run in their own subprocess and are not blocking for calls to `subprocess.run()`
d
Hi @Dylan Thank you. Yes, I read that section of the docs; I suppose my question is, then, what sort of things DO prevent Core from sending a keep-alive heartbeat, such that it might narrow down my search for the underlying cause of my flows being killed?
d
We once experienced some problems with heartbeats as they relate to version locking and Dask
If you turn them off, you can set timeouts on your tasks so they don’t run forever
But most often, tasks die because they run out of memory or other resources on the node they’re running on
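As a sketch of the timeout suggestion above (the task name and the 10-minute value are illustrative), Prefect Core tasks accept a timeout in seconds:
Copy code
from prefect import task


@task(timeout=600)  # fail the task run if it takes longer than 10 minutes
def long_running_step():
    ...  # the actual work would go here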
d
I have found some log errors concurrent with tasks that have been killed by the Zombie Killer, e.g.
Copy code
Failed to set task state with error: ClientError([{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'errors': [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [], 'path': ['set_task_run_states']}]}}}])
T
Copy code
2020-05-14T22:31:52.702149+00:00  ERROR    Heartbeat failed for Task 'delete_local_file'
                                           Traceback (most recent call last):
                                             File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 64, in _heartbeat
                                               self.client.update_task_run_heartbeat(task_run_id)
                                             File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 910, in update_task_run_heartbeat
                                               self.graphql(mutation, raise_on_error=True)
                                             File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 228, in graphql
                                               raise ClientError(result["errors"])
                                           prefect.utilities.exceptions.ClientError: [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [{'line': 2, 'column': 5}], 'path': ['update_task_run_heartbeat'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'errors': [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [], 'path': ['update_task_run_heartbeat']}]}}}]
c
Hey Dan, at first glance that looks like a networking issue on our side 🧐
d
These are Flows that seem to be successfully sending status changes both before and after (by wall clock / log time) the point where these errors occur in the logs.
c
is this consistently happening?
d
I'd say it happens regularly, but not consistently. As we've recently scaled up, ~10% of our flow runs end with the Zombie Killer
c
Hmmmm that sounds suspicious — could you describe the runtime environment for these flows? K8s / Dask / Docker / Local / etc.?
d
Reruns of the same flows usually succeed.
k8s via GKE
It's usually one of several longer-running tasks (makes sense, more heartbeats), but it's also been observed in a task that should be very short-lived.
There's a good run of these ClientErrors in logs @ 2020-05-14T22:31:52.485472+00:00
give or take 100ms
🙂
c
Yea — those client errors appear to correlate with some increased error rates on our side; apologies for that, we’re looking into what might have triggered it. In the meantime, it sounds like you have a zombie issue regardless — I recommend turning off heartbeats for now. If you’re able to share the docker image you’re running on that might be useful here — my suspicion is that there is some resource limit we’re hitting with respect to the number of subprocesses that can run smoothly
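For reference, a hedged sketch of turning heartbeats off programmatically rather than via the toggle on the flow's settings page; the disable_flow_heartbeat mutation name and the placeholder flow ID are assumptions, not something confirmed in this thread:
Copy code
from prefect import Client

client = Client()

# Assumed GraphQL mutation name; "<flow-id>" is a placeholder for the ID
# of the flow whose heartbeats should be turned off.
client.graphql(
    """
    mutation {
      disable_flow_heartbeat(input: { flow_id: "<flow-id>" }) {
        success
      }
    }
    """
)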
d
Yes I don't find these ClientErrors in the logs of every flow killed by Zombie Killer so there may be more than one thing going on.
I don't know about sharing the docker image; I will check with colleagues. I think it would be very hard for you to run the flow as it reaches into GCS for specific resources, etc. In the meantime I may disable the Zombie Killer as suggested and see if it illuminates anything.
Thanks for your help!
c
Anytime! Keep us in the loop if you see anything else that seems relevant here; we actually have a deep dive scheduled on Monday with another user who sounds like they’re running into a similar issue so we can report back if we learn anything from that
d
Thanks, if I figure anything out or even find any more clues I'll let you know.
👍 1
By way of another clue, I ran another set of similar Flow runs, this time with the Zombie Killer disabled: 96 out of 98 succeeded, with 2 running indefinitely. When I dug into those two, I found that a task that is a typical victim of the Zombie Killer when it's enabled this time never completed successfully, but never Failed either. In fact, logs show that task.run() was called a second time (not seen in the 96 successful flow runs), and bizarrely, the second run took place in a different k8s pod! Log:
Copy code
2020-05-15T01:14:47.611286+00:00  INFO     Task '<****>': Starting task run...
2020-05-15T01:14:50.682754+00:00  DEBUG    Task '<****>': 751 candidate cached states were found
2020-05-15T01:14:50.683424+00:00  DEBUG    Task '<****>': can't use cache because no candidate Cached states were valid
..... repeat log 749 times...
2020-05-15T01:14:50.68382+00:00   DEBUG    Task '<****>': can't use cache because no candidate Cached states were valid
2020-05-15T01:14:50.803267+00:00  DEBUG    Task '<****>': Handling state change from Pending to Running
2020-05-15T01:14:51.055126+00:00  DEBUG    Task '<****>': Calling task.run() method...
2020-05-15T01:19:16.827716+00:00  INFO     Beginning Flow run for '<****>>-production'
2020-05-15T01:19:17.488745+00:00  INFO     Task '<****>': Starting task run...
2020-05-15T01:19:17.489122+00:00  DEBUG    Task '<****>': task is already running.
2020-05-15T01:19:17.497588+00:00  INFO     Task '<****>': finished task run for task with final state: 'Running'
The first bit of that logging traced to one k8s pod, ending suddenly at Calling task.run() method... and the second part traced to a different pod that seemingly did a second flow.run() on the same Flow and immediately started with a task that typically has upstream dependencies... so I think this is a clue, but I'm at a loss from here right now.
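For context on the "candidate cached states" lines in that log, this is the general shape of a task with output caching configured in Prefect Core; the task name and duration are illustrative, and whether it matches the actual flow's configuration is an assumption:
Copy code
import datetime

from prefect import task


@task(cache_for=datetime.timedelta(hours=24))
def cached_step():
    # Prior Cached states within the cache window become the "candidate
    # cached states" that Prefect checks for validity before re-running.
    ...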