    Dan DiPasquo

    2 years ago
    Trying to understand why some flows die due to the Zombie Killer when other, similar ones don't. Sometimes I see tasks succeed after 5 minutes or more without being killed, while others are killed after 3 minutes or so. With the Zombie Killer enabled, does every @task-wrapped function need to complete in < 2 minutes or risk being terminated? For example, if a task includes a subprocess.run() that does some long-running work, is the heartbeat blocked until subprocess.run() completes?
    Dylan

    2 years ago
    Hi @Dan DiPasquo! Heartbeats are sent by Prefect Core every 30 seconds and are used to confirm the flow run and its associated task runs are healthy; runs missing four heartbeats in a row are marked as Failed by the Zombie Killer. You can read more about heartbeats here: https://docs.prefect.io/orchestration/concepts/flows.html#flow-settings You can turn off heartbeats on the settings page for your flow.
    I’m not sure about subprocess.run(); let me check with my team and get back to you.
    @Dan DiPasquo Heartbeats run in their own subprocess and are not blocked by calls to subprocess.run()
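    Dylan's point can be illustrated with a stdlib-only sketch (this is not Prefect's actual implementation; the function names and the shortened interval are made up for the demo): because the heartbeat loop lives in its own child process, a blocking subprocess.run() in the task body doesn't stop it.

    ```python
    import multiprocessing
    import subprocess
    import sys
    import time

    def _heartbeat_loop(counter, interval=0.1):
        # Prefect's real interval is 30s; shortened here so the demo runs fast.
        while True:
            with counter.get_lock():
                counter.value += 1
            time.sleep(interval)

    def run_task_with_heartbeat():
        # Shared counter so the parent can see how many heartbeats were sent.
        counter = multiprocessing.Value("i", 0)
        hb = multiprocessing.Process(target=_heartbeat_loop, args=(counter,), daemon=True)
        hb.start()
        # Long-running, blocking work in the task body -- this blocks the
        # current process, but not the separate heartbeat process.
        subprocess.run([sys.executable, "-c", "import time; time.sleep(1)"], check=True)
        hb.terminate()
        hb.join()
        return counter.value
    ```

    Running run_task_with_heartbeat() returns a positive count: heartbeats kept flowing for the full second that subprocess.run() was blocking.
    
    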

    Dan DiPasquo

    2 years ago
    Hi @Dylan, thank you. Yes, I read that section of the docs; I suppose my question is, then, what sort of things DO prevent Core from sending a keep-alive heartbeat? That might narrow down my search for the underlying cause of my flows being killed.
    Dylan

    2 years ago
    We once experienced some problems with heartbeats as they relate to version locking and Dask
    If you turn them off, you can set timeouts on your tasks so they don’t run forever
    But most often, tasks die because they run out of memory or other resources on the node they’re running on
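    For reference, Prefect 0.x's @task decorator accepts a timeout in seconds. A minimal stdlib sketch of the same idea (hypothetical helper and exception names, not Prefect code) runs the work in a child process and terminates it when the deadline passes:

    ```python
    import multiprocessing
    import time

    class TaskTimeout(Exception):
        """Raised when the wrapped work exceeds its deadline (hypothetical name)."""

    def run_with_timeout(fn, timeout, *args):
        # Run fn(*args) in a child process; kill it if it exceeds `timeout` seconds.
        proc = multiprocessing.Process(target=fn, args=args)
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            proc.terminate()
            proc.join()
            raise TaskTimeout(f"task exceeded {timeout}s and was terminated")

    def quick_work():
        time.sleep(0.1)  # finishes well within any reasonable deadline

    def endless_work():
        while True:      # never finishes on its own
            time.sleep(1)
    ```

    With heartbeats off, something like run_with_timeout(endless_work, 60) ensures a stuck task still fails instead of running forever.
    
    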

    Dan DiPasquo

    2 years ago
    I have found some log errors concurrent with tasks that have been killed by Zombie Killer, e.g.
    Failed to set task state with error: ClientError([{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'errors': [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [], 'path': ['set_task_run_states']}]}}}])
    2020-05-14T22:31:52.702149+00:00  ERROR    Heartbeat failed for Task 'delete_local_file'
                                               Traceback (most recent call last):
                                                 File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 64, in _heartbeat
                                                   self.client.update_task_run_heartbeat(task_run_id)
                                                 File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 910, in update_task_run_heartbeat
                                                   self.graphql(mutation, raise_on_error=True)
                                                 File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 228, in graphql
                                                   raise ClientError(result["errors"])
                                               prefect.utilities.exceptions.ClientError: [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [{'line': 2, 'column': 5}], 'path': ['update_task_run_heartbeat'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'errors': [{'message': 'request to <http://graphql:443/graphql/alpha/> failed, reason: connect ECONNREFUSED 10.30.42.241:443', 'locations': [], 'path': ['update_task_run_heartbeat']}]}}}]
    Chris White

    2 years ago
    Hey Dan, at first glance that looks like a networking issue on our side 🧐

    Dan DiPasquo

    2 years ago
    These are Flows that seem to be successfully sending status changes both before and after (by wall-clock/log time) these errors occur in the logs.
    Chris White

    2 years ago
    is this consistently happening?

    Dan DiPasquo

    2 years ago
    I'd say it's regular, but not consistent. As we've recently scaled up, ~10% of our flow runs end with the Zombie Killer
    Chris White

    2 years ago
    Hmmmm that sounds suspicious — could you describe the runtime environment for these flows? K8s / Dask / Docker / Local / etc.?

    Dan DiPasquo

    2 years ago
    Rerun of same flows usually succeed.
    k8s via GKE
    It's usually one of several longer-running tasks (which makes sense: more heartbeats), but it's also been observed in a task that should be very short-lived.
    There's a good run of these ClientErrors in logs @ 2020-05-14T22:31:52.485472+00:00
    give or take 100ms
    🙂
    Chris White

    2 years ago
    Yea — those client errors appear to correlate with some increased error rates on our side; apologies for that, we’re looking into what might have triggered it. In the meantime, it sounds like you have a zombie issue regardless — I recommend turning off heartbeats for now. If you’re able to share the docker image you’re running on, that might be useful here — my suspicion is that there’s some resource limit we’re hitting with respect to the number of subprocesses that can run smoothly

    Dan DiPasquo

    2 years ago
    Yes, I don't find these ClientErrors in the logs of every flow killed by the Zombie Killer, so there may be more than one thing going on.
    I don't know about sharing the docker image; I will check with colleagues. I think it would be very hard for you to run the flow, as it reaches into GCS for specific resources, etc. In the meantime I may disable the Zombie Killer as suggested and see if it illuminates anything.
    Thanks for your help!
    Chris White

    2 years ago
    Anytime! Keep us in the loop if you see anything else that seems relevant here; we actually have a deep dive scheduled on Monday with another user who sounds like they’re running into a similar issue, so we can report back if we learn anything from that

    Dan DiPasquo

    2 years ago
    Thanks, if I figure anything out or even find any more clues I'll let you know.
    By way of another clue: I ran another set of similar Flow runs, this time with the Zombie Killer disabled. 96 out of 98 succeeded, with 2 running indefinitely. When I dug into those two, I found that a task that is a typical victim of the Zombie Killer (when it's enabled) never completed successfully this time, but never Failed either. In fact, the logs show that task.run() was called a second time (not seen in the 96 successful flow runs), and bizarrely, the second run took place in a different k8s pod! Log:
    2020-05-15T01:14:47.611286+00:00  INFO     Task '<****>': Starting task run...
    2020-05-15T01:14:50.682754+00:00  DEBUG    Task '<****>': 751 candidate cached states were found
    2020-05-15T01:14:50.683424+00:00  DEBUG    Task '<****>': can't use cache because no candidate Cached states were valid
    ..... repeat log 749 times...
    2020-05-15T01:14:50.68382+00:00   DEBUG    Task '<****>': can't use cache because no candidate Cached states were valid
    2020-05-15T01:14:50.803267+00:00  DEBUG    Task '<****>': Handling state change from Pending to Running
    2020-05-15T01:14:51.055126+00:00  DEBUG    Task '<****>': Calling task.run() method...
    2020-05-15T01:19:16.827716+00:00  INFO     Beginning Flow run for '<****>>-production'
    2020-05-15T01:19:17.488745+00:00  INFO     Task '<****>': Starting task run...
    2020-05-15T01:19:17.489122+00:00  DEBUG    Task '<****>': task is already running.
    2020-05-15T01:19:17.497588+00:00  INFO     Task '<****>': finished task run for task with final state: 'Running'
    I traced the first part of that logging to one k8s pod, ending suddenly at Calling task.run() method..., and the second part to a different pod that seemingly did a second flow.run() on the same Flow and immediately started with a task that typically has upstream dependencies... So I think this is a clue, but I'm at a loss from here right now.