What might be causing Heartbeat failures from Prefect / Dask?
# ask-community
h
What might be causing Heartbeat failures from Prefect / Dask? How would one go about debugging this?
a
For question #1, likely causes are:
1. Long-running jobs
2. OOM errors

For question #2:
1. Inspect it on the execution layer, e.g. on the Dask side
2. Increase the amount of memory on the scheduler
3. Switching the flow's heartbeat mode from processes to threads often helps solve such issues:
```python
from prefect.run_configs import UniversalRun

flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
```
h
We have a lot of long running jobs. Will these automatically cause heartbeat failures?
a
No, they should not, but they may. It depends on many factors (your infrastructure, resource allocation, network reliability, heartbeat mode, what the tasks are doing), so it's really hard to say. In my experience, as long as the scheduler and workers had enough memory and the heartbeat mode was set to thread, I didn't have any flow heartbeat problems, but I have seen such issues among community users, especially with long-running jobs on Kubernetes.
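On Kubernetes specifically, you can set both the heartbeat mode and explicit resources on the flow's run config. A rough sketch, assuming Prefect 1.x run configs; the CPU/memory values are placeholders you'd tune for your workload:

```python
from prefect.run_configs import KubernetesRun

# Heartbeat as a thread plus explicit resources for the flow-run job;
# the CPU/memory numbers below are illustrative, not a recommendation.
flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
    cpu_request="1",
    memory_request="4Gi",
    memory_limit="8Gi",
)
```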
h
We try to use as much of the 16 CPUs as possible, but I don't think they're anywhere near 100% utilized, so that shouldn't be what's stopping the heartbeats. I'll try thread mode then.
```
dask-worker-5656bd65f-spvhn dask-worker distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 7.92 GiB -- Worker memory limit: 11.18 GiB
```
and this is being logged a lot in the cluster too
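For what it's worth, the page linked in that warning suggests manually trimming unmanaged memory on the workers. A minimal sketch, assuming a glibc-based Linux image and a client connected to the cluster (the scheduler address is a placeholder):

```python
import ctypes

from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")  # placeholder address


def trim_memory() -> int:
    # Ask glibc to hand freed-but-unreleased memory back to the OS
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


# Run the trim on every worker in the cluster
client.run(trim_memory)
```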
You're right about the scheduler; it has these logs:
```
distributed.core - ERROR - Exception while handling op gather
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/distributed/core.py", line 498, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.9/site-packages/distributed/scheduler.py", line 5759, in gather
    data, missing_keys, missing_workers = await gather_from_workers(
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils_comm.py", line 88, in gather_from_workers
    response.update(r["data"])
KeyError: 'data'
```
@Anna Geller Like this for the heartbeats?
As for the network: I hope it's possible to make this resilient to failure, because when we actually start scaling this it will be 10x-1000x the current load, and then we can't have network blips or machine restarts causing problems.
The tasks are doing CPU work only (right now) but we'll be using GPUs/TPUs shortly.
As for the memory: I don't see any reason why the memory wouldn't be released back. If manually forcing a GC/release is part of the recommended ops guidelines, why isn't that just run by default? 🙂
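For reference, the manual version of that forced release is tiny; a sketch, assuming `client` is an existing dask.distributed.Client:

```python
import gc

# Force a garbage-collection pass on every worker as a stopgap;
# `client` is assumed to be an already-connected dask.distributed.Client.
client.run(gc.collect)
```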
a
There was a PR open about memory allocation in Dask; I'll share it when I find it. Feel free to dive deeper and submit a PR if you see potential for improving the memory allocation in the Dask interface.
h
I guess that's not actually recommended, but I don't think we have many leaks, since we're using pandas.
a
k
This is a hard issue. Someone else hit the same thing, but there's nothing conclusive around it. I would maybe try bumping up your Dask version and hope that helps; 2021.06 and above has a bunch of enhancements.
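If you do upgrade, it's also worth confirming that client, scheduler, and workers all agree on versions, since mismatches cause odd failures; a sketch (the scheduler address is a placeholder):

```python
from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")  # placeholder address

# Compares package versions across client, scheduler, and workers;
# check=True raises if required packages disagree.
versions = client.get_versions(check=True)
print(versions["scheduler"])
```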
h
k
You making a PR for it on distributed?
h
😕 hmm
It's a really complex piece of software. For instance: what invariants do these functions have? In FLP-result style: what are their timeouts?
I could possibly make it return on status=busy, but that would affect timing on the caller.
It's returning a dict when it should be returning an Enum (unless they check for both whenever this is called)
I would have gone with a discriminated union-style object, but maybe this is more Pythonic.
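To illustrate what I mean by a discriminated union-style object instead of the raw dict (a toy sketch, not the actual distributed API):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union


class GatherStatus(Enum):
    OK = auto()
    BUSY = auto()


@dataclass
class GatherOk:
    data: dict
    status: GatherStatus = GatherStatus.OK


@dataclass
class GatherBusy:
    retry_after: float
    status: GatherStatus = GatherStatus.BUSY


# Callers branch on the status tag (or the type) instead of probing the
# response for a "data" key, which is what raised the KeyError above.
GatherResult = Union[GatherOk, GatherBusy]
```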
But the logic here means they'll get increasingly slow as the uptime increases: https://github.com/dask/distributed/blob/60cb52f3d0e6ab68e0acf1e3f26a670376c001a2/distributed/worker.py#L2998
self.repetitively_busy is never reset when communication succeeds.
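The pattern I'd expect is a backoff counter that grows while the peer keeps reporting busy and resets as soon as a call succeeds; roughly like this (a sketch of the idea, not the actual worker code):

```python
import asyncio


class BusyBackoff:
    """Sleep longer while a peer keeps answering 'busy'; reset once a call succeeds."""

    def __init__(self, base_delay: float = 0.05, max_delay: float = 2.0) -> None:
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.repetitively_busy = 0

    async def on_busy(self) -> None:
        # Exponentially longer waits the more consecutive 'busy' replies we see
        self.repetitively_busy += 1
        delay = min(self.base_delay * 2 ** self.repetitively_busy, self.max_delay)
        await asyncio.sleep(delay)

    def on_success(self) -> None:
        # Without this reset, the wait only ever grows over the worker's uptime
        self.repetitively_busy = 0
```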
Maybe this is a simple way to solve it https://github.com/dask/distributed/pull/5546