What might be causing Heartbeat failures from Prefect / Dask?
# ask-community
h
What might be causing Heartbeat failures from Prefect / Dask? How would one go about debugging this?
a
For question #1, likely causes are:
1. Long-running jobs
2. OOM errors

For question #2:
1. Inspect it on the execution layer, e.g. on the Dask side
2. Increase the amount of memory on the scheduler
3. Switching the flow's heartbeat mode from processes to threads often helps solve such issues:
```python
from prefect.run_configs import UniversalRun

flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
```
h
We have a lot of long running jobs. Will these automatically cause heartbeat failures?
a
No, they should not, but they may. It depends on many factors (your infrastructure, resource allocation, network reliability, heartbeat mode, what the tasks are doing), so it's really hard to say. In my experience, as long as the scheduler and workers had enough memory and the heartbeat mode was set to thread, I didn't have any flow heartbeat problems, but I have seen such issues among community users, especially with long-running jobs on Kubernetes.
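On Kubernetes specifically, you can set both the heartbeat mode and explicit resources on the flow's run config. A rough sketch, assuming Prefect 1.x run configs; the CPU/memory values are placeholders you'd tune for your workload:

```python
from prefect.run_configs import KubernetesRun

# Heartbeat as a thread plus explicit resources for the flow-run job;
# the CPU/memory numbers below are illustrative, not a recommendation.
flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
    cpu_request="1",
    memory_request="4Gi",
    memory_limit="8Gi",
)
```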
h
We try to use as much of the 16 CPUs as possible, but I don't think they're anywhere near 100% utilized, so that shouldn't be what's stopping the heartbeats. I'll try thread mode then.
```
dask-worker-5656bd65f-spvhn dask-worker distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 7.92 GiB -- Worker memory limit: 11.18 GiB
```
and this is being logged a lot in the cluster too
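For what it's worth, the page linked in that warning suggests manually trimming unmanaged memory on the workers. A minimal sketch, assuming a glibc-based Linux image and a client connected to the cluster (the scheduler address is a placeholder):

```python
import ctypes

from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")  # placeholder address


def trim_memory() -> int:
    # Ask glibc to hand freed-but-unreleased memory back to the OS
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


# Run the trim on every worker in the cluster
client.run(trim_memory)
```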
You're right about the scheduler; it has these logs:
```
distributed.core - ERROR - Exception while handling op gather
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/distributed/core.py", line 498, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.9/site-packages/distributed/scheduler.py", line 5759, in gather
    data, missing_keys, missing_workers = await gather_from_workers(
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils_comm.py", line 88, in gather_from_workers
    response.update(r["data"])
KeyError: 'data'
```
@Anna Geller Like this for the heartbeats?
As for the network: I hope it's possible to make this resilient to failure, because when we actually start scaling this it will be 10x-1000x the current load, and then we can't have network blips or machine restarts causing problems.
The tasks are doing CPU work only (right now) but we'll be using GPUs/TPUs shortly.
As for the memory: I don't see any reason why the memory wouldn't be released back. If manually forcing a GC/release is part of the recommended ops guidelines, why isn't that just run by default? 🙂
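For reference, the manual version of that forced release is tiny; a sketch, assuming `client` is an existing dask.distributed.Client:

```python
import gc

# Force a garbage-collection pass on every worker as a stopgap;
# `client` is assumed to be an already-connected dask.distributed.Client.
client.run(gc.collect)
```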
a
There was a PR open about memory allocation in Dask; I'll share it when I find it. Feel free to dive deeper and submit a PR if you see potential for improving the memory allocation in the Dask interface.
h
I guess that's not actually recommended, but I don't think we have many leaks, since we're using pandas.
a
k
This is a hard issue. Someone else hit the same thing, but there's nothing conclusive around it. I would maybe try bumping up your Dask version and hope that helps; 2021.06 and above has a bunch of enhancements.
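If you do upgrade, it's also worth confirming that client, scheduler, and workers all agree on versions, since mismatches cause odd failures; a sketch (the scheduler address is a placeholder):

```python
from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")  # placeholder address

# Compares package versions across client, scheduler, and workers;
# check=True raises if required packages disagree.
versions = client.get_versions(check=True)
print(versions["scheduler"])
```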
h
k
You making a PR for it on distributed?
h
😕 hmm
It's a really complex piece of software. For instance: what invariants do these functions have? In FLP-result style: what are their timeouts?
I could possibly make it return on status=busy, but that would affect timing on the caller.
It's returning a dict when it should be returning an Enum (unless they check for both whenever this is called)
I would have gone with a discriminated union-style object, but maybe this is more Pythonic.
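To illustrate what I mean by a discriminated union-style object instead of the raw dict (a toy sketch, not the actual distributed API):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union


class GatherStatus(Enum):
    OK = auto()
    BUSY = auto()


@dataclass
class GatherOk:
    data: dict
    status: GatherStatus = GatherStatus.OK


@dataclass
class GatherBusy:
    retry_after: float
    status: GatherStatus = GatherStatus.BUSY


# Callers branch on the status tag (or the type) instead of probing the
# response for a "data" key, which is what raised the KeyError above.
GatherResult = Union[GatherOk, GatherBusy]
```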
But the logic here means they'll get increasingly slow as the uptime increases: https://github.com/dask/distributed/blob/60cb52f3d0e6ab68e0acf1e3f26a670376c001a2/distributed/worker.py#L2998
self.repetitively_busy is never reset when communication succeeds.
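The pattern I'd expect is a backoff counter that grows while the peer keeps reporting busy and resets as soon as a call succeeds; roughly like this (a sketch of the idea, not the actual worker code):

```python
import asyncio


class BusyBackoff:
    """Sleep longer while a peer keeps answering 'busy'; reset once a call succeeds."""

    def __init__(self, base_delay: float = 0.05, max_delay: float = 2.0) -> None:
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.repetitively_busy = 0

    async def on_busy(self) -> None:
        # Exponentially longer waits the more consecutive 'busy' replies we see
        self.repetitively_busy += 1
        delay = min(self.base_delay * 2 ** self.repetitively_busy, self.max_delay)
        await asyncio.sleep(delay)

    def on_success(self) -> None:
        # Without this reset, the wait only ever grows over the worker's uptime
        self.repetitively_busy = 0
```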
Maybe this is a simple way to solve it https://github.com/dask/distributed/pull/5546