Philipp Eisen

01/20/2022, 2:32 PM
Hey, I’m running Prefect with a Kubernetes agent and a temporary Dask cluster, and I’m quite frequently getting this error:
No heartbeat detected from the remote task; marking the run as failed.
Are there any obvious things to look for?

Kevin Kho

01/20/2022, 2:37 PM
Hi @Philipp Eisen, I think this will help
And also this. Some users found thread heartbeats to be more stable.
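A minimal sketch of switching to thread heartbeats, assuming Prefect 1.x, where the heartbeat mode is read from the PREFECT__CLOUD__HEARTBEAT_MODE environment variable set on the flow's run config. The helper name heartbeat_env is hypothetical, and the list of accepted modes is an assumption based on the Prefect 1.x docs, so check it against your version:

```python
# Hypothetical helper (not part of Prefect): builds the environment
# variables that select a heartbeat mode for a flow run.
# Assumption: Prefect 1.x accepts "process" (default), "thread", and "off".
VALID_MODES = {"process", "thread", "off"}


def heartbeat_env(mode: str = "thread") -> dict:
    """Return the env dict that requests the given heartbeat mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown heartbeat mode: {mode!r}")
    return {"PREFECT__CLOUD__HEARTBEAT_MODE": mode}


env = heartbeat_env("thread")

# Assumed usage with a Prefect 1.x flow (not executed in this sketch):
#   from prefect.run_configs import KubernetesRun
#   flow.run_config = KubernetesRun(env=env)
```

The env dict is attached to the run config, so the setting travels with the flow run rather than with the agent.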

Philipp Eisen

01/20/2022, 2:44 PM
But those docs are referring to flow heartbeats, right? I think I’m having issues with task heartbeats. The errors I see in my logs are:
Unexpected error: KilledWorker('xxx-148f1de6c7e840498044bee6c2534264', <WorkerState 'tcp://10.48.39.117:32963', name: 21, status: closed, memory: 0, processing: 27>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/flow_runner.py", line 542, in get_flow_run_state
    upstream_states = executor.wait(
  File "/usr/local/lib/python3.8/site-packages/prefect/executors/dask.py", line 440, in wait
    return self.client.gather(futures)
  File "/usr/local/lib/python3.8/site-packages/distributed/client.py", line 1946, in gather
    return self.sync(
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 310, in sync
    return sync(
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 364, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 349, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/distributed/client.py", line 1811, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ('xxx-148f1de6c7e840498044bee6c2534264', <WorkerState 'tcp://10.48.39.117:32963', name: 21, status: closed, memory: 0, processing: 27>)
And then
No heartbeat detected from the remote task; marking the run as failed.

Kevin Kho

01/20/2022, 2:45 PM
The flow-level heartbeat setting is applied to each task, so this should apply.

Philipp Eisen

01/20/2022, 2:56 PM
Thanks!

Kevin Kho

01/20/2022, 3:44 PM
Also, the post right below this shows some situations where this happens.