# prefect-community
Hi again 😄 I just got the error
`No heartbeat detected from the remote task; marking the run as failed`
- this happened for tasks being run in parallel via dask.
looking at our (internal) logs, it seems like all of them died simultaneously, 20+ minutes into their run. there’s a total of 58 tasks being run, with 4 workers. the first 4 tasks took approx. 1 hour to run and completed successfully, and then the next 4 that were run all seem to have failed simultaneously. any idea / help?
in our internal logging we see:
"2022-07-13 12:23:45,664 - distributed._signals - INFO - Received signal SIGTERM (15)",
"2022-07-13 12:23:45,664 - distributed.nanny - INFO - Closing Nanny at '<tcp://x.y.z.w:38179>'.",
which doesn’t really tell us much
We are aware, and we currently don't have a solution for this in 1.0. We'll investigate better handling of that in 2.0 later this year. This post discusses it more: https://discourse.prefect.io/t/flow-is-failing-with-an-error-message-no-heartbeat-detected-from-the-remote-task/79
But is there a way to tell if the problem is at the flow or the task level?
(we don’t see anything in our internal logging of the tasks, and they also [according to datadog] didn’t even come close to their memory limits)
task level
🤔 hmm.. ok i’ll try the
thing. trying to restart the flow (with the restart button, i.e. with checkpointing) failed btw with the k8s error:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services \"tmpdask\" already exists","reason":"AlreadyExists","details":{"name":"tmpdask","kind":"services"},"code":409}
so i guess we should give our temporary dask cluster a temporary name, though i’m not sure it will help? isn’t a flow restart considered to have the same
? but does the
option affect how the FLOW sends the heartbeat, or how the tasks send heartbeats to the flow?
there are no flow level restarts in 1.0, but we have those in 2.0
wait, but i did do a flow restart just now? is that not considered a flow restart? i stood on the flow run and clicked the restart button at the top right. doesn’t it run the flow again and use checkpointing to fetch results of successfully completed tasks…?
so sorry - I meant retries. restarts work in 1.0 - you're 100% right
oh, that’s fine. i was talking about the fact that manually restarting the flow ended up failing because it tried to recreate the dask cluster with the same name (i guess?) while the previous one was still a zombie, which led to the failure. at least that’s what i understand from the error.
so the question is: does restarting a flow lead to a different flow run name, flow run id, or something else that i can use (from `prefect.context`) to make sure that the dask cluster has a unique name? obviously if i use
it would be unique per flow run, the question is - would it also be unique across flow restarts? would these arguments:
namespace="prefect", name=f"tmp_dask_{prefect.context.flow_run_name}", env=my_env
be sufficient to guarantee the cluster has a unique name across restarts? or do i need to use
? or something else?
i am trying to restart it because if it failed because of tasks (as you said) then i don’t want to start the whole flow from the beginning, just re-do the failed tasks…
well - i tried it - apparently it doesn’t change on restarts, so i had to use some random set of letters to make sure it’s unique. btw, i reduced the task size from 1000 CSV rows to 100 CSV rows and - i don’t wanna jinx it but - so far it seems to be doing OK….
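For anyone landing here with the same `AlreadyExists` error, here is a minimal sketch of the "random set of letters" workaround. The helper name `unique_cluster_name`, the default `tmp-dask` prefix and the suffix length are all made up for illustration; the only thing taken from the thread is the idea of appending a random suffix, because the flow run name did not change on restart. It also swaps underscores for hyphens, since Kubernetes service names only allow lowercase alphanumerics and `-` and are capped at 63 characters.

```python
import re
import uuid


def unique_cluster_name(prefix: str = "tmp-dask", suffix_len: int = 8) -> str:
    """Build a cluster name that differs on every call (hypothetical helper)."""
    # Normalise the prefix: lowercase, and replace anything that is not a
    # lowercase letter, digit or '-' (e.g. underscores) with '-'.
    base = re.sub(r"[^a-z0-9-]+", "-", prefix.lower()).strip("-")
    # Leave room for '-' plus the suffix inside the 63-character limit.
    base = base[: 63 - suffix_len - 1].rstrip("-")
    # uuid4 is random per call, so a restarted run gets a fresh suffix
    # even when the flow run name stays the same.
    suffix = uuid.uuid4().hex[:suffix_len]
    return f"{base}-{suffix}"


if __name__ == "__main__":
    print(unique_cluster_name())
```

The result would go wherever the cluster name is currently set, for example `name=unique_cluster_name()` in place of the f-string built from `prefect.context.flow_run_name`. This assumes the prefix starts with a letter, and it does nothing to clean up the zombie cluster left behind by the previous run.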
and btw, i’m not really sure it fails at the task level at all, because all 4 workers fail at exactly the same time, so it seems like a central failure to me.
thanks for sharing your solution
@Anna Geller i ran into this during the execution (which is still ongoing):
Task 'wrapped_enrich_shell_task[55]': Finished task run for task with final state: 'Running'
does that make any sense? how can `Running` be a final state?
actually the full log for it is this. (not including here the previous 23 minutes of log where it was actually doing work)
found this: https://github.com/PrefectHQ/prefect/issues/5485 which might be related… no idea