# prefect-community
t
Hi again 😄 I just got the error `No heartbeat detected from the remote task; marking the run as failed` - this happened for tasks being run in parallel via the dask `KubeCluster`.
Looking at our (internal) logs, it seems like all of them died simultaneously, 20+ minutes into their run. There's a total of 58 tasks being run; with 4 workers, the first 4 tasks took approx. 1 hour to run and completed successfully, and then the next 4 that were run all seem to have failed simultaneously. Any idea / help?
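(For context, the setup being described is roughly the following - a minimal sketch, assuming Prefect 1.x with a `DaskExecutor` driving `dask_kubernetes.KubeCluster`; the flow/task names, image, and exact kwargs are illustrative, not taken from the actual flow:)
```python
from dask_kubernetes import make_pod_spec
from prefect import Flow, task
from prefect.executors import DaskExecutor


@task
def enrich_batch(batch):
    # placeholder for the long-running per-batch work
    return batch


with Flow("enrich-flow") as flow:
    batches = list(range(58))  # 58 mapped task runs, as in the run above
    enrich_batch.map(batches)

# Each flow run spins up an ephemeral Dask cluster on Kubernetes and
# distributes the mapped task runs across its workers.
flow.executor = DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "pod_template": make_pod_spec(image="daskdev/dask:latest"),
        "namespace": "prefect",
        "name": "tmpdask",  # a fixed name - relevant to the 409 error further down
        "n_workers": 4,     # matches the 4 workers mentioned above
    },
)
```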
In our internal logging we see:
```
"2022-07-13 12:23:45,664 - distributed._signals - INFO - Received signal SIGTERM (15)",
"2022-07-13 12:23:45,664 - distributed.nanny - INFO - Closing Nanny at '<tcp://x.y.z.w:38179>'.",
```
which doesn’t really tell us much
a
We are aware, and we currently don't have a solution for this in 1.0; we'll investigate better handling of it in 2.0 later this year. This post discusses it in more detail: https://discourse.prefect.io/t/flow-is-failing-with-an-error-message-no-heartbeat-detected-from-the-remote-task/79
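(For reference, the workaround that Discourse thread discusses boils down to switching Prefect 1.x heartbeats from a subprocess to a thread. A minimal sketch of setting that via the flow's run config - `KubernetesRun` and the `flow` object from the sketch above are assumptions about the setup, not taken from this thread:)
```python
from prefect.run_configs import KubernetesRun

# Prefect 1.x sends heartbeats from a child process by default; this env var
# switches them to a thread, which the linked thread suggests trying when
# heartbeats go missing.
flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
)
```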
t
But is there a way to tell if the problem is at the flow or the task level?
(we don’t see anything in our internal logging of the tasks, and they also [according to datadog] didn’t even come close to their memory limits)
a
task level
t
🤔 hmm.. ok, i'll try the `threads` thing. trying to restart the flow (with the restart button, i.e. with checkpointing) failed btw with the k8s error:
```
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services \"tmpdask\" already exists","reason":"AlreadyExists","details":{"name":"tmpdask","kind":"services"},"code":409}
```
so i guess we should give our temporary dask cluster a temporary name, though i'm not sure it will help? isn't a flow restart considered to have the same `flow_run_name`? but does the `threads` option affect how the FLOW sends the heartbeat, or how the tasks send heartbeats to the flow?
a
there are no flow level restarts in 1.0, but we have those in 2.0
t
wait, but i did do a flow restart just now? is that not considered a flow restart? i was on the flow run page and clicked `restart` at the top right. doesn't it run the flow again and use checkpointing to fetch the results of successfully completed tasks…?
a
so sorry - I meant retries; restarts do work in 1.0 - you're 100% right
t
oh, that's fine, i was talking about the fact that manually restarting the flow ended up failing because it tried to recreate the dask cluster with the same name (i guess?) while the previous one was still a zombie, which led to the failure - at least that's what i understand from the error. so the question is: does restarting a flow lead to a different `flow_run_name` or `flow_run_id` or something that i can use (from `prefect.context`) to make sure that the dask cluster has a unique name? obviously if i use `flow_run_name` it would be unique per flow run; the question is, would it also be unique across flow restarts? would these arguments:
```
namespace="prefect", name=f"tmp_dask_{prefect.context.flow_run_name}", env=my_env
```
be sufficient to guarantee the cluster has a unique name across restarts? or do i need to use `flow_run_id`? or something else?
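(Side note: both of those values are exposed on the run-time context; a minimal sketch of reading them - whether they actually change across a restart is settled a couple of messages below:)
```python
import prefect

# Both keys are populated for the duration of a flow run; as it turns out
# below, a restart re-uses the same flow run, so neither value changes.
run_name = prefect.context.get("flow_run_name")  # e.g. "adamant-mongoose"
run_id = prefect.context.get("flow_run_id")      # UUID of the flow run
```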
i am trying to `restart` it because if it failed because of tasks (as you said), then i don't want to start the whole flow from the beginning, just re-do the failed tasks…
well - i tried it - apparently it doesn't change on restarts, so i had to use some random set of letters to make sure it's unique. btw, i reduced the task size from 1000 CSV rows to 100 CSV rows, and - i don't wanna jinx it but - so far it seems to be doing OK…
🙌 1
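(A minimal sketch of that random-suffix workaround, assuming the kwargs end up on `KubeCluster` the same way as in the earlier snippet; the `env` argument is left out here:)
```python
import uuid

# Re-evaluated each time the flow script runs, so every run (and restart)
# gets a fresh value, unlike flow_run_name / flow_run_id which stay the same
# across restarts.
suffix = uuid.uuid4().hex[:8]

cluster_kwargs = {
    "namespace": "prefect",
    # Kubernetes object names only allow lowercase alphanumerics and "-".
    "name": f"tmpdask-{suffix}",
}
# e.g. DaskExecutor(cluster_class="dask_kubernetes.KubeCluster",
#                   cluster_kwargs=cluster_kwargs)
```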
and btw, i'm not really sure it fails at the task level at all, because all 4 workers fail at exactly the same time, so it seems like a central failure to me.
a
thanks for sharing your solution
t
@Anna Geller i ran into this during the execution (which is still ongoing):
```
Task 'wrapped_enrich_shell_task[55]': Finished task run for task with final state: 'Running'
```
does that make any sense? how can `Running` be a final state?
actually the full log for it is this. (not including here the previous 23 minutes of log where it was actually doing work)
found this: https://github.com/PrefectHQ/prefect/issues/5485 which might be related… no idea
👍 1