# prefect-community
t
Hi again 😄 I just got the error `No heartbeat detected from the remote task; marking the run as failed` - this happened for tasks being run in parallel via the dask `KubeCluster`.
Looking at our (internal) logs, it seems like all of them died simultaneously, 20+ minutes into their run. There's a total of 58 tasks being run; with 4 workers, the first 4 tasks took approx. 1 hour to run and completed successfully, and then the next 4 that were run all seem to have failed simultaneously. Any idea / help?
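(For context, the setup being described is roughly the following - a minimal sketch, assuming Prefect 1.x with a `DaskExecutor` driving `dask_kubernetes.KubeCluster`; the flow/task names, image, and exact kwargs are illustrative, not taken from the actual flow:)
```python
from dask_kubernetes import make_pod_spec
from prefect import Flow, task
from prefect.executors import DaskExecutor


@task
def enrich_batch(batch):
    # placeholder for the long-running per-batch work
    return batch


with Flow("enrich-flow") as flow:
    batches = list(range(58))  # 58 mapped task runs, as in the run above
    enrich_batch.map(batches)

# Each flow run spins up an ephemeral Dask cluster on Kubernetes and
# distributes the mapped task runs across its workers.
flow.executor = DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "pod_template": make_pod_spec(image="daskdev/dask:latest"),
        "namespace": "prefect",
        "name": "tmpdask",  # a fixed name - relevant to the 409 error further down
        "n_workers": 4,     # matches the 4 workers mentioned above
    },
)
```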
In our internal logging we see:
```
"2022-07-13 12:23:45,664 - distributed._signals - INFO - Received signal SIGTERM (15)",
"2022-07-13 12:23:45,664 - distributed.nanny - INFO - Closing Nanny at '<tcp://x.y.z.w:38179>'.",
```
which doesn’t really tell us much
a
We are aware, and we currently don't have a solution for this in 1.0; we'll investigate better handling of it in 2.0 later this year. This post discusses it in more detail: https://discourse.prefect.io/t/flow-is-failing-with-an-error-message-no-heartbeat-detected-from-the-remote-task/79
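(For reference, the workaround that Discourse thread discusses boils down to switching Prefect 1.x heartbeats from a subprocess to a thread. A minimal sketch of setting that via the flow's run config - `KubernetesRun` and the `flow` object from the sketch above are assumptions about the setup, not taken from this thread:)
```python
from prefect.run_configs import KubernetesRun

# Prefect 1.x sends heartbeats from a child process by default; this env var
# switches them to a thread, which the linked thread suggests trying when
# heartbeats go missing.
flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
)
```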
t
But is there a way to tell if the problem is at the flow or the task level?
(we don’t see anything in our internal logging of the tasks, and they also [according to datadog] didn’t even come close to their memory limits)
a
task level
t
🤔 hmm.. ok, i'll try the `threads` thing. trying to restart the flow (with the restart button, i.e. with checkpointing) failed btw with the k8s error:
```
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services \"tmpdask\" already exists","reason":"AlreadyExists","details":{"name":"tmpdask","kind":"services"},"code":409}
```
so i guess we should give our temporary dask cluster a temporary name, though i'm not sure it will help? isn't a flow restart considered to have the same `flow_run_name`? but does the `threads` option affect how the FLOW sends the heartbeat, or how the tasks send heartbeats to the flow?
a
there are no flow level restarts in 1.0, but we have those in 2.0
t
wait, but i did do a flow restart just now? is that not considered a flow restart? i was on the flow run page and clicked `restart` at the top right. doesn't it run the flow again and use checkpointing to fetch the results of successfully completed tasks…?
a
so sorry - I meant retries; restarts do work in 1.0 - you're 100% right
t
oh, that's fine, i was talking about the fact that manually restarting the flow ended up failing because it tried to recreate the dask cluster with the same name (i guess?) while the previous one was still a zombie, which led to the failure - at least that's what i understand from the error. so the question is: does restarting a flow lead to a different `flow_run_name` or `flow_run_id` or something that i can use (from `prefect.context`) to make sure that the dask cluster has a unique name? obviously if i use `flow_run_name` it would be unique per flow run; the question is, would it also be unique across flow restarts? would these arguments:
```
namespace="prefect", name=f"tmp_dask_{prefect.context.flow_run_name}", env=my_env
```
be sufficient to guarantee the cluster has a unique name across restarts? or do i need to use `flow_run_id`? or something else?
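(Side note: both of those values are exposed on the run-time context; a minimal sketch of reading them - whether they actually change across a restart is settled a couple of messages below:)
```python
import prefect

# Both keys are populated for the duration of a flow run; as it turns out
# below, a restart re-uses the same flow run, so neither value changes.
run_name = prefect.context.get("flow_run_name")  # e.g. "adamant-mongoose"
run_id = prefect.context.get("flow_run_id")      # UUID of the flow run
```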
i am trying to `restart` it because if it failed because of tasks (as you said), then i don't want to start the whole flow from the beginning, just re-do the failed tasks…
well - i tried it - apparently it doesn't change on restarts, so i had to use some random set of letters to make sure it's unique. btw, i reduced the task size from 1000 CSV rows to 100 CSV rows, and - i don't wanna jinx it but - so far it seems to be doing OK…
🙌 1
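(A minimal sketch of that random-suffix workaround, assuming the kwargs end up on `KubeCluster` the same way as in the earlier snippet; the `env` argument is left out here:)
```python
import uuid

# Re-evaluated each time the flow script runs, so every run (and restart)
# gets a fresh value, unlike flow_run_name / flow_run_id which stay the same
# across restarts.
suffix = uuid.uuid4().hex[:8]

cluster_kwargs = {
    "namespace": "prefect",
    # Kubernetes object names only allow lowercase alphanumerics and "-".
    "name": f"tmpdask-{suffix}",
}
# e.g. DaskExecutor(cluster_class="dask_kubernetes.KubeCluster",
#                   cluster_kwargs=cluster_kwargs)
```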
and btw, i'm not really sure it fails at the task level at all, because all 4 workers fail at exactly the same time, so it seems like a central failure to me.
a
thanks for sharing your solution
t
@Anna Geller i ran into this during the execution (which is still ongoing):
```
Task 'wrapped_enrich_shell_task[55]': Finished task run for task with final state: 'Running'
```
does that make any sense? how can `Running` be a final state?
actually the full log for it is this. (not including here the previous 23 minutes of log where it was actually doing work)
found this: https://github.com/PrefectHQ/prefect/issues/5485 which might be related… no idea
👍 1