Tom Klein
07/13/2022, 12:40 PM
No heartbeat detected from the remote task; marking the run as failed
- this happened for tasks being run in parallel via the dask KubeCluster
looking at our (internal) logs - it seems like all of them died simultaneously, 20+ minutes into their run
there’s a total of 58 tasks being run, with 4 workers
the first 4 tasks took approx. 1 hour to run and completed successfully, and then the next 4 that were run all seem to have failed simultaneously
any idea / help?
"2022-07-13 12:23:45,664 - distributed._signals - INFO - Received signal SIGTERM (15)",
"2022-07-13 12:23:45,664 - distributed.nanny - INFO - Closing Nanny at '<tcp://x.y.z.w:38179>'.",
which doesn’t really tell us much
Anna Geller
07/13/2022, 12:42 PM
Tom Klein
07/13/2022, 12:48 PM
Anna Geller
07/13/2022, 12:55 PM
Tom Klein
07/13/2022, 12:57 PM
threads thing.
trying to restart the flow (with the restart button, i.e. with checkpointing) failed btw with the k8s error:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services \"tmpdask\" already exists","reason":"AlreadyExists","details":{"name":"tmpdask","kind":"services"},"code":409}
so i guess we should give our temporary dask cluster a unique name, though i’m not sure it will help? isn’t a flow restart considered to have the same flow_run_name?
but does the threads option affect how the FLOW sends the heartbeat, or how the tasks send heartbeats to the flow?
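(For illustration, not part of the thread: the threads option discussed here is presumably Prefect 1.x's heartbeat mode, which defaults to running in a subprocess. Below is a minimal sketch of switching it to a thread through the flow's run config; the flow name and KubernetesRun details are placeholders, not values from this conversation.)

# Sketch only (Prefect 1.x): send heartbeats from a thread instead of the default
# subprocess by overriding the config value via an environment variable on the
# flow's run config. Flow name and run-config details are hypothetical.
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow(
    "enrich-flow",  # hypothetical flow name
    run_config=KubernetesRun(
        env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
    ),
) as flow:
    ...  # tasks would be defined here

(The same override could also be set in config.toml as heartbeat_mode = "thread" under the [cloud] section.)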
Anna Geller
07/13/2022, 1:08 PM
Tom Klein
07/13/2022, 1:33 PM
restart (at the top right) - doesn’t it run the flow again and use checkpointing to fetch results of successfully completed tasks…?
Anna Geller
07/13/2022, 1:40 PM
Tom Klein
07/13/2022, 1:43 PM
the flow ended up failing because it tried to recreate the dask cluster with the same name (i guess?) - while the previous one was still a zombie --- which led to failure
at least that’s what i understand from the error.
so the question is - does restarting a flow lead to a different flow run name or flow run id or something that i can use (from prefect.context) to make sure that the dask cluster has a unique name?
obviously if i use flow_run_name it would be unique per flow run, the question is - would it also be unique across flow restarts?
would these arguments:
namespace="prefect", name=f"tmp_dask_{prefect.context.flow_run_name}", env=my_env
be sufficient to guarantee the cluster has a unique name across restarts? or do i need to use flow_run_id? or something else?
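(For illustration, not part of the thread: one way to wire a run-scoped name into a temporary cluster with Prefect 1.x's DaskExecutor and the classic dask_kubernetes KubeCluster. The image, namespace, worker count, and the random suffix are assumptions; the suffix is there only because the thread never confirms whether flow_run_id changes on a restart. Building the cluster inside a callable matters because prefect.context is populated at flow run time, not at registration.)

# Sketch only: a cluster factory evaluated at flow-run time (so prefect.context
# is populated), with a random suffix so the Kubernetes service name stays
# unique even if a restart reuses the same flow_run_id.
# All concrete values (image, namespace, worker count) are placeholders.
import uuid

import prefect
from prefect import Flow
from prefect.executors import DaskExecutor


def make_cluster():
    from dask_kubernetes import KubeCluster, make_pod_spec

    suffix = uuid.uuid4().hex[:8]
    return KubeCluster(
        make_pod_spec(image="daskdev/dask:latest"),  # placeholder worker image
        namespace="prefect",
        # hyphens rather than underscores: k8s service names must be valid DNS labels
        name=f"tmp-dask-{prefect.context.flow_run_id}-{suffix}",
        n_workers=4,
    )


flow = Flow("enrich-flow")  # hypothetical flow
flow.executor = DaskExecutor(cluster_class=make_cluster)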
i want to restart it because if it failed because of tasks (as you said) then i don’t want to start the whole flow from the beginning, just re-do the failed tasks…
Anna Geller
07/13/2022, 6:21 PM
Tom Klein
07/13/2022, 6:31 PM
Task 'wrapped_enrich_shell_task[55]': Finished task run for task with final state: 'Running'
does that make any sense? how can Running be a final state?