Trying to test out Prefect + Coiled with GPUs. I have a model training as a task and got "No heartbeat detected from the remote task; marking the run as failed." Is this something that can happen if the worker becomes too bogged down resource-wise?
Kevin Kho
07/29/2021, 9:39 PM
Hey @Brett Jurman, normally when heartbeats are not detected, it’s Prefect’s way of alerting you that something went wrong in your Flow. The heartbeat is a separate subprocess from the one your flow is running on. In 0.15.2, we added more logs around this, and from what we have seen so far, it seems to be caused by memory-related issues.
If you are confident your task will succeed, some users have had success in separating out the memory-intensive task into its own Flow, and then turning off heartbeats for that flow and triggering it with create_flow_run or StartFlowRun.
If heartbeats didn’t exist, the UI would show that the flow was running forever even if the underlying infrastructure died.
Out of curiosity, have you succeeded with a GPU-based flow and did you use the local agent or Docker agent to kick that off?
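To illustrate the heartbeat idea described above, here is a minimal sketch using only the Python standard library. It is not Prefect's implementation: Prefect runs the heartbeat in a separate subprocess, while this sketch uses a background thread for brevity, and the names `HeartbeatMonitor` and `run_heartbeat` are hypothetical. It shows why a run that stops sending heartbeats (for example because the process ran out of memory) ends up being treated as failed instead of appearing to run forever.

```python
import threading
import time


class HeartbeatMonitor:
    """Hypothetical helper: records the last heartbeat and reports liveness."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self) -> None:
        # Called periodically by the heartbeat sender.
        with self._lock:
            self._last_beat = time.monotonic()

    def is_alive(self) -> bool:
        # The run counts as alive only if a beat arrived within the timeout.
        with self._lock:
            return (time.monotonic() - self._last_beat) < self.timeout_s


def run_heartbeat(monitor: HeartbeatMonitor, interval_s: float,
                  stop_event: threading.Event) -> None:
    # Event.wait returns False on timeout (send another beat)
    # and True once stop_event is set (exit the loop).
    while not stop_event.wait(interval_s):
        monitor.beat()


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=0.5)
    stop = threading.Event()
    sender = threading.Thread(target=run_heartbeat, args=(monitor, 0.05, stop))
    sender.start()

    time.sleep(0.2)
    print("alive while beating:", monitor.is_alive())

    # Simulate the infrastructure dying: the beats stop arriving.
    stop.set()
    sender.join()
    time.sleep(0.7)
    print("alive after beats stop:", monitor.is_alive())
```

Without the timeout check in `is_alive`, nothing would ever notice that the sender had gone away, which is the "running forever" situation mentioned above.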
Brett Jurman
07/29/2021, 9:40 PM
I'm returning to that GPU-based flow now
Brett Jurman
07/29/2021, 9:43 PM
The GPUs are there, but recently it died from the heartbeat issue. It may be running out of memory, I can test that. is there a way
Brett Jurman
07/29/2021, 9:43 PM
Is there any resource tracking I can see in Prefect?
Kevin Kho
07/29/2021, 9:48 PM
Not on Prefect, but maybe through the Dask dashboard on Coiled?
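As a rough complement to the Dask dashboard, one option is to log the process's peak memory from inside the task body so it shows up alongside the task's other output. The sketch below uses only the standard library `resource` module (Unix only). The function names are hypothetical, the list allocation is a stand-in for real training work, and it measures host memory only, not GPU memory.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def train_step_with_logging(log=print) -> int:
    # Stand-in for the real training work (assumption for illustration).
    data = [0] * 1_000_000
    # Emit the peak host memory seen so far by this process.
    log(f"peak RSS: {peak_rss_mb():.1f} MiB")
    return len(data)


if __name__ == "__main__":
    train_step_with_logging()
```

If the logged peak climbs toward the worker's memory limit shortly before the heartbeat failure, that would support the out-of-memory explanation.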
Brett Jurman
07/29/2021, 9:54 PM
yeah you can see it through there
Brett Jurman
07/29/2021, 9:55 PM
It would be cool to be able to pull that back into the Prefect UI somehow