Max Mose

06/03/2022, 4:34 PM
Hi, we're having an issue with prefect cloud and the heartbeat mechanism marking a task as failed, but not marking the flow as failed. We are running flows using an EKS cluster that uses spot instances, so we rely on the heartbeat mechanism to know to retry flows that die because the underlying node is reclaimed by AWS. For flows where we're encountering the issue(see flow
), the log messages state:
No heartbeat detected from the remote task; marking the run as failed.
for some task, but the flow is still in a running state. The issue is intermittent,
is a flow where the heartbeat marked the same task as failed and it also successfully marked the flow as failed. Any insight into what may be going on here would be much appreciated!

Kevin Kho

06/03/2022, 5:07 PM
Is there Dask involved here? This different behavior is the same flow, just different runs right?

Max Mose

06/03/2022, 5:44 PM
We removed the dask executor from this flow and it is still happening. Yep, it is the same flow, but it happens at random from run to run.