Saurabh Indoria
03/14/2022, 6:39 AM
No heartbeat detected from the remote task; retrying the run. This will be retry 1 of 3.
and then it never actually retries. I believe the Lazarus process must kick in every 10 minutes and reschedule the task, right?
CC: @Christina Lopez @Kevin Kho @Anna Geller
Anna Geller
03/14/2022, 10:27 AM
Kevin Kho
03/14/2022, 2:02 PM
Christina Lopez
03/14/2022, 2:42 PM
Saurabh Indoria
03/14/2022, 3:15 PM
Anna Geller
03/14/2022, 3:17 PM
Saurabh Indoria
03/14/2022, 3:18 PM
"retrying the run. This will be retry 1 of 3.", shouldn't that mean it would actually retry?
Kevin Kho
03/14/2022, 3:43 PM
Saurabh Indoria
03/14/2022, 3:47 PM
Kevin Kho
03/14/2022, 3:47 PM
Saurabh Indoria
03/14/2022, 3:52 PM
Kevin Kho
03/14/2022, 3:54 PM
Anna Geller
03/14/2022, 3:55 PM
"Our tasks are simple microservice calls. The actual compute intensive work happens on our microservices."
That's actually more problematic than computing something directly within your flow run pod, because it introduces one more layer of complexity. I've written a more detailed explanation here: https://discourse.prefect.io/t/flow-is-failing-with-an-error-message-no-heartbeat-detected-from-the-remote-task/79#flow-heartbeat-[…]ubernetes-job-5
I agree with Kevin that what may be happening is some issue in the compute, i.e. the flow run pod, where your LocalDaskExecutor executes mapped child task runs in separate threads or processes: it may run out of memory or come across some network issues. Within each of those threads or subprocesses, you are likely spinning up other subprocesses or directly triggering microservice API calls, and waiting until they finish executing there. And if, e.g., the flow run pod runs out of memory, it gets tricky to find out which subprocess call led to that issue.
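The pattern described here (mapped child tasks running in threads, each blocking on a remote call) can be sketched with the standard library alone. This is only an illustration, not Prefect code: `call_microservice` and `run_mapped` are hypothetical names, and the failure for item 3 is simulated. The sketch keeps results and errors keyed by input so it stays clear which call failed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_microservice(item: int) -> int:
    """Hypothetical stand-in for a blocking microservice call."""
    time.sleep(0.05)  # simulate waiting on the remote service
    if item == 3:
        # Simulated failure, so the error path is exercised
        raise RuntimeError(f"service failed for item {item}")
    return item * 2


def run_mapped(items):
    """Run one call per item in worker threads; collect results and errors per item."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit everything first so the calls overlap while they wait
        futures = {item: pool.submit(call_microservice, item) for item in items}
        for item, future in futures.items():
            try:
                results[item] = future.result()
            except RuntimeError as exc:
                errors[item] = str(exc)
    return results, errors


results, errors = run_mapped(range(5))
print(results)
print(errors)
```

Because the work is just waiting on remote calls, threads are enough to overlap it, and recording failures per input avoids guessing which call caused the problem.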
So what might help here would be to:
1. As Kevin mentioned, switch the heartbeat mode to threads to avoid having each of those child task runs spun up in a subprocess
2. Switch to threads on your executor with LocalDaskExecutor(scheduler="threads"), if you are not using threads already
3. Offload those microservice calls into individual subflows that you can trigger from a parent flow in a flow of flows. You could even turn off the heartbeats for those subflows to prevent such heartbeat errors. This would even allow you to run each child flow in a separate Kubernetes pod to isolate the failure of each of those child components
Saurabh Indoria
03/14/2022, 3:59 PM
Kevin Kho
03/14/2022, 4:02 PM