Hi all, There seems to be some issue with Prefect Cloud heartbeats (Prefect Community, #prefect-server)

Saurabh Indoria

03/14/2022, 6:39 AM

Hi all, There seems to be some issue with Prefect Cloud heartbeats. Randomly, some mapped tasks show an error:

No heartbeat detected from the remote task; retrying the run.This will be retry 1 of 3.

and then it never actually retries.. I believe the Lazarus process must kick in every 10 minutes and reschedule the task, right? CC: @Christina Lopez @Kevin Kho @Anna Geller

Saurabh Indoria

03/14/2022, 6:39 AM

Sample logs: https://cloud.prefect.io/quiltai/task-run/3f08f7c4-e896-4d9a-bacd-6c67068ef545?logs

Saurabh Indoria

03/14/2022, 6:39 AM

CC: @Yash Joshi

Anna Geller

03/14/2022, 10:27 AM

Thanks for reporting your issue. Can you share a bit more about your setup - which agent do you use? you mentioned mapped task - do you run it on Dask, if so what type of Dask cluster is it? Also, can you send me the flow run ID rather than the task run ID?

Kevin Kho

03/14/2022, 2:02 PM

Lazarus will resubmit Flow runs, not task runs. In this case though, it looks like the Flow compute died. What does the task do? Is there a chance your tasks are competing for resources?

👀 1

Christina Lopez

03/14/2022, 2:42 PM

@Saurabh Indoria check out Anna and Kevin’s response.

Saurabh Indoria

03/14/2022, 3:15 PM

Our setup is Prefect Cloud + Kubernetes + LocalDaskExecutor

Saurabh Indoria

03/14/2022, 3:16 PM

Here is the flow run ID: https://cloud.prefect.io/flow-run/23971771-6758-4f31-991f-27f633364988 @Anna Geller

👍 1

Anna Geller

03/14/2022, 3:17 PM

thx will check

Saurabh Indoria

03/14/2022, 3:18 PM

Thanks @Anna Geller

Saurabh Indoria

03/14/2022, 3:18 PM

@Kevin Kho Our tasks are simple microservice calls. The actual compute-intensive work happens on our microservices. Regardless of our task size, when the logs say "retrying the run.This will be retry 1 of 3.", shouldn't that mean it would actually retry?

Kevin Kho

03/14/2022, 3:43 PM

Ah ok. Can you try using threaded heartbeats for both the main flow and subflows? These tend to be more stable

Saurabh Indoria

03/14/2022, 3:47 PM

I believe the default is thread, right? We haven't changed the heartbeat configuration explicitly, so I assumed it is threaded..

Kevin Kho

03/14/2022, 3:47 PM

The default is ‘process’

Kevin Kho

03/14/2022, 3:48 PM

You can find more info here

Saurabh Indoria

03/14/2022, 3:52 PM

Oh I see... Thanks, will switch to threaded heartbeat mode!
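For reference, a minimal sketch of the switch discussed above, assuming Prefect 1.x, where the heartbeat mode of a flow run can be overridden through the PREFECT__CLOUD__HEARTBEAT_MODE environment variable. The helper name heartbeat_env is hypothetical and exists only for illustration.

```python
# Assumption: Prefect 1.x, which reads config overrides from PREFECT__-prefixed
# environment variables set on the flow run.
HEARTBEAT_ENV_VAR = "PREFECT__CLOUD__HEARTBEAT_MODE"
VALID_MODES = ("process", "thread", "off")  # "process" is the default


def heartbeat_env(mode: str = "thread") -> dict:
    """Return the environment override for the given heartbeat mode (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"heartbeat mode must be one of {VALID_MODES}, got {mode!r}")
    return {HEARTBEAT_ENV_VAR: mode}


env = heartbeat_env("thread")

# With Prefect 1.x installed, the dict would be attached to the run config, e.g.:
#   from prefect.run_configs import KubernetesRun
#   flow.run_config = KubernetesRun(env=env)
```

The same variable set to "off" disables heartbeats entirely, which is the option mentioned later in the thread for subflows.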

Kevin Kho

03/14/2022, 3:54 PM

To the question though, the Flow Run will retry, but I don’t know if the task retries because those are two separate things. You are mapping over create_flow_run right?

Anna Geller

03/14/2022, 3:55 PM

Thanks for sharing more info and the flow run ID. The logs don't provide any more info than the error you shared already so no new insights from that.

"Our tasks are simple microservice calls. The actual compute intensive work happens on our microservices."

That's actually more problematic than if you would compute something directly within your flow run pod, because this introduces one more layer of complexity. I've written a more detailed explanation here: https://discourse.prefect.io/t/flow-is-failing-with-an-error-message-no-heartbeat-detected-from-the-remote-task/79#flow-heartbeat-[…]ubernetes-job-5

I agree with Kevin that what may be happening is some issue in the compute, i.e. the flow run pod, where your LocalDaskExecutor executes mapped child task runs in separate threads or processes, runs out of memory, or comes across some network issues. Within each of those threads or subprocesses, you are likely spinning up other subprocesses or directly triggering microservice API calls, and waiting till they finish the execution there. And if e.g. the flow run pod runs out of memory, it gets tricky to find out which subprocess call led to that issue. So what might help here would be to:

1. As Kevin mentioned, switch to threads heartbeat mode to avoid having each of those child task-runs being spun up in a subprocess.

2. Switch to threads on your LocalDaskExecutor(scheduler="threads"), if you are not using threads already.

3. Offload those microservice calls into individual subflows that you can trigger from a parent flow in a flow of flows. You could even turn off the heartbeats for those subflows to prevent such heartbeat errors. This would even allow you to run each child flow in a separate Kubernetes pod to isolate failure of each of those child components.
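The flow-of-flows option could be sketched roughly as follows, assuming Prefect 1.x and its create_flow_run and wait_for_flow_run tasks. The names "parent-flow", "child-flow", "my-project", and the batch_id parameter are hypothetical placeholders, and the child flow is assumed to be registered already.

```python
# Assumption: Prefect 1.x. All flow, project, and parameter names are placeholders.


def child_parameters(batch_ids):
    """Build one parameter dict per child flow run (values must be JSON-serializable)."""
    return [{"batch_id": batch_id} for batch_id in batch_ids]


def build_parent_flow(batch_ids):
    """Build a parent flow that starts one child flow run per batch and waits for each."""
    # Imported lazily so this module still loads where Prefect 1.x is not installed.
    from prefect import Flow, unmapped
    from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

    with Flow("parent-flow") as flow:
        # One child flow run per parameter dict; each can run in its own Kubernetes pod.
        run_ids = create_flow_run.map(
            parameters=child_parameters(batch_ids),
            flow_name=unmapped("child-flow"),
            project_name=unmapped("my-project"),
        )
        # Block until every child flow run has finished.
        wait_for_flow_run.map(run_ids)
    return flow
```

Heartbeats for the child flow would be controlled by the child flow's own run config, for example by setting PREFECT__CLOUD__HEARTBEAT_MODE to "off" in its environment.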

Anna Geller

03/14/2022, 3:56 PM

If you can share your flow, this could help us with finding the issue and suggest things you can try, as we wouldn't have to make as many assumptions as we do right now 😄

Saurabh Indoria

03/14/2022, 3:59 PM

I see, thanks a lot for the detailed response. Let me try out your suggestions before I share the flow (don't wanna bore you with the code 😛 )

Saurabh Indoria

03/14/2022, 4:01 PM

@Kevin Kho We don't map over create_flow_run, instead, we just map a task over a list of pandas dataframes...
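A rough sketch of the setup described here, a single task mapped over a list, combined with the thread-based executor suggested earlier in the thread. It assumes Prefect 1.x; the task name call_service, the flow name, and the num_workers value are hypothetical and not taken from the actual flow, which was not shared.

```python
# Assumption: Prefect 1.x (the prefect.executors module does not exist in Prefect 2+).
EXECUTOR_OPTIONS = {"scheduler": "threads", "num_workers": 4}  # num_workers is an arbitrary example


def build_flow(items):
    """Build a flow that maps one task over `items` with a thread-based LocalDaskExecutor."""
    # Imported lazily so this module still loads where Prefect 1.x is not installed.
    from prefect import Flow, task
    from prefect.executors import LocalDaskExecutor

    @task
    def call_service(item):
        # Placeholder for the real microservice call made per item (e.g. per DataFrame).
        return item

    with Flow("mapped-service-calls") as flow:
        call_service.map(items)

    # Mapped child task runs execute in threads of the flow run process
    # instead of separate subprocesses.
    flow.executor = LocalDaskExecutor(**EXECUTOR_OPTIONS)
    return flow
```

Threads are usually sufficient when each mapped task mostly waits on a remote service, since the heavy computation happens outside the flow run pod.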

Kevin Kho

03/14/2022, 4:02 PM

Ohh I see I misunderstood yes try Anna’s suggestions

🆗 1

🙏 1
