How to detect state == FAILED as soon as heartbeat...
# prefect-server
j
How to detect state == FAILED as soon as heartbeat fails? We have a flow set up to run on ECS that purposefully consumes all the available memory on the container. (We want to make sure we can handle such edge cases.) We monitor the prefect logs, and this message comes through:
No heartbeat detected from the remote task; marking the run as failed.
For 20+ minutes following that log message, fetching the flow run state from prefect cloud still shows
<Running: "Running flow.">
Ideally, as soon as the flow run is marked as failed, state from prefect cloud would say Failed. Suggestions?
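For context, a minimal sketch of how one might poll the flow run state from Prefect Cloud with the Prefect 1.x Client; the flow run ID is a placeholder and the exact client methods can differ between versions:
```python
# Sketch: polling a flow run's state from Prefect Cloud (Prefect 1.x Client).
from prefect import Client

FLOW_RUN_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

client = Client()  # picks up the API key/token from your Prefect config
run_info = client.get_flow_run_info(FLOW_RUN_ID)

print(run_info.state)              # e.g. <Running: "Running flow.">
print(run_info.state.is_failed())  # True once the run is actually marked Failed
```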
k
Will ask the team, but I’m not sure this can be sped up because the Flow is responsible for updating its state through API calls. If the heartbeat dies, Cloud/Server notices it has lost communication with the Flow and then marks the Flow as failed. This delay likely comes from batching of updates to the database (similar to logs), so it won’t be instant.
In Prefect Cloud we have automations that let you dictate certain actions when your Flows hit certain states: “If this fails, send a notification” or “If this fails, trigger this other Flow.” It’s these automations that trigger events based on Flow state.
j
For the flow run that is going right now, 20 minutes ago the logs showed `No heartbeat detected from the remote task; marking the run as failed.` and the state for the flow still shows as "Running".
k
Will check with the team
j
Yesterday when testing this scenario, the flow state would finally come back as Failed after 5-10 minutes. Today they don't seem to arrive at Failed at all. (We allowed one run to go for 20+ minutes and another for 30+ minutes.)
k
It sounds like there might be a backup or it’s failing to write to the database for some reason. Are other flows able to execute, and are your other services healthy (towel)?
j
If we pass in different arguments to the flow (such that it does not consume so much memory), the flow succeeds.
k
Do you have any logs from `towel`, the container with the Zombie Killer service?
j
Where would I go to find towel logs?
k
It should be one of the containers spun up by `prefect server start`. You would look for the container logs.
j
We are using prefect cloud.
k
Oh sorry, I thought you were on Server since this is the server channel. Will follow up with the team tomorrow.
a
@jack based on:
If we pass in different arguments to the flow (such that it does not consume so much memory), the flow succeeds.
it looks like your flow runs are running out of memory, which causes the flow heartbeats to be lost. You noticed correctly that when you assign more memory, this doesn’t happen. Especially when using ECS, I would definitely try to bump up the memory on your flow’s ECS task definition or run configuration. To explain why this behavior happens: flow heartbeats signal to Prefect Cloud that your flow is alive. If Prefect didn’t have heartbeats, flows that lose communication and die would permanently be shown as Running in the UI (this is what you experienced). Since your ECS container dies due to memory issues, the flow heartbeats die with it, and Prefect has no way of telling whether this flow run ultimately failed, succeeded, or was manually cancelled.
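For illustration, a sketch of bumping the task size through the ECSRun run config in Prefect 1.x; the cpu/memory values below are placeholders, not a recommendation for this specific workload:
```python
# Sketch: giving the flow's ECS task more memory via the run config (Prefect 1.x).
# The cpu/memory values are illustrative; size them to what the workload needs.
from prefect.run_configs import ECSRun

run_config = ECSRun(
    cpu="1 vcpu",
    memory="4 GB",
)
# attach before registering, e.g.  flow.run_config = run_config
```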
j
Hi @Anna Geller The workloads we run vary in size. We have been purposefully testing the edge case where ECS runs out of memory so that we won't be surprised in the future when one of our production flows runs out of memory. We only offered that "with different parameters the flow completes normally" because Kevin had asked "Are other flows able to execute and are your other services healthy?" We had hoped that since the prefect log says
No heartbeat detected from the remote task; marking the run as failed
that when we query prefect for the flow state, it would also say the flow has failed. So far that is not happening.
a
@jack Got it. In that case I’d assume that the edge case testing has been successful? Or do you have any questions or issues around that edge case? In general, you need to allocate enough memory to your ECS tasks; otherwise, when the flow run container dies due to OOM on ECS, the Flow’s heartbeat is lost and Prefect doesn’t know whether the flow run ended in success/failure or was manually cancelled, because it couldn’t monitor the flow run till the end. You could then manually mark this flow run as Failed if you wish, using the UI or GraphQL API. A better alternative would be to assign enough resources to the ECS tasks so that OOM doesn’t happen in the first place, allowing Prefect to infer the flow run state correctly.
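As a sketch of the "manually mark it Failed" option, assuming the Prefect 1.x Client (the flow run ID is a placeholder, and older releases may also expect the run's `version` argument):
```python
# Sketch: manually marking a stuck flow run as Failed with the Prefect 1.x Client.
# FLOW_RUN_ID is a placeholder; check set_flow_run_state's signature for your version.
from prefect import Client
from prefect.engine.state import Failed

client = Client()
FLOW_RUN_ID = "00000000-0000-0000-0000-000000000000"

client.set_flow_run_state(
    flow_run_id=FLOW_RUN_ID,
    state=Failed("ECS container ran out of memory; heartbeat lost."),
)
```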
j
@Anna Geller Not successful. The prefect logs say
No heartbeat detected from the remote task; marking the run as failed
But the run is not actually marked as failed. Prefect still shows the flow run with state Running.
k
Chatted with the team, and this log suggests that a specific task is losing communication while the Flow Run is still ongoing. Do you use a DaskExecutor, or is there a task that runs on different compute, like an API call? Do you have other tasks in the Flow that are still ongoing?
j
There is only one task in the flow.
Wait...let me verify that
There is only one task in the flow.
Not using a DaskExecutor.
Using Docker storage type with ECSRun
`state_code` and `posix_timestamp` shown in the above screenshot are Parameters. `actual_work` is the only task.
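To make the setup concrete, a hypothetical sketch of a flow with that shape, reusing the names from this thread (state_code, posix_timestamp, actual_work); the flow name, registry, image, and resource values are placeholders, not the actual configuration:
```python
# Hypothetical reconstruction of the flow shape described above: Docker storage,
# ECSRun, two Parameters, and a single task. Registry/image/resource values are
# placeholders.
from prefect import Flow, Parameter, task
from prefect.run_configs import ECSRun
from prefect.storage import Docker

@task
def actual_work(state_code, posix_timestamp):
    # the real workload; in this edge-case test it deliberately exhausts memory
    pass

with Flow(
    "memory-edge-case",
    storage=Docker(registry_url="<registry-url>", image_name="memory-edge-case"),
    run_config=ECSRun(cpu="1 vcpu", memory="4 GB"),
) as flow:
    state_code = Parameter("state_code")
    posix_timestamp = Parameter("posix_timestamp")
    actual_work(state_code, posix_timestamp)
```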
k
We opened an internal issue for this. Would you be able to share your flow code (with sensitive stuff redacted) with me through DM so I can add it to the issue?
j
Yes