How to detect state == FAILED as soon as heartbeat...
# prefect-server
j
How to detect state == FAILED as soon as heartbeat fails? We have a flow set up to run on ECS that purposefully consumes all the available memory on the container. (We want to make sure we can handle such edge cases.) We monitor the prefect logs, and this message comes through:
No heartbeat detected from the remote task; marking the run as failed.
For 20+ minutes following that log message, fetching the flow run state from prefect cloud still shows
<Running: "Running flow.">
Ideally, as soon as the flow run is marked as failed, state from prefect cloud would say Failed. Suggestions?
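For context, a minimal sketch of how one might poll the flow run state from Prefect Cloud with the Prefect 1.x Client; the flow run ID is a placeholder and the exact client methods can differ between versions:
```python
# Sketch: polling a flow run's state from Prefect Cloud (Prefect 1.x Client).
from prefect import Client

FLOW_RUN_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

client = Client()  # picks up the API key/token from your Prefect config
run_info = client.get_flow_run_info(FLOW_RUN_ID)

print(run_info.state)              # e.g. <Running: "Running flow.">
print(run_info.state.is_failed())  # True once the run is actually marked Failed
```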
k
Will ask the team, but I’m not sure this can be sped up because the Flow is responsible for updating its state through API calls. If the heartbeat dies, Cloud/Server notices it has lost communication with the Flow and then marks the Flow as failed. This delay likely comes from batching of updates to the database (similar to logs), so it won’t be instant.
In Prefect Cloud we have automations that let you dictate certain actions when your Flows hit certain states: “If this fails, send a notification” or “If this fails, trigger this other Flow.” It’s these automations that trigger events based on Flow state.
j
For the flow run that is going right now, 20 minutes ago the logs showed `No heartbeat detected from the remote task; marking the run as failed.` and the state for the flow still shows as "Running".
k
Will check with the team
j
Yesterday when testing this scenario, the flow state would finally come back as Failed after 5-10 minutes. Today they don't seem to arrive at Failed at all. (We allowed one run to go for 20+ minutes and another for 30+ minutes.)
k
It sounds like there might be a backup or it’s failing to write to the database for some reason. Are other flows able to execute, and are your other services healthy (towel)?
j
If we pass in different arguments to the flow (such that it does not consume so much memory), the flow succeeds.
k
Do you have any logs from `towel`, the container with the Zombie Killer service?
j
Where would I go to find towel logs?
k
It should be one of the containers spun up by `prefect server start`. You would look for the container logs.
j
We are using prefect cloud.
k
Oh sorry, I thought you were on Server since this is the server channel. Will follow up with the team tomorrow.
a
@jack based on:
If we pass in different arguments to the flow (such that it does not consume so much memory), the flow succeeds.
it looks like your flow runs are running out of memory, which causes the flow heartbeats to be lost. You noticed correctly that when you assign more memory, this doesn’t happen. Especially when using ECS, I would definitely try to bump up the memory on your flow’s ECS task definition or run configuration. To explain why this behavior happens: flow heartbeats signal to Prefect Cloud that your flow is alive. If Prefect didn’t have heartbeats, flows that lose communication and die would permanently be shown as Running in the UI (this is what you experienced). Since your ECS container dies due to memory issues, the flow heartbeats die with it, and Prefect has no way of telling whether this flow run ultimately failed, succeeded, or was manually cancelled.
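For illustration, a sketch of bumping the task size through the ECSRun run config in Prefect 1.x; the cpu/memory values below are placeholders, not a recommendation for this specific workload:
```python
# Sketch: giving the flow's ECS task more memory via the run config (Prefect 1.x).
# The cpu/memory values are illustrative; size them to what the workload needs.
from prefect.run_configs import ECSRun

run_config = ECSRun(
    cpu="1 vcpu",
    memory="4 GB",
)
# attach before registering, e.g.  flow.run_config = run_config
```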
j
Hi @Anna Geller The workloads we run vary in size. We have been purposefully testing the edge case where ECS runs out of memory so that we won't be surprised in the future when one of our production flows runs out of memory. We only offered that "with different parameters the flow completes normally" because Kevin had asked "Are other flows able to execute and are your other services healthy?" We had hoped that since the prefect log says
No heartbeat detected from the remote task; marking the run as failed
that when we query prefect for the flow state, it would also say the flow has failed. So far that is not happening.
a
@jack Got it. In that case I’d assume that the edge case testing has been successful? Or do you have any questions or issues around that edge case? In general, you need to allocate enough memory to your ECS tasks; otherwise, when the flow run container dies due to OOM on ECS, the Flow’s heartbeat is lost and Prefect doesn’t know whether the flow run ended in success/failure or was manually cancelled, because it couldn’t monitor the flow run till the end. You could then manually mark this flow run as Failed if you wish, using the UI or GraphQL API. A better alternative would be to assign enough resources to the ECS tasks so that OOM doesn’t happen in the first place, allowing Prefect to infer the flow run state correctly.
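As a sketch of the "manually mark it Failed" option, assuming the Prefect 1.x Client (the flow run ID is a placeholder, and older releases may also expect the run's `version` argument):
```python
# Sketch: manually marking a stuck flow run as Failed with the Prefect 1.x Client.
# FLOW_RUN_ID is a placeholder; check set_flow_run_state's signature for your version.
from prefect import Client
from prefect.engine.state import Failed

client = Client()
FLOW_RUN_ID = "00000000-0000-0000-0000-000000000000"

client.set_flow_run_state(
    flow_run_id=FLOW_RUN_ID,
    state=Failed("ECS container ran out of memory; heartbeat lost."),
)
```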
j
@Anna Geller Not successful. The prefect logs say
No heartbeat detected from the remote task; marking the run as failed
But the run is not actually marked as failed. Prefect still shows the flow run with state Running.
k
Chatted with the team, and this log suggests that a specific task is losing communication while the Flow Run is still ongoing. Do you use a DaskExecutor, or is there a task that runs on different compute, like an API call? Do you have other tasks in the Flow that are still ongoing?
j
There is only one task in the flow.
Wait...let me verify that
There is only one task in the flow.
Not using a DaskExecutor.
Using Docker storage type with ECSRun
`state_code` and `posix_timestamp` shown in the above screenshot are Parameters. `actual_work` is the only task.
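To make the setup concrete, a hypothetical sketch of a flow with that shape, reusing the names from this thread (state_code, posix_timestamp, actual_work); the flow name, registry, image, and resource values are placeholders, not the actual configuration:
```python
# Hypothetical reconstruction of the flow shape described above: Docker storage,
# ECSRun, two Parameters, and a single task. Registry/image/resource values are
# placeholders.
from prefect import Flow, Parameter, task
from prefect.run_configs import ECSRun
from prefect.storage import Docker

@task
def actual_work(state_code, posix_timestamp):
    # the real workload; in this edge-case test it deliberately exhausts memory
    pass

with Flow(
    "memory-edge-case",
    storage=Docker(registry_url="<registry-url>", image_name="memory-edge-case"),
    run_config=ECSRun(cpu="1 vcpu", memory="4 GB"),
) as flow:
    state_code = Parameter("state_code")
    posix_timestamp = Parameter("posix_timestamp")
    actual_work(state_code, posix_timestamp)
```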
k
We opened an internal issue for this. Would you be able to share your flow code (with sensitive stuff redacted) with me through DM so I can add it to the issue?
j
Yes