# prefect-ui
n
Hello ladies and gentlemen, I run a Prefect local agent on one machine and use Prefect Cloud as the backend. When I execute a flow, the agent creates a child process which executes it. If this child process is killed for some reason (e.g. my Unix supervisor kills it for OOM), the agent just retries as configured, gets the same result, and gives up. In the web UI I can see that my task has 'running' status, but it's actually doing nothing. It's definitely a problem because you can't really tell whether your tasks are running or have been killed. I would expect the agent to collect info about the killed child process and inform the backend, so I could at least see that something is wrong with the agent. Is this problem solved, or maybe I can fix it somehow?
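For illustration, here is a minimal sketch of how a parent process can notice that a child was killed by a signal, which is the information the question asks the agent to report. This is not how the Prefect agent is implemented; `describe_exit` is a hypothetical helper, and the negative return code convention is specific to Python's `subprocess` module on POSIX systems.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a child's return code into a human-readable status (hypothetical helper)."""
    if returncode == 0:
        return "finished successfully"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with error code {returncode}"


# Stand-in for a flow-run process: a child that receives SIGKILL,
# which is also the signal an OOM killer would typically send.
child = subprocess.Popen(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
child.wait()
print(describe_exit(child.returncode))
```

A supervisor that sees a "killed by" status could then report the run as failed instead of leaving it in a running state.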
a
Hi @Nikita Samoylov, can you share the flow code that is causing issues for you? You can share a simplified version with no sensitive info
n
I work with huge pandas DataFrames, and sometimes there is not enough memory to process them. I know this is my problem. But I just want to know when the agent is killed in such cases, and not see a 'running' state in the UI. Actually, I can't understand how my code will help you ))
a
Thanks for sharing. There are many ways to configure your flow to prevent such OOM errors, e.g. processing data in batches or offloading the execution to a remote Dask cluster. I’m trying to collect as much information as I can about your use case to understand the problem you’re trying to solve.
If I understand it correctly, the OOM seems to be the root cause of the flow failures, so if we could fix that, there wouldn’t be any zombie processes any more.
In general, we use heartbeats to make sure a flow run is still healthy, and we mark it as failed if it crashes. What heartbeat settings are you using?
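On the batching suggestion above, a rough sketch of the idea using only the standard library: rows are consumed a fixed number at a time, so only one batch is held in memory while an aggregate is accumulated. The `iter_batches` and `total_amount` names and the `amount` column are made up for illustration. With pandas, the `chunksize` parameter of `read_csv` gives a similar chunk-by-chunk iteration.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def total_amount(csv_file, batch_size=2):
    """Sum the hypothetical 'amount' column one batch at a time."""
    reader = csv.DictReader(csv_file)
    total = 0.0
    for batch in iter_batches(reader, batch_size):
        # Only the current batch is materialized in memory.
        total += sum(float(row["amount"]) for row in batch)
    return total


# In-memory stand-in for a large file on disk.
sample = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
print(total_amount(sample))
```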
k
Hey @Nikita Samoylov, heartbeats are the mechanism for Prefect to mark the flow as failed if it can’t communicate with it. Do you have heartbeats turned off?
n
I use the default settings for heartbeat, at least I didn't configure it. I should read about it a little bit.
k
if you go to the flow settings, you can see if it is turned on
I believe server should have this as well
n
I can see in UI that Heartbeat is turned on for my project
k
I am surprised because, in fact, we get people complaining about the opposite: that the flow is being marked as failed when it’s still running. This is the first time I’ve seen the heartbeat not kicking in. How long has it been running? Are we talking hours or minutes?
n
The process was killed by the supervisor 10-15 minutes after it started. In the UI it stays marked as running indefinitely, at least for the first 23 hours after start.
k
Will ask the team for more ideas
You can check the last recorded heartbeat by querying for the flow run info using the GraphQL API, and see whether that heartbeat is still being updated
n
Heartbeat is null, but the task is still running 🥲
{
  "data": {
    "flow_run": [
      {
        "id": "0ff97838-e0f1-4eb8-b1f4-66e4a7d8e5cf",
        "name": "spicy-bustard",
        "heartbeat": null
      }
    ]
  }
}
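A check like the one above could be automated roughly as follows. This is a sketch under assumptions: the query text uses a Hasura-style `where` filter on `state` that is assumed rather than taken from this thread, `find_stale_runs` is a hypothetical helper, and heartbeat values are assumed to be ISO 8601 timestamps when they are not null. Only the response shape (`id`, `name`, `heartbeat`) comes from the output shown above.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumed query shape; adjust the filter to whatever your backend's schema accepts.
FLOW_RUN_QUERY = """
query {
  flow_run(where: {state: {_eq: "Running"}}) {
    id
    name
    heartbeat
  }
}
"""


def find_stale_runs(response, max_age=timedelta(minutes=10), now=None):
    """Return names of runs whose heartbeat is missing or older than max_age."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for run in response["data"]["flow_run"]:
        heartbeat = run.get("heartbeat")
        if heartbeat is None:
            # No heartbeat was ever recorded for this run.
            stale.append(run["name"])
            continue
        last = datetime.fromisoformat(heartbeat.replace("Z", "+00:00"))
        if last.tzinfo is None:
            # Assumption: timestamps without an offset are in UTC.
            last = last.replace(tzinfo=timezone.utc)
        if now - last > max_age:
            stale.append(run["name"])
    return stale


# The response pasted in the thread, parsed as JSON.
response = json.loads("""
{
  "data": {
    "flow_run": [
      {
        "id": "0ff97838-e0f1-4eb8-b1f4-66e4a7d8e5cf",
        "name": "spicy-bustard",
        "heartbeat": null
      }
    ]
  }
}
""")
print(find_stale_runs(response))
```

For the response in this thread the run is flagged, since its heartbeat is null while the UI still shows it as running.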