# ask-community
Lucas Beck:
Hi everyone, we are seeing zombie kills for no apparent reason in our flows that spin up k8s resources. They occur roughly 10% of the time. Has anyone else experienced this? It is quite hard to know what is going on here.
Kevin Kho:
Hey @Lucas Beck, this is likely from being unable to secure the hardware for running the flow. Do you have autoscaling enabled?
Lucas Beck:
Yes, we do have autoscaling enabled.
Kevin Kho:
Does the pod actually get created? Are you potentially running out of memory or CPU with the created pods?
Lucas Beck:
Hey @Kevin Kho, The pod gets created, and I am pretty sure we are not running out of CPU or memory, as this particular job is quite light and a restart is all it takes for it to work once this happens.
Kevin Kho:
Checking with the team about this.
Oh sorry, I just realized this is a heartbeat issue. Prefect has heartbeats that check whether your flow is alive. If Prefect didn't have heartbeats, flows that lost communication and died would be shown as Running in the UI forever. 95% of the time, we have seen “no heartbeat detected” as a result of running out of memory, but we have also seen it happen with long-running jobs. We haven't had a reproducible example from the community yet (we'd love to get one). If you are confident the task will succeed, you can make it a subflow and then turn off heartbeats for that subflow. We also rolled out a recent change you can try where you can configure heartbeats to run as threads instead of processes. The documentation for that is here. This was our most recent effort around those.
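For reference, a minimal sketch of the thread-based heartbeat setting, assuming Prefect 1.x and a `KubernetesRun` run config; the exact env var should be double-checked against the linked docs, and the flow name here is illustrative:

```python
# Hedged sketch: switch heartbeats from subprocesses to a thread by setting
# Prefect's cloud.heartbeat_mode config via an environment variable on the
# flow's run config (Prefect 1.x style; flow/task names are made up).
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("spin-up-k8s-jobs") as flow:
    ...  # tasks that create the k8s resources

flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
)
```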
Lucas Beck:
I will try that, thanks @Kevin Kho
Btw @Kevin Kho, another issue, though it might well be related. I noticed that some of the jobs we have that spin up multiple k8s jobs show increasing memory consumption again. I reported this back in the day, and changing the executor from `DaskExecutor` to `LocalDaskExecutor` seemed to have done the job. It turns out that I still see the memory increase over time for larger jobs, even with the `LocalDaskExecutor`. We sometimes want to run 500+ k8s jobs in parallel, and in those cases the memory issue happens. To do that we have been setting the scheduler to threads and the number of workers to 500+. Any ideas on how to tackle the memory increase, or how to solve this another way while still running 500+ tasks spinning up k8s jobs in parallel? Ideally we would like to push the number of jobs running in parallel up to the tens of thousands sometimes. PS: This is similar to what was reported here: https://github.com/PrefectHQ/prefect/issues/3966
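For context, the executor setup described above looks roughly like this (`LocalDaskExecutor` with `scheduler` and `num_workers` is the actual Prefect 1.x API; the flow and task are illustrative):

```python
# Illustration of the setup described above (Prefect 1.x); names are made up.
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor

@task
def launch_k8s_job(job_spec):
    ...  # create one Kubernetes Job and wait for it to finish

with Flow("parallel-k8s-jobs") as flow:
    launch_k8s_job.map(job_spec=list(range(500)))

# Thread-based scheduler with roughly one worker thread per concurrent k8s job.
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=500)
```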
This might be related: https://github.com/dask/distributed/issues/2757 If Dask is the issue, then I guess we are in rough waters, and perhaps the only thing to do is to either break the flows down into smaller ones or allocate a lot of memory and hope the job completes before the leak catches up 😕
Kevin Kho:
So this Dask issue I am actively working on myself. I think I should have a PR this week. I identified the problem but am working through a solution at the moment.
On the Dask side, I believe they fixed their memory issues in 2021.06 and above. Prefect still doesn't use it efficiently, so I am working on that.
I'm not done yet, but just for you to follow.
Lucas Beck:
Thanks for the update @Kevin Kho 🙏 Looking forward to testing it out once it is done 🙂
Kevin Kho:
I think you can test that branch. The code won't change, I think; I just need to write tests. The context is still duplicated, but that one is quite hard to rip out. I would need changes in a bunch of places.
Lucas Beck:
I was taking a closer look at the PR, and if I understand it correctly, the changes will move data to the workers ahead of time, reducing some overhead. What puzzles me is why that would address the issue of a slow and constant memory increase. In my head, the overhead reduction would lower the overall memory consumption, but I cannot see how it would address what seems to be a memory leak. Maybe I am missing something here?
Kevin Kho:
Yeah, so the old method was copying everything that went into the task, repeatedly. So if you have 100,000 mapped tasks, the task definition, context, task runner class, everything gets copied 100,000 times. The “memory leak” is all of those copies taking up space. Even for a task that doesn't return anything, they occupy more space on the workers. The PR moves it ahead of time so that the task definition, task runner class, and everything shared only needs to be copied once per worker and then used on the worker side.
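Conceptually, the difference is something like the following plain dask.distributed sketch (not the actual Prefect change; the function and shared object are illustrative). The shared state is scattered to each worker once and passed as a future, instead of being serialized into every task submission:

```python
# Conceptual sketch with plain dask.distributed, not Prefect's actual PR.
from distributed import Client

def run_task(i, shared_state):
    # Stand-in for task execution that only reads shared_state.
    return i

client = Client()  # local cluster, just for illustration
shared = {"context": "...", "task_runner": "..."}  # stand-in for shared state

# Old pattern: `shared` is serialized and shipped with every single submission.
old_futures = [client.submit(run_task, i, shared) for i in range(1000)]

# New pattern: copy `shared` to every worker once, then pass only the reference.
[shared_ref] = client.scatter([shared], broadcast=True)
new_futures = [client.submit(run_task, i, shared_ref) for i in range(1000)]
```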
Lucas Beck:
😮 Ok! I will be busy in the coming week, so I will have to postpone testing the branch for now, but I will report back once we do 🙂 And thanks for the support!
@Thomas Nyegaard-Signori FYI