# ask-community
Lucas Beck:
Hi everyone, we are seeing zombie kills for no apparent reason in our flows that spin up k8s resources. They occur roughly 10% of the time. Has anyone else experienced this? It is quite hard to know what is going on here.
Kevin Kho:
Hey @Lucas Beck, this is likely from being unable to secure the hardware for running the flow. Do you have autoscaling enabled?
Lucas Beck:
Yes, we do have autoscaling enabled.
Kevin Kho:
Does the pod actually get created? Are you potentially running out of memory or CPU with the created pods?
Lucas Beck:
Hey @Kevin Kho, The pod gets created, and I am pretty sure we are not running out of CPU or memory, as this particular job is quite light and a restart is all it takes for it to work once this happens.
Kevin Kho:
Checking with the team about this.
Oh sorry, I just realized this is a heartbeat issue. Prefect has heartbeats that check whether your flow is alive. If Prefect didn't have heartbeats, flows that lost communication and died would be shown as Running in the UI forever. 95% of the time, we have seen “no heartbeat detected” as a result of running out of memory, but we have also seen it happen with long-running jobs. We haven't had a reproducible example from the community yet (we'd love to get one). If you are confident the task will succeed, you can make it a subflow and then turn off heartbeats for that subflow. We also rolled out a recent change you can try where you can configure heartbeats to run as threads instead of processes. The documentation for that is here. This was our most recent effort around those.
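For reference, a minimal sketch of the thread-based heartbeat setting, assuming Prefect 1.x and a `KubernetesRun` run config; the exact env var should be double-checked against the linked docs, and the flow name here is illustrative:

```python
# Hedged sketch: switch heartbeats from subprocesses to a thread by setting
# Prefect's cloud.heartbeat_mode config via an environment variable on the
# flow's run config (Prefect 1.x style; flow/task names are made up).
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("spin-up-k8s-jobs") as flow:
    ...  # tasks that create the k8s resources

flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
)
```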
Lucas Beck:
I will try that, thanks @Kevin Kho
Btw @Kevin Kho, another issue, though it might well be related. I noticed that some of the jobs we have that spin up multiple k8s jobs show increasing memory consumption again. I reported this back in the day, and changing the executor from `DaskExecutor` to `LocalDaskExecutor` seemed to have done the job. It turns out that I still see the memory increase over time for larger jobs, even with the `LocalDaskExecutor`. We sometimes want to run 500+ k8s jobs in parallel, and in those cases the memory issue happens. To do that we have been setting the scheduler to threads and the number of workers to 500+. Any ideas on how to tackle the memory increase, or how to solve this another way while still running 500+ tasks spinning up k8s jobs in parallel? Ideally we would like to push the number of jobs running in parallel up to the tens of thousands sometimes. PS: This is similar to what was reported here: https://github.com/PrefectHQ/prefect/issues/3966
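For context, the executor setup described above looks roughly like this (`LocalDaskExecutor` with `scheduler` and `num_workers` is the actual Prefect 1.x API; the flow and task are illustrative):

```python
# Illustration of the setup described above (Prefect 1.x); names are made up.
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor

@task
def launch_k8s_job(job_spec):
    ...  # create one Kubernetes Job and wait for it to finish

with Flow("parallel-k8s-jobs") as flow:
    launch_k8s_job.map(job_spec=list(range(500)))

# Thread-based scheduler with roughly one worker thread per concurrent k8s job.
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=500)
```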
This might be related: https://github.com/dask/distributed/issues/2757 If Dask is the issue, then I guess we are in rough waters, and perhaps the only thing to do is to either break the flows down into smaller ones or allocate a lot of memory and hope the job completes before the leak catches up 😕
Kevin Kho:
So this Dask issue I am actively working on myself. I think I should have a PR this week. I identified the problem but am working through a solution at the moment.
On the Dask side, I believe they fixed their memory issues in 2021.06 and above. Prefect still doesn't use it efficiently, so I am working on that.
I'm not done yet, but just for you to follow.
Lucas Beck:
Thanks for the update @Kevin Kho 🙏 Looking forward to testing it out once it is done 🙂
Kevin Kho:
I think you can test that branch. The code won't change, I think; I just need to write tests. The context is still duplicated, but that one is quite hard to rip out. I would need changes in a bunch of places.
Lucas Beck:
I was taking a closer look at the PR, and if I understand it correctly, the changes will move data to the workers ahead of time, reducing some overhead. What puzzles me is why that would address the issue of a slow and constant memory increase. In my head, the overhead reduction would lower the overall memory consumption, but I cannot see how it would address what seems to be a memory leak. Maybe I am missing something here?
Kevin Kho:
Yeah, so the old method was copying everything that went into the task, repeatedly. So if you have 100,000 mapped tasks, the task definition, context, task runner class, everything gets copied 100,000 times. The “memory leak” is all of those copies taking up space. Even for a task that doesn't return anything, they occupy more space on the workers. The PR moves it ahead of time so that the task definition, task runner class, and everything shared only needs to be copied once per worker and then used on the worker side.
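Conceptually, the difference is something like the following plain dask.distributed sketch (not the actual Prefect change; the function and shared object are illustrative). The shared state is scattered to each worker once and passed as a future, instead of being serialized into every task submission:

```python
# Conceptual sketch with plain dask.distributed, not Prefect's actual PR.
from distributed import Client

def run_task(i, shared_state):
    # Stand-in for task execution that only reads shared_state.
    return i

client = Client()  # local cluster, just for illustration
shared = {"context": "...", "task_runner": "..."}  # stand-in for shared state

# Old pattern: `shared` is serialized and shipped with every single submission.
old_futures = [client.submit(run_task, i, shared) for i in range(1000)]

# New pattern: copy `shared` to every worker once, then pass only the reference.
[shared_ref] = client.scatter([shared], broadcast=True)
new_futures = [client.submit(run_task, i, shared_ref) for i in range(1000)]
```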
Lucas Beck:
😮 Ok! I will be busy in the coming week, so I will have to postpone testing the branch for now, but I will report back once we do 🙂 And thanks for the support!
@Thomas Nyegaard-Signori FYI