Eduardo Mota

09/11/2023, 10:33 PM
Hi everyone, I have an issue that has been difficult to debug. We are running Prefect 2.8.6 in Kubernetes on-premises. One node has a GPU along with massive memory and CPU, so most of our jobs run there. What we are experiencing is that when we submit a batch of 20 jobs, it takes 20 minutes to complete, and at the end of those 20 minutes CPU usage goes through the roof: the jobs start using all the CPU on the node, making it unusable. The jobs have a 1 CPU limit and 10 GB of memory, so 20 jobs should be easy for a node with 112 CPUs and 2 TB of memory. Any ideas to help us troubleshoot would be greatly appreciated.