Hi all! I have a Kubernetes cluster on AKS that’s running the Prefect Kubernetes agent. I’m trying to use the cluster autoscaler on the node pools to scale the cluster based on the CPU and memory requests of our Prefect flows. We’re using Prefect Cloud for orchestration.
The jobs are being submitted fine, but they don’t consistently trigger a scale-up. Sometimes the exact same flow, with the same resource requests, will trigger the autoscaler, and sometimes it won’t.
When I look at the pods, the ones that run successfully and the ones that sit in Pending have the same resource requests, etc., so the requests are definitely making it through to Kubernetes.
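For anyone who wants to repeat that comparison, here is a minimal sketch of one way to do it, not necessarily the exact check I ran. It assumes each pod has been dumped with `kubectl get pod <pod-name> -o json`, and the helper names `scheduling_summary` and `diff_summaries` are made up for illustration. Reading "etc." as `nodeSelector` and `tolerations` is also an assumption about which other fields are worth comparing.

```python
import json


def scheduling_summary(pod: dict) -> dict:
    """Pull out the parts of a pod manifest being compared.

    `pod` is the parsed JSON of a single pod. Which fields to include
    beyond the resource requests is an assumption of this sketch.
    """
    spec = pod.get("spec", {})
    # Map each container name to its resource requests (empty if unset).
    requests = {
        c["name"]: c.get("resources", {}).get("requests", {})
        for c in spec.get("containers", [])
    }
    return {
        "requests": requests,
        "nodeSelector": spec.get("nodeSelector", {}),
        "tolerations": spec.get("tolerations", []),
    }


def diff_summaries(a: dict, b: dict) -> dict:
    """Return only the keys whose values differ, as (a_value, b_value) pairs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def load_pod(path: str) -> dict:
    """Read a pod manifest that was saved to a JSON file."""
    with open(path) as f:
        return json.load(f)
```

With one running pod and one pending pod saved to files (the file names are placeholders), `diff_summaries(scheduling_summary(load_pod("running.json")), scheduling_summary(load_pod("pending.json")))` returns an empty dict when the compared fields match.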
We’re not maxing out the node limits of the pools (i.e. this is still a problem when scaling from 0->1), and the resources of the nodes (CPU/memory) are sufficient to run the flows.
I’m at a loss for other reasons why flow runs that appear identical would sometimes work and sometimes not. Has anyone experienced this, or does anyone have ideas on where to look next?