Thomas Nyegaard-Signori
11/03/2021, 1:51 PM
scale-down-delay-after-add to 1m and similarly the scale-down-unneeded-time to 1m.
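For reference, those two settings are flags on the cluster-autoscaler deployment itself rather than on individual pods. A minimal sketch of the relevant container args with the values mentioned above; the image tag and everything else are placeholders to adapt to your own install:

# Sketch: cluster-autoscaler container args (placeholder image tag, other flags omitted).
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0   # placeholder, match your cluster version
    command:
      - ./cluster-autoscaler
      - --scale-down-delay-after-add=1m   # wait 1m after a scale-up before evaluating scale-down
      - --scale-down-unneeded-time=1m     # a node must be unneeded for 1m before it is removed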
The issue we are facing is that sometimes the task pods fail, seemingly without reason, and the logs are quite unhelpful. My hunch is that it has something to do with scaling of the cluster, potentially destroying pods or losing networking between the flow and task pod in the process. We are already setting cluster-autoscaler.kubernetes.io/safe-to-evict: false
on all pods, so eviction shouldn't be the issue. Has anyone else had any experience with k8s autoscaler settings leading to weird, intermittent task failures?
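For illustration, the annotation mentioned above sits in the pod metadata; a minimal sketch with placeholder names and image, not the actual spec used here:

# Sketch: safe-to-evict annotation on a pod (name and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: prefect-task-pod   # placeholder
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"   # ask the autoscaler not to evict this pod when scaling down a node
spec:
  containers:
    - name: task
      image: my-task-image:latest   # placeholder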
Kevin Kho
Thomas Nyegaard-Signori
11/03/2021, 1:59 PM
Kevin Kho
Mariia Kerimova
11/03/2021, 2:43 PM
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false" should resolve the issue and prevent pods from being evicted during a scale-down event. Feels like the issue might be something else. The common issue I've seen is when pods without cpu and memory requests are scheduled on nodes without enough resources. Eventually those pods get killed for memory violations, or they run with continuous cpu throttling. If pods were killed that way, you can usually see it in the kubernetes events for that namespace.
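To illustrate that suggestion: a container spec with explicit requests and limits might look roughly like the sketch below (values and names are placeholders, not tuned recommendations), and the namespace events can be checked with kubectl.

# Sketch: explicit cpu/memory requests and limits on a task container (placeholder values).
spec:
  containers:
    - name: task
      image: my-task-image:latest
      resources:
        requests:
          cpu: "500m"       # scheduler only places the pod on a node with this much spare cpu
          memory: "512Mi"   # and this much spare memory
        limits:
          cpu: "1"          # throttled above this
          memory: "1Gi"     # OOM-killed above this, which shows up in the namespace events
# Check for kills/evictions in the namespace events, e.g.:
#   kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp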
Did you try to increase the scale-down delay to >1m? Also, scale-down-utilization-threshold could be tweaked as well (default is 0.5).
Thomas Nyegaard-Signori
11/04/2021, 7:37 AM
As for scale-down-utilization-threshold, in my head that should be lowered to something like ~0.2-0.3, would you agree?
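Putting the tweaks discussed in this thread together, the autoscaler args might end up looking something like the sketch below; the 5m and 0.3 values are illustrative assumptions, not settings confirmed in the thread.

# Sketch: less aggressive scale-down (illustrative values, adjust to your workload).
command:
  - ./cluster-autoscaler
  - --scale-down-delay-after-add=5m          # give freshly added nodes more time before scale-down is considered
  - --scale-down-unneeded-time=5m            # require a node to be unneeded for longer before removal
  - --scale-down-utilization-threshold=0.3   # only nodes below 30% utilization become scale-down candidates

Lowering the threshold makes scale-down less aggressive, since fewer nodes qualify as removal candidates, which is the direction suggested above.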