Description of problem:
It was reported that the ClusterAutoscaler would not trigger provisioning of additional OpenShift Node(s), even though pods were in pending state waiting for additional capacity. It remained that way until the deployment was scaled to 40 replicas; after that the ClusterAutoscaler triggered provisioning of Nodes to satisfy the pods' needs. The question is why it did not trigger the provisioning earlier, when 2 or more pods were already pending and waiting for capacity. All deployment details were checked, including the matching node-selector, taints, MachineSet configuration, etc., and no culprit was found. Also, if something like a node-selector had caused the issue, scaling would not have been triggered when the replica count was increased to 40.

Version-Release number of selected component (if applicable):
- 4.4.17

How reproducible:
- N/A

Steps to Reproduce:
1. N/A

Actual results:
The ClusterAutoscaler would not trigger provisioning of additional OpenShift Node(s), even though pods were in pending state, waiting for resources.

Expected results:
Additional OpenShift Nodes are provisioned when pods are pending, waiting for resources, and matching the criteria for auto-scaling.

Additional info:
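For reference, a minimal sketch of the autoscaler resources that need to exist for scale-up to trigger. The MachineAutoscaler name, MachineSet name, and replica limits below are placeholders, not values from this cluster:

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-zone-a              # placeholder name
  namespace: openshift-machine-api
spec:
  minReplicas: 1                   # placeholder limits
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-zone-a            # must reference an existing MachineSet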
I can't find machineAutoscaler/clusterAutoscaler resources in the must-gather logs. The cluster autoscaler logs don't show any nginx-deployment-* pod, only one named nginx-app-1-build. All scale-up cycles return with:

"2020-09-16T08:58:21.943253101Z I0916 08:58:21.943194 1 scale_up.go:423] No expansion options"

How long had it been since you scaled up to 40 replicas?
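As a side note, "No expansion options" generally means the autoscaler simulated adding a node from each node group it knows about and concluded none of them would schedule the pending pods. One thing worth comparing is the pod's nodeSelector against the labels the MachineSet applies to the nodes it creates; a rough sketch of the two places to look (the label below is a placeholder, not from this cluster):

# Deployment side: nodeSelector on the pod template (placeholder label)
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/app: ""
---
# MachineSet side: labels applied to nodes created from this MachineSet (placeholder label)
spec:
  template:
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/app: ""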
I've been working on an issue that seems very similar to this one: https://bugzilla.redhat.com/show_bug.cgi?id=1891551. Is it possible that, while the scale-up decisions were being made, the nodes in the node groups that would have matched the pods were all unhealthy? If they were unhealthy, the autoscaler discards them and assumes either that a healthy node will appear within the node group, or that no node in the node group is ever healthy, so it won't scale up.