Bug 1880930
| Summary: | ClusterAutoscaler not scaling when there are pending pods - only after increasing number of replicas massively | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Reber <sreber> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | aarapov, jhou, jspeed, mgugino, rsandu |
| Version: | 4.4 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.7.0 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-11-30 15:33:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Simon Reber
2020-09-21 07:38:43 UTC
I can't find any machineAutoscaler/clusterAutoscaler resources in the must-gather logs. The cluster autoscaler logs don't show any nginx-deployment-* pods, only one named nginx-app-1-build. Every scale-up cycle returns: "2020-09-16T08:58:21.943253101Z I0916 08:58:21.943194 1 scale_up.go:423] No expansion options"

How long had it been since you scaled up to 40 replicas?

I've been working on an issue that seems very similar to this one: https://bugzilla.redhat.com/show_bug.cgi?id=1891551

Is it possible that, while the scale-up decisions were being made, the nodes in the node groups that would have matched the pods were all unhealthy? If they were unhealthy, the autoscaler discards them and assumes either that there will be a healthy node within the node group, or that no node in the node group is ever healthy, in which case it won't scale up.
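
To check how often the scale-up cycles failed, the autoscaler log from the must-gather can be scanned for the message quoted above. The following is only a minimal sketch: the function name `failed_scale_ups` and the regular expression are illustrative assumptions based solely on the single log line in this report, not part of the cluster autoscaler or any OpenShift tooling.

```python
import re

# Assumed line layout, taken from the one log line quoted in this report:
# "<must-gather timestamp> <klog level+MMDD> <time> <pid> <file>:<line>] <message>"
KLOG_RE = re.compile(
    r"^(?P<ts>\S+)\s+"                      # must-gather timestamp prefix
    r"(?P<level>[IWEF])\d{4}\s+"            # klog severity letter plus MMDD
    r"\d{2}:\d{2}:\d{2}\.\d+\s+"            # klog wall-clock time
    r"\d+\s+"                               # process/thread id
    r"(?P<source>[\w.]+):(?P<line>\d+)\]\s+"  # source file and line number
    r"(?P<msg>.*)$"                         # free-form message
)


def failed_scale_ups(lines):
    """Return the timestamps of scale-up cycles that found no expansion options."""
    hits = []
    for raw in lines:
        m = KLOG_RE.match(raw.strip())
        if m and m["source"] == "scale_up.go" and "No expansion options" in m["msg"]:
            hits.append(m["ts"])
    return hits


if __name__ == "__main__":
    sample = [
        "2020-09-16T08:58:21.943253101Z I0916 08:58:21.943194 1 "
        "scale_up.go:423] No expansion options",
    ]
    for ts in failed_scale_ups(sample):
        print(ts)
```

Comparing the returned timestamps with the time the deployment was scaled to 40 replicas would show whether the failing cycles happened before or after that change.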