Bug 1872659
| Summary: | ClusterAutoscaler doesn't scale down when a node is not needed anymore | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Daniel <dmaizel> |
| Component: | Cloud Compute | Assignee: | Steven Hardy <shardy> |
| Cloud Compute sub component: | BareMetal Provider | QA Contact: | Daniel <dmaizel> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | augol, beth.white, dmaizel, jrouth, mgugino, mimccune, rbartal, shardy, stbenjam |
| Version: | 4.6 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:32:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1853267, 1901040, 1909682 | | |
| Bug Blocks: | | | |
Description
Daniel
2020-08-26 10:23:53 UTC
Need must gather.

I deployed a cluster with 3 running workers with no pods running on them. I created a ClusterAutoscaler and a MachineAutoscaler which are configured to start scaling down after 30 seconds of unneeded time. The minimum replicas I specified is 1. Since there is no load on the workers, I expected it to start scaling down. Instead, when looking at the cluster-autoscaler pod logs I noticed a log message saying "ignoring 3 nodes unremovable".

Link to must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1872659.tar.gz

ClusterAutoscaler and MachineAutoscaler manifests are available at https://gist.github.com/dmaizel/17be497171e9a46a4ccc66da7ac8c5a5

Just wanted to drop a comment here. I looked through the must-gather info and I didn't see anything that immediately stood out as incorrect with the ClusterAutoscaler or MachineAutoscaler (although it looks like the resource manifests are different than the ones linked in the gist). I also looked through the cluster-autoscaler logs and the Machine and MachineSet resources; nothing looked suspicious to me.

@Daniel, would you be able to re-run these tests but increase the verbosity of the autoscaler logs by injecting the environment variable `CLUSTER_AUTOSCALER_VERBOSITY=5` into the cluster-autoscaler deployment? This would allow us to see the utilization metrics, and other internal info, about how the scaler is deciding to skip those machines. I'm curious whether there might be some discrepancy between what we consider to be no load on a machine versus what the autoscaler thinks.

It appears to be failing because it thinks those machines' utilization is above the minimum threshold:

I0916 08:39:43.976179 1 static_autoscaler.go:449] Calculating unneeded nodes
I0916 08:39:43.976284 1 pre_filtering_processor.go:57] Skipping worker-0-2 - no node group config
I0916 08:39:43.976325 1 pre_filtering_processor.go:57] Skipping master-0-0 - no node group config
I0916 08:39:43.976375 1 pre_filtering_processor.go:57] Skipping master-0-1 - no node group config
I0916 08:39:43.976409 1 pre_filtering_processor.go:57] Skipping master-0-2 - no node group config
I0916 08:39:43.976551 1 pre_filtering_processor.go:57] Skipping worker-0-0 - no node group config
I0916 08:39:43.976654 1 pre_filtering_processor.go:57] Skipping worker-0-1 - no node group config

I traced through the autoscaler code and it seems like the reason for that error is likely to be either that the node/machine ProviderID doesn't match, or that the machine ownerReference for the MachineSet is missing (which we fixed already); however, both seem OK AFAICS:
$ oc get machine -o json ostest-jmgbm-worker-0-r7qrm | jq .metadata.name,.metadata.ownerReferences,.spec.providerID
"ostest-jmgbm-worker-0-r7qrm"
[
{
"apiVersion": "machine.openshift.io/v1beta1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "MachineSet",
"name": "ostest-jmgbm-worker-0",
"uid": "f0a8ae43-6089-46db-8a28-10e4c7cd0be9"
}
]
"baremetalhost:///openshift-machine-api/ostest-worker-1"
$ oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
ostest-jmgbm-worker-0 3 3 3 3 20h
$ oc get node worker-1 -o json | jq .spec.providerID
"baremetalhost:///openshift-machine-api/ostest-worker-1"
Not an issue anymore.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438