Description of problem:
ClusterAutoscaler doesn't scale down when a node is no longer needed.

Version-Release number of selected component (if applicable):
Client Version: 4.6.0-0.nightly-2020-08-26-032807
Server Version: 4.6.0-0.nightly-2020-08-26-032807
Kubernetes Version: v1.19.0-rc.2+aaf4ce1-dirty

How reproducible:
Every time

Steps to Reproduce:
Cluster setup: 2 deployed workers and 1 provisioned-only worker.
* Full instructions are in the attached test case.
1. Create a new bmh (BareMetalHost) and wait for it to reach the "Ready" state.
2. Create a new ClusterAutoscaler.
3. Create a new MachineAutoscaler.
4. Create a stress deployment (a custom image which runs the "stress" tool to simulate memory pressure) that requests 6500Mi of memory per container (so that each node can run only 1 pod), then create 1 pod from this deployment.
5. Scale the stress deployment to 3 replicas.
6. Two pods should be running (1 on each worker), while 1 application pod stays pending because the cluster does not have enough resources to schedule it.
7. After some time a new machine should be created, the bmh should become provisioned, and a new worker should be created. Now all 3 pods are running (1 on each worker).
8. Scale the stress deployment to 2 replicas.

Actual results:
The newly created pod is deleted, but even after waiting more than an hour the machineset doesn't scale down.
* scale-down-delay-after-add: how long after a scale-up before scale-down evaluation resumes (defaults to 10 min)
* scale-down-unneeded-time: how long a node should be unneeded before it is eligible for scale-down (defaults to 10 min)

Expected results:
Based on scale-down-delay-after-add and scale-down-unneeded-time, the machineset should scale down once the new node is no longer needed.

Additional info:
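For reference, a minimal sketch of the two resources involved, showing where the scale-down knobs above live in the OpenShift CRs. The names, namespaces, and timing values here are illustrative assumptions, not the exact manifests from the test run (those are linked in a gist in a later comment):

```yaml
# Sketch only: ClusterAutoscaler with the scale-down settings discussed above.
# delayAfterAdd corresponds to scale-down-delay-after-add, unneededTime to
# scale-down-unneeded-time.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    unneededTime: 10m
---
# Sketch only: MachineAutoscaler targeting the worker MachineSet
# (the machineset name is taken from output later in this bug).
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-0
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ostest-jmgbm-worker-0
```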
Need must gather.
I deployed a cluster with 3 running workers with no pods running on them. I created a ClusterAutoscaler and a MachineAutoscaler which are configured to start scaling down after 30 seconds of unneeded time. The minimum replicas I specified is 1. So since there is no load on the workers, I expected scale-down to start. Instead, when looking at the cluster-autoscaler pod I noticed a log message saying "ignoring 3 nodes unremovable". Link to must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1872659.tar.gz ClusterAutoscaler and MachineAutoscaler manifests are available at https://gist.github.com/dmaizel/17be497171e9a46a4ccc66da7ac8c5a5
Just wanted to drop a comment here. I looked through the must-gather info and I didn't see anything that immediately stood out as incorrect with the ClusterAutoscaler or MachineAutoscaler (although it looks like the resource manifests are different from the ones linked in the gist). I also looked through the cluster-autoscaler logs and the Machine and MachineSet resources; nothing looked suspicious to me. @Daniel, would you be able to re-run these tests but increase the verbosity of the autoscaler logs by injecting the environment variable `CLUSTER_AUTOSCALER_VERBOSITY=5` into the cluster-autoscaler deployment? This would allow us to see the utilization metrics, and other internal info, about how the scaler is deciding to skip those machines. I'm curious whether there might be some discrepancy between what we consider to be no load on a machine versus what the autoscaler thinks. It appears to be failing because it thinks those machines' utilization is above the minimum threshold.
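As a hedged sketch of what that injection could look like, this is the env-var stanza for the cluster-autoscaler container spec (the deployment layout and whether the operator reconciles a manual edit away are assumptions, not taken from this bug; `oc set env` on the deployment is one way to apply it):

```yaml
# Hypothetical fragment of the cluster-autoscaler container spec
# in the openshift-machine-api namespace:
env:
  - name: CLUSTER_AUTOSCALER_VERBOSITY
    value: "5"
```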
I0916 08:39:43.976179       1 static_autoscaler.go:449] Calculating unneeded nodes
I0916 08:39:43.976284       1 pre_filtering_processor.go:57] Skipping worker-0-2 - no node group config
I0916 08:39:43.976325       1 pre_filtering_processor.go:57] Skipping master-0-0 - no node group config
I0916 08:39:43.976375       1 pre_filtering_processor.go:57] Skipping master-0-1 - no node group config
I0916 08:39:43.976409       1 pre_filtering_processor.go:57] Skipping master-0-2 - no node group config
I0916 08:39:43.976551       1 pre_filtering_processor.go:57] Skipping worker-0-0 - no node group config
I0916 08:39:43.976654       1 pre_filtering_processor.go:57] Skipping worker-0-1 - no node group config
I traced through the autoscaler code, and it seems the reason for that error is likely either that the node/machine ProviderID doesn't match, or that the machine's ownerReference to the MachineSet is missing (we already fixed an issue in this area). However, both seem OK AFAICS:

$ oc get machine -o json ostest-jmgbm-worker-0-r7qrm | jq .metadata.name,.metadata.ownerReferences,.spec.providerID
"ostest-jmgbm-worker-0-r7qrm"
[
  {
    "apiVersion": "machine.openshift.io/v1beta1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "MachineSet",
    "name": "ostest-jmgbm-worker-0",
    "uid": "f0a8ae43-6089-46db-8a28-10e4c7cd0be9"
  }
]
"baremetalhost:///openshift-machine-api/ostest-worker-1"

$ oc get machineset
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-jmgbm-worker-0   3         3         3       3           20h

$ oc get node worker-1 -o json | jq .spec.providerID
"baremetalhost:///openshift-machine-api/ostest-worker-1"
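The comparison above can be sketched as a quick shell check. The values below are hardcoded from the output above; on a live cluster each would come from a jsonpath query such as `oc get node worker-1 -o jsonpath='{.spec.providerID}'`:

```shell
# Sanity check: the autoscaler can only map a node to a node group when the
# node's .spec.providerID matches a machine's .spec.providerID exactly.
machine_pid="baremetalhost:///openshift-machine-api/ostest-worker-1"
node_pid="baremetalhost:///openshift-machine-api/ostest-worker-1"

if [ "$machine_pid" = "$node_pid" ]; then
  echo "providerID match"
else
  echo "providerID MISMATCH: machine=$machine_pid node=$node_pid"
fi
```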
must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1872659.tar.gz
Not an issue anymore.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438