Description of problem:
The autoscaler should not scale down nodes whose utilization is above the scale-down utilization threshold, but it removes them anyway. For example, nodes are removed even when utilizationThreshold is set to "0.001".

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-25-023600

How reproducible:
Always

Steps to Reproduce:
1. Create a ClusterAutoscaler:

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
    utilizationThreshold: "0.001"

2. Create a MachineAutoscaler:

liuhuali@Lius-MacBook-Pro huali-test % oc get machineautoscaler
NAME                REF KIND     REF NAME                MIN   MAX   AGE
machineautoscaler   MachineSet   huliu-061-qvps9-47656   1     3     134m

3. Scale down the operators so the autoscaler deployment can be edited directly, and raise the log level:

$ oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
$ oc scale deployment cluster-autoscaler-operator --replicas=0
$ oc edit deploy cluster-autoscaler-default
    - --v=4

4. Create a workload to trigger scale-up.
5. Wait for the new nodes to join the cluster, then delete the workload.
6. Check whether the machines are scaled down, and check the autoscaler logs.

Actual results:
Nodes are removed even with "--scale-down-utilization-threshold=0.001":

spec:
  containers:
  - args:
    - --logtostderr
    - --v=4
    - --cloud-provider=clusterapi
    - --namespace=openshift-machine-api
    - --scale-down-enabled=true
    - --scale-down-delay-after-add=10s
    - --scale-down-delay-after-delete=10s
    - --scale-down-delay-after-failure=10s
    - --scale-down-unneeded-time=10s
    - --scale-down-utilization-threshold=0.001

After adding the workload, machineset huliu-061-qvps9-47656 scales up to 3 machines:

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                      PHASE      TYPE            REGION      ZONE         AGE
huliu-061-qvps9-47656-bdt4p               Running    ecs.g6.large    us-east-1   us-east-1a   10m
huliu-061-qvps9-47656-nbbdf               Running    ecs.g6.large    us-east-1   us-east-1a   20m
huliu-061-qvps9-47656-nm7l7               Running    ecs.g6.large    us-east-1   us-east-1a   20m
huliu-061-qvps9-master-0                  Running    ecs.g6.xlarge   us-east-1   us-east-1b   5h1m
huliu-061-qvps9-master-1                  Running    ecs.g6.xlarge   us-east-1   us-east-1a   5h1m
huliu-061-qvps9-master-2                  Running    ecs.g6.xlarge   us-east-1   us-east-1b   5h1m
huliu-061-qvps9-worker-us-east-1a-4p5zk   Deleting   ecs.g6.large    us-east-1   us-east-1a   4h55m
huliu-061-qvps9-worker-us-east-1a-kcw5p   Running    ecs.g6.large    us-east-1   us-east-1a   145m
huliu-061-qvps9-worker-us-east-1b-mmh96   Running    ecs.g6.large    us-east-1   us-east-1b   4h55m
huliu-061-qvps9-worker-us-east-1b-sv4rg   Running    ecs.g6.large    us-east-1   us-east-1b   4h55m

After removing the workload, machineset huliu-061-qvps9-47656 scales down to 1 machine:

liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                      STATUS                     ROLES    AGE     VERSION
huliu-061-qvps9-47656-bdt4p               Ready                      worker   18m     v1.23.0+06791f6
huliu-061-qvps9-master-0                  Ready                      master   5h13m   v1.23.0+06791f6
huliu-061-qvps9-master-1                  Ready                      master   5h13m   v1.23.0+06791f6
huliu-061-qvps9-master-2                  Ready                      master   5h13m   v1.23.0+06791f6
huliu-061-qvps9-worker-us-east-1a-4p5zk   Ready,SchedulingDisabled   worker   4h53m   v1.23.0+06791f6
huliu-061-qvps9-worker-us-east-1a-kcw5p   Ready                      worker   153m    v1.23.0+06791f6
huliu-061-qvps9-worker-us-east-1b-mmh96   Ready                      worker   4h53m   v1.23.0+06791f6
huliu-061-qvps9-worker-us-east-1b-sv4rg   Ready                      worker   4h53m   v1.23.0+06791f6
liuhuali@Lius-MacBook-Pro huali-test %

liuhuali@Lius-MacBook-Pro huali-test % oc logs cluster-autoscaler-default-7d8fdcf4f-dv5l7 | grep utilization
…
I0127 07:29:44.198266 1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.282667)
I0127 07:29:44.198369 1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:29:56.010899 1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.282667)
I0127 07:29:56.010995 1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:30:07.824527 1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:30:07.824667 1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.289333)
I0127 07:30:19.636137 1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.289333)
I0127 07:30:19.636228 1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:30:31.448220 1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:30:31.448331 1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.282667)

Expected results:
The cluster autoscaler should use utilizationThreshold to decide whether a node can be scaled down: only nodes whose utilization is below the threshold should be considered for removal (see the sketch below).

Additional info:
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=2042265
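For reference, a rough Go sketch of how the expected check behaves. This is a simplified illustration, not the actual scale_down.go code; the type and function names are hypothetical, and utilization here stands for the per-resource values the autoscaler logs (sum of pod requests divided by node allocatable):

package main

import "fmt"

// nodeUtilization is a simplified stand-in for the values seen in the logs above.
type nodeUtilization struct {
	CPU    float64
	Memory float64
}

// eligibleForScaleDown mirrors the expected behaviour: a node is only a
// scale-down candidate if its utilization is below the configured
// --scale-down-utilization-threshold for every resource.
func eligibleForScaleDown(u nodeUtilization, threshold float64) bool {
	if u.CPU >= threshold || u.Memory >= threshold {
		return false
	}
	return true
}

func main() {
	// Values taken from the log output above.
	u := nodeUtilization{CPU: 0.282667, Memory: 0.763669}
	// With a threshold of 0.001 this node must not be removed.
	fmt.Println(eligibleForScaleDown(u, 0.001)) // false
}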
liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0z869ibsr7
    providerID: alicloud://us-east-1.i-0xi7b9pbfh71qq3vtx7p
    providerID: alicloud://us-east-1.i-0xibcz15pg2covq4hin9
    providerID: alicloud://us-east-1.i-0xibcz15pg2covq4hina
    providerID: alicloud://us-east-1.i-0xi7b9pbfh71qw0z636n
    providerID: alicloud://us-east-1.i-0xi8mzg10cyv17x9vvv9
    providerID: alicloud://us-east-1.i-0xi12lyp1f0z4k1gw4y9
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2av42p3yh
liuhuali@Lius-MacBook-Pro huali-test % oc get node -o yaml | grep providerID
    providerID: us-east-1.i-0xi12lyp1f0z869ibsr7
    providerID: us-east-1.i-0xi7b9pbfh71qq3vtx7p
    providerID: us-east-1.i-0xibcz15pg2covq4hin9
    providerID: us-east-1.i-0xibcz15pg2covq4hina
    providerID: us-east-1.i-0xi7b9pbfh71qw0z636n
    providerID: us-east-1.i-0xi8mzg10cyv17x9vvv9
    providerID: us-east-1.i-0xi12lyp1f0z4k1gw4y9
    providerID: us-east-1.i-0xi7bf33rrl2av42p3yh
liuhuali@Lius-MacBook-Pro huali-test %
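Assuming the node-to-machine mapping is done by comparing these providerID strings (an assumption for illustration; this is not the actual lookup code), the mismatch above means a direct comparison never matches:

package main

import "fmt"

func main() {
	// Values copied from the output above.
	machineProviderID := "alicloud://us-east-1.i-0xi8mzg10cyv17x9vvv9"
	nodeProviderID := "us-east-1.i-0xi8mzg10cyv17x9vvv9"

	// Only the Machine carries the "alicloud://" scheme prefix,
	// so the two fields never compare equal.
	fmt.Println(machineProviderID == nodeProviderID) // false
}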
This is the same issue as in IBM: the provider ID on the Node (us-east-1.i-0xi8mzg10cyv17x9vvv9) does not match the providerID on the Machine (alicloud://us-east-1.i-0xi8mzg10cyv17x9vvv9). We need to make these consistent, and in this case I expect the CCM should be updated, as it should be adding an `alicloud` identifier to the Node provider ID so that the nodes are identifiable as Alibaba Cloud instances.
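As a rough illustration of what the CCM-side change would look like (a minimal sketch; the helper name is hypothetical and this is not the actual Alibaba Cloud CCM code), the provider ID written to the Node would be normalized to carry the alicloud:// scheme:

package main

import (
	"fmt"
	"strings"
)

// normalizeProviderID is a hypothetical helper: it ensures the provider ID
// the CCM sets on the Node carries the "alicloud://" scheme so it matches
// the providerID stored on the corresponding Machine.
func normalizeProviderID(id string) string {
	const prefix = "alicloud://"
	if strings.HasPrefix(id, prefix) {
		return id
	}
	return prefix + id
}

func main() {
	fmt.Println(normalizeProviderID("us-east-1.i-0xi8mzg10cyv17x9vvv9"))
	// alicloud://us-east-1.i-0xi8mzg10cyv17x9vvv9
}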
Verified clusterversion: 4.11.0-0.nightly-2022-01-27-182501
Tested with the above steps; "--scale-down-utilization-threshold" works as expected. With utilizationThreshold: "0.001" the nodes are not removed, and the providerIDs match.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zlxi9lokg
    providerID: alicloud://us-east-1.i-0xi8mzg10cyvgqb0rj11
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2s8kven7o
    providerID: alicloud://us-east-1.i-0xiful2e3wjlzrnacgnr
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2tvrssbbe
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2txqtwdb9
    providerID: alicloud://us-east-1.i-0xi8mzg10cyvgw843p0b
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2sehyqt9l
    providerID: alicloud://us-east-1.i-0xiful2e3wjly6fe2uwp
liuhuali@Lius-MacBook-Pro huali-test % oc get node -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zlxi9lokg
    providerID: alicloud://us-east-1.i-0xi8mzg10cyvgqb0rj11
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2s8kven7o
    providerID: alicloud://us-east-1.i-0xiful2e3wjlzrnacgnr
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2tvrssbbe
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2txqtwdb9
    providerID: alicloud://us-east-1.i-0xi8mzg10cyvgw843p0b
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2sehyqt9l
    providerID: alicloud://us-east-1.i-0xiful2e3wjly6fe2uwp
liuhuali@Lius-MacBook-Pro huali-test %
Also verified on 4.10.0-0.nightly-2022-01-27-221656
Tested with the above steps; "--scale-down-utilization-threshold" works as expected. With utilizationThreshold: "0.001" the nodes are not removed, and the providerIDs match.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zqt40mm2f
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zqt40mm2g
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2x46mfkpw
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2xnwxk4ng
    providerID: alicloud://us-east-1.i-0xi7b9pbfh72dwpyhbly
    providerID: alicloud://us-east-1.i-0xi7b9pbfh72dwpyhbm2
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2xa3prqoq
    providerID: alicloud://us-east-1.i-0xibcz15pg2dbilw0djn
    providerID: alicloud://us-east-1.i-0xiful2e3wjm32153s7r
liuhuali@Lius-MacBook-Pro huali-test % oc get node -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zqt40mm2f
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zqt40mm2g
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2x46mfkpw
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2xnwxk4ng
    providerID: alicloud://us-east-1.i-0xi7b9pbfh72dwpyhbly
    providerID: alicloud://us-east-1.i-0xi7b9pbfh72dwpyhbm2
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2xa3prqoq
    providerID: alicloud://us-east-1.i-0xibcz15pg2dbilw0djn
    providerID: alicloud://us-east-1.i-0xiful2e3wjm32153s7r
liuhuali@Lius-MacBook-Pro huali-test %
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056