Bug 2046683 - [AliCloud]"--scale-down-utilization-threshold" doesn't work on AliCloud
Summary: [AliCloud]"--scale-down-utilization-threshold" doesn't work on AliCloud
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Joel Speed
QA Contact: Huali Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-27 08:37 UTC by Huali Liu
Modified: 2022-03-12 04:42 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:41:56 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift/machine-config-operator pull 2931: Bug 2046683: Ensure correct providerID format for Alibaba nodes (open, last updated 2022-01-27 09:57:43 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:42:08 UTC)

Description Huali Liu 2022-01-27 08:37:52 UTC
Description of problem:
The autoscaler should only scale down nodes whose utilization is below the scale-down utilization threshold, but it removes nodes regardless of the threshold. For example, nodes are removed even when utilizationThreshold is set to "0.001", which should prevent any scale down.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-25-023600

How reproducible:
Always

Steps to Reproduce:
1. Create a ClusterAutoscaler:
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
    utilizationThreshold: "0.001"

2. Create a MachineAutoscaler targeting the MachineSet (a manifest sketch follows after step 6):
liuhuali@Lius-MacBook-Pro huali-test % oc get machineautoscaler
NAME                REF KIND     REF NAME                MIN   MAX   AGE
machineautoscaler   MachineSet   huliu-061-qvps9-47656   1     3     134m

3. Scale down the cluster-version-operator and the cluster-autoscaler-operator so the change is not reverted, then edit the autoscaler deployment to add "--v=4" to its container args:
$ oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
$ oc scale deployment cluster-autoscaler-operator --replicas=0
$ oc edit deploy cluster-autoscaler-default
        - --v=4

4. Create a workload to trigger a scale up (a sketch follows below).
5. Wait for the new nodes to join the cluster, then delete the workload.
6. Check whether the machines are scaled down, and check the autoscaler logs.
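
For reference, a MachineAutoscaler matching the output in step 2 would look roughly like the sketch below (reconstructed from the reported names and min/max values, not the exact manifest used):

apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "machineautoscaler"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: huliu-061-qvps9-47656

The workload used in step 4 is not included in the report; any deployment whose aggregate resource requests exceed the current capacity will do. A minimal sketch (name, replica count and requests are illustrative only):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up-workload        # hypothetical name
spec:
  replicas: 20
  selector:
    matchLabels:
      app: scale-up-workload
  template:
    metadata:
      labels:
        app: scale-up-workload
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 500m            # aggregate requests force new nodes to be added
            memory: 256Mi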


Actual results:
Nodes are removed even though "--scale-down-utilization-threshold=0.001" is set:

    spec:
      containers:
      - args:
        - --logtostderr
        - --v=4
        - --cloud-provider=clusterapi
        - --namespace=openshift-machine-api
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10s
        - --scale-down-delay-after-delete=10s
        - --scale-down-delay-after-failure=10s
        - --scale-down-unneeded-time=10s
        - --scale-down-utilization-threshold=0.001
After adding the workload, MachineSet huliu-061-qvps9-47656 scales up to 3 machines:
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                      PHASE      TYPE            REGION      ZONE         AGE
huliu-061-qvps9-47656-bdt4p               Running    ecs.g6.large    us-east-1   us-east-1a   10m
huliu-061-qvps9-47656-nbbdf               Running    ecs.g6.large    us-east-1   us-east-1a   20m
huliu-061-qvps9-47656-nm7l7               Running    ecs.g6.large    us-east-1   us-east-1a   20m
huliu-061-qvps9-master-0                  Running    ecs.g6.xlarge   us-east-1   us-east-1b   5h1m
huliu-061-qvps9-master-1                  Running    ecs.g6.xlarge   us-east-1   us-east-1a   5h1m
huliu-061-qvps9-master-2                  Running    ecs.g6.xlarge   us-east-1   us-east-1b   5h1m
huliu-061-qvps9-worker-us-east-1a-4p5zk   Deleting   ecs.g6.large    us-east-1   us-east-1a   4h55m
huliu-061-qvps9-worker-us-east-1a-kcw5p   Running    ecs.g6.large    us-east-1   us-east-1a   145m
huliu-061-qvps9-worker-us-east-1b-mmh96   Running    ecs.g6.large    us-east-1   us-east-1b   4h55m
huliu-061-qvps9-worker-us-east-1b-sv4rg   Running    ecs.g6.large    us-east-1   us-east-1b   4h55m


After removing the workload, MachineSet huliu-061-qvps9-47656 scales down to 1 machine:
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                      STATUS                     ROLES    AGE     VERSION
huliu-061-qvps9-47656-bdt4p               Ready                      worker   18m     v1.23.0+06791f6
huliu-061-qvps9-master-0                  Ready                      master   5h13m   v1.23.0+06791f6
huliu-061-qvps9-master-1                  Ready                      master   5h13m   v1.23.0+06791f6
huliu-061-qvps9-master-2                  Ready                      master   5h13m   v1.23.0+06791f6
huliu-061-qvps9-worker-us-east-1a-4p5zk   Ready,SchedulingDisabled   worker   4h53m   v1.23.0+06791f6
huliu-061-qvps9-worker-us-east-1a-kcw5p   Ready                      worker   153m    v1.23.0+06791f6
huliu-061-qvps9-worker-us-east-1b-mmh96   Ready                      worker   4h53m   v1.23.0+06791f6
huliu-061-qvps9-worker-us-east-1b-sv4rg   Ready                      worker   4h53m   v1.23.0+06791f6
liuhuali@Lius-MacBook-Pro huali-test % 


liuhuali@Lius-MacBook-Pro huali-test % oc logs cluster-autoscaler-default-7d8fdcf4f-dv5l7  | grep utilization
…
I0127 07:29:44.198266       1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.282667)
I0127 07:29:44.198369       1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:29:56.010899       1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.282667)
I0127 07:29:56.010995       1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:30:07.824527       1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:30:07.824667       1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.289333)
I0127 07:30:19.636137       1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.289333)
I0127 07:30:19.636228       1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:30:31.448220       1 scale_down.go:444] Node huliu-061-qvps9-47656-nbbdf is not suitable for removal - memory utilization too big (0.763669)
I0127 07:30:31.448331       1 scale_down.go:444] Node huliu-061-qvps9-47656-nm7l7 is not suitable for removal - cpu utilization too big (0.282667)

Expected results:
The cluster autoscaler should use utilizationThreshold to decide whether a node is a candidate for scale down: only nodes whose utilization (the sum of pod requests divided by the node's allocatable, for CPU and memory) is below the threshold should be considered for removal.

Additional info:
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=2042265

Comment 1 Huali Liu 2022-01-27 08:57:53 UTC
liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0z869ibsr7
    providerID: alicloud://us-east-1.i-0xi7b9pbfh71qq3vtx7p
    providerID: alicloud://us-east-1.i-0xibcz15pg2covq4hin9
    providerID: alicloud://us-east-1.i-0xibcz15pg2covq4hina
    providerID: alicloud://us-east-1.i-0xi7b9pbfh71qw0z636n
    providerID: alicloud://us-east-1.i-0xi8mzg10cyv17x9vvv9
    providerID: alicloud://us-east-1.i-0xi12lyp1f0z4k1gw4y9
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2av42p3yh
liuhuali@Lius-MacBook-Pro huali-test % oc get node -o yaml | grep providerID
    providerID: us-east-1.i-0xi12lyp1f0z869ibsr7
    providerID: us-east-1.i-0xi7b9pbfh71qq3vtx7p
    providerID: us-east-1.i-0xibcz15pg2covq4hin9
    providerID: us-east-1.i-0xibcz15pg2covq4hina
    providerID: us-east-1.i-0xi7b9pbfh71qw0z636n
    providerID: us-east-1.i-0xi8mzg10cyv17x9vvv9
    providerID: us-east-1.i-0xi12lyp1f0z4k1gw4y9
    providerID: us-east-1.i-0xi7bf33rrl2av42p3yh
liuhuali@Lius-MacBook-Pro huali-test %

Comment 2 Joel Speed 2022-01-27 09:06:54 UTC
This is the same issue as on IBM Cloud: the providerID on the Node (us-east-1.i-0xi8mzg10cyv17x9vvv9) does not match the providerID on the Machine (alicloud://us-east-1.i-0xi8mzg10cyv17x9vvv9). These need to be made consistent, and in this case I expect the CCM should be updated, as it should be adding the `alicloud` identifier to the Node provider ID so that the instances are identifiable as Alibaba Cloud instances.
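
The mismatch can be listed side by side with something like the following (a sketch; .spec.providerID is the standard field on both Machine and Node objects):

$ oc get machines -n openshift-machine-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'

The autoscaler's clusterapi provider correlates Machines and Nodes by providerID, so the missing "alicloud://" prefix on the Node side prevents it from associating the new Nodes with their Machines.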

Comment 5 Huali Liu 2022-01-28 04:16:28 UTC
Verified
clusterversion: 4.11.0-0.nightly-2022-01-27-182501

Tested with the above steps; "--scale-down-utilization-threshold" works as expected. With utilizationThreshold: "0.001" set, the nodes are not removed, and the providerIDs match.
liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zlxi9lokg
    providerID: alicloud://us-east-1.i-0xi8mzg10cyvgqb0rj11
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2s8kven7o
    providerID: alicloud://us-east-1.i-0xiful2e3wjlzrnacgnr
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2tvrssbbe
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2txqtwdb9
    providerID: alicloud://us-east-1.i-0xi8mzg10cyvgw843p0b
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2sehyqt9l
    providerID: alicloud://us-east-1.i-0xiful2e3wjly6fe2uwp
liuhuali@Lius-MacBook-Pro huali-test % oc get node -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zlxi9lokg
    providerID: alicloud://us-east-1.i-0xi8mzg10cyvgqb0rj11
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2s8kven7o
    providerID: alicloud://us-east-1.i-0xiful2e3wjlzrnacgnr
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2tvrssbbe
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2txqtwdb9
    providerID: alicloud://us-east-1.i-0xi8mzg10cyvgw843p0b
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2sehyqt9l
    providerID: alicloud://us-east-1.i-0xiful2e3wjly6fe2uwp
liuhuali@Lius-MacBook-Pro huali-test %

Comment 6 Huali Liu 2022-01-28 08:58:47 UTC
Also verified on 4.10.0-0.nightly-2022-01-27-221656

Tested with the above steps; "--scale-down-utilization-threshold" works as expected. With utilizationThreshold: "0.001" set, the nodes are not removed, and the providerIDs match.
liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zqt40mm2f
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zqt40mm2g
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2x46mfkpw
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2xnwxk4ng
    providerID: alicloud://us-east-1.i-0xi7b9pbfh72dwpyhbly
    providerID: alicloud://us-east-1.i-0xi7b9pbfh72dwpyhbm2
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2xa3prqoq
    providerID: alicloud://us-east-1.i-0xibcz15pg2dbilw0djn
    providerID: alicloud://us-east-1.i-0xiful2e3wjm32153s7r
liuhuali@Lius-MacBook-Pro huali-test % oc get node -o yaml | grep providerID
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zqt40mm2f
    providerID: alicloud://us-east-1.i-0xi12lyp1f0zqt40mm2g
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2x46mfkpw
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2xnwxk4ng
    providerID: alicloud://us-east-1.i-0xi7b9pbfh72dwpyhbly
    providerID: alicloud://us-east-1.i-0xi7b9pbfh72dwpyhbm2
    providerID: alicloud://us-east-1.i-0xi7bf33rrl2xa3prqoq
    providerID: alicloud://us-east-1.i-0xibcz15pg2dbilw0djn
    providerID: alicloud://us-east-1.i-0xiful2e3wjm32153s7r
liuhuali@Lius-MacBook-Pro huali-test %

Comment 9 errata-xmlrpc 2022-03-12 04:41:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

