Bug 2001027

Summary: ClusterAutoscaler with balanceSimilarNodeGroups does not scale evenly across MachineSets
Product: OpenShift Container Platform
Reporter: Simon Reber <sreber>
Component: Cloud Compute
Sub component: Cluster Autoscaler
Assignee: Michael McCune <mimccune>
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Docs Contact: Jeana Routh <jrouth>
Severity: medium    
Priority: medium
CC: aos-bugs, dpateriy, mfedosin, mharri, mimccune, oarribas, zhsun
Version: 4.8   
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Doc Text:
* Previously, the cluster autoscaler did not respect the AWS, IBM Cloud, and Alibaba Cloud topology labels for the CSI drivers when using the Cluster API provider. As a result, nodes with the topology label were not processed properly by the autoscaler when attempting to balance nodes during a scale-out event. With this release, the autoscaler's custom processors are updated so that they respect these labels. The autoscaler can now balance similar node groups that are labelled with the AWS, IBM Cloud, or Alibaba Cloud CSI labels. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2001027[*BZ#2001027*])
Story Points: ---
Last Closed: 2023-01-17 19:46:45 UTC
Type: Bug

Description Simon Reber 2021-09-03 15:03:06 UTC
Description of problem:

> $ oc get clusterautoscaler default -o json | jq '.spec'
> {
>   "balanceSimilarNodeGroups": true,
>   "podPriorityThreshold": -10,
>   "resourceLimits": {
>     "cores": {
>       "max": 1024,
>       "min": 8
>     },
>     "maxNodesTotal": 50,
>     "memory": {
>       "max": 8196,
>       "min": 4
>     }
>   },
>   "scaleDown": {
>     "delayAfterAdd": "5m",
>     "delayAfterDelete": "3m",
>     "delayAfterFailure": "30s",
>     "enabled": true,
>     "unneededTime": "60s"
>   }
> }
The ClusterAutoscaler is configured as above with `balanceSimilarNodeGroups` set to `true`. In addition, there are two MachineAutoscaler objects, called A and B, covering Availability Zones A and B. The respective `MachineSet` for A and B each have 3 OpenShift Container Platform Nodes running.
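
For reference, a minimal sketch of one of the two MachineAutoscaler objects described above (name, replica bounds, and MachineSet reference are illustrative, matching the MachineSet naming used later in this report):

```
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: machineautoscaler-a              # illustrative name for "A"
  namespace: openshift-machine-api
spec:
  minReplicas: 3
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: cluster1234567-bkk47-worker-us-west-1a   # MachineSet in zone A
```

A second, otherwise identical MachineAutoscaler ("B") would reference the MachineSet in zone B.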

When scaling a deployment such that two additional Nodes are required, we can see that the ClusterAutoscaler adds both Nodes to `MachineSet` A.

When triggering the next scale-up, again requiring 2 additional Nodes, we can see that these two Nodes are added to `MachineSet` B to balance the NodeGroups again (so this probably works as expected with `balanceSimilarNodeGroups` set to `true`, since the balancing decision appears to be made at the time scaling is required).

 - The above behavior can be validated and reproduced consistently.

When scaling a deployment such that 16 additional Nodes are required, it behaves differently and in a way that is not clearly transparent: the Nodes are not scaled in just `MachineSet` A or B, but are spread across the two rather unevenly.

> $ oc get machine -n openshift-machine-api
> NAME                                           PHASE         TYPE         REGION      ZONE         AGE
> cluster1234567-bkk47-master-0                  Running       m5.xlarge    us-west-1   us-west-1a   8d
> cluster1234567-bkk47-master-1                  Running       m5.xlarge    us-west-1   us-west-1c   8d
> cluster1234567-bkk47-master-2                  Running       m5.xlarge    us-west-1   us-west-1a   8d
> cluster1234567-bkk47-worker-us-west-1a-22vdg   Provisioned   m5.4xlarge   us-west-1   us-west-1a   92s
> cluster1234567-bkk47-worker-us-west-1a-czb5d   Running       m5.4xlarge   us-west-1   us-west-1a   4d8h
> cluster1234567-bkk47-worker-us-west-1a-kxz7w   Provisioned   m5.4xlarge   us-west-1   us-west-1a   92s
> cluster1234567-bkk47-worker-us-west-1a-pl49d   Provisioned   m5.4xlarge   us-west-1   us-west-1a   92s
> cluster1234567-bkk47-worker-us-west-1a-qcksj   Running       m5.4xlarge   us-west-1   us-west-1a   5h20m
> cluster1234567-bkk47-worker-us-west-1a-r7gd5   Running       m5.4xlarge   us-west-1   us-west-1a   2d1h
> cluster1234567-bkk47-worker-us-west-1a-w9l9w   Provisioned   m5.4xlarge   us-west-1   us-west-1a   92s
> cluster1234567-bkk47-worker-us-west-1c-278wl   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-767rz   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-7fkvs   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-9cnfd   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-fl9g6   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-hjpj6   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-jf9kg   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-pr2tf   Running       m5.4xlarge   us-west-1   us-west-1c   2d2h
> cluster1234567-bkk47-worker-us-west-1c-qhgbd   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-sqxd5   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-svv57   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-twfpp   Running       m5.4xlarge   us-west-1   us-west-1c   2d
> cluster1234567-bkk47-worker-us-west-1c-v6sdk   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-z57dk   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-zqs6v   Running       m5.4xlarge   us-west-1   us-west-1c   4d8h

The above shows how the Nodes are distributed among the two `MachineSet` objects: we have 4 Nodes in A and 12 Nodes in B, which appears rather uneven.

Thus we are wondering why the distribution is so uneven, and whether there is a way to scale large increments more evenly across the available `MachineSet` objects, especially when `balanceSimilarNodeGroups` is set to `true`.

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.8.5

How reproducible:

 - Always


Steps to Reproduce:
1. Install OpenShift Container Platform 4.8.5 in `us-west-1` using IPI
2. Configure the ClusterAutoscaler with `balanceSimilarNodeGroups` set to `true`
3. Use the Deployment below to scale pods for the scenario (a reconstructed manifest sketch follows the `describe` output)

> $ oc describe deployment strings
> Name:                   strings
> Namespace:              project-10
> CreationTimestamp:      Mon, 30 Aug 2021 09:37:57 +0200
> Labels:                 app=random
>                         component=strings
> Annotations:            deployment.kubernetes.io/revision: 3
> Selector:               app=random,component=strings
> Replicas:               0 desired | 0 updated | 0 total | 0 available | 0 unavailable
> StrategyType:           RollingUpdate
> MinReadySeconds:        0
> RollingUpdateStrategy:  25% max unavailable, 25% max surge
> Pod Template:
>   Labels:  app=random
>            component=strings
>   Containers:
>    strings:
>     Image:      quay.io/rhn_support_sreber/random@sha256:46da5bbc9d994036f98565ab8a3165d7fd9fd4fd2710751985d831c02c8f782a
>     Port:       8080/TCP
>     Host Port:  0/TCP
>     Limits:
>       cpu:     4
>       memory:  4Gi
>     Requests:
>       cpu:        4
>       memory:     4Gi
>     Environment:  <none>
>     Mounts:       <none>
>   Volumes:        <none>
> Conditions:
>   Type           Status  Reason
>   ----           ------  ------
>   Available      True    MinimumReplicasAvailable
>   Progressing    True    NewReplicaSetAvailable
> OldReplicaSets:  <none>
> NewReplicaSet:   strings-7bcbff8f67 (0/0 replicas created)
> Events:
>   Type    Reason             Age   From                   Message
>   ----    ------             ----  ----                   -------
>   Normal  ScalingReplicaSet  94m   deployment-controller  Scaled down replica set strings-7bcbff8f67 to 0
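
For reproduction, a Deployment manifest sketch reconstructed from the `describe` output above (only fields visible in that output are included; everything else is the minimal YAML needed to make it apply cleanly):

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: strings
  namespace: project-10
  labels:
    app: random
    component: strings
spec:
  replicas: 0
  selector:
    matchLabels:
      app: random
      component: strings
  template:
    metadata:
      labels:
        app: random
        component: strings
    spec:
      containers:
      - name: strings
        image: quay.io/rhn_support_sreber/random@sha256:46da5bbc9d994036f98565ab8a3165d7fd9fd4fd2710751985d831c02c8f782a
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "4"
            memory: 4Gi
```

Scaling the replica count then drives the autoscaler, for example `oc scale deployment strings -n project-10 --replicas=<N>` (the N needed to force a given number of new Nodes depends on the instance size).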


Actual results:

When performing a small scale-up (requiring a small number of Nodes), the scaling always happens in one `MachineSet`, but the following scale-up then restores the balance. With a large scale-up, however, the Nodes are distributed unevenly between the available `MachineSet` objects.

Expected results:

With `balanceSimilarNodeGroups` set to `true`, one would expect scaling to always be balanced as well as possible, especially when a large number of Nodes is required.

Additional info:

Comment 3 Michael McCune 2021-09-07 21:17:48 UTC
assigning this to myself as Simon and i discussed this in chat.

Comment 4 Michael McCune 2021-09-10 13:51:11 UTC
just wanted to leave a comment, i am able to reproduce this and am now digging deeper to understand what is happening.

Comment 5 Michael McCune 2021-09-13 18:53:58 UTC
@Simon, i've gotten a chance to dig in deeper and it is looking like the nodes will only balance on the same pass if the AWS zones are the same on the machinesets. I am not sure if this behavior is the same on other infrastructure providers but it seems consistent on AWS.

my recommendation in the short term is for users who need this balancing on single passes to use machinesets in the same zone. i realize this is an imposition for users who wish to have their workloads spread across multiple zones, but it seems like a limitation we have to live with currently.

in the longer term, i plan to investigate the issue further in the autoscaler source code and with the kubernetes autoscaling SIG. i have a feeling this is a bug that we could overcome on our implementation of the autoscaler infrastructure provider (clusterapi), but i want to make sure this isn't intended first.

i will leave this bug open until i can determine the proper fix.
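
for anyone following that short-term recommendation, a rough sketch of cloning an existing MachineSet into the same zone (the workflow is a suggestion; the fields to scrub are the usual server-populated ones):

```
# export a MachineSet that already lives in the desired zone
oc get machineset cluster1234567-bkk47-worker-us-west-1a \
  -n openshift-machine-api -o yaml > ms.yaml
# edit ms.yaml: set a new metadata.name, update the
# machine.openshift.io/cluster-api-machineset selector/template labels to
# match, remove status/creationTimestamp/resourceVersion/uid, and leave
# the zone and subnet in providerSpec unchanged
oc apply -f ms.yaml
```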

Comment 6 Simon Reber 2021-09-15 07:59:12 UTC
(In reply to Michael McCune from comment #5)
> @Simon, i've gotten a chance to dig in deeper and it is looking like the
> nodes will only balance on the same pass if the AWS zones are the same on
> the machinesets. I am not sure if this behavior is the same on other
> infrastructure providers but it seems consistent on AWS.
Do you know how Nodes are distributed across Availability Zones when doing a large scale-up? Meaning, how does it decide where and how many Nodes to bring up if we are deploying something that requires 20 additional Nodes? As we have seen, a small scale-up always happens in one Availability Zone, which is now explained (the why). But when a large number of Nodes is required, it will still create Nodes in multiple Availability Zones (just not evenly distributed).

Comment 7 Michael McCune 2021-09-15 16:52:34 UTC
(In reply to Simon Reber from comment #6)
> Do you know how Nodes are distributed across Availability Zones when doing
> a large scale-up? Meaning, how does it decide where and how many Nodes to
> bring up if we are deploying something that requires 20 additional Nodes?
> As we have seen, a small scale-up always happens in one Availability Zone,
> which is now explained (the why). But when a large number of Nodes is
> required, it will still create Nodes in multiple Availability Zones (just
> not evenly distributed).

in these cases, the autoscaler uses an expander[0] to determine which node group it should choose for scaling. by default it uses the "random" expander, and we do not expose an option to change this, so essentially it will choose a random node group from the ones that could support expansion. i could definitely see value in exposing more of those expander options, but i don't think we would be able to support the "price" expander without much more work on the machine controllers.

[0] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-expanders
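
for reference, the expander is selected with a command-line flag on the autoscaler binary; a sketch of what choosing a different strategy would look like if we exposed it (flag and values are from the upstream docs):

```
# upstream cluster-autoscaler flag; documented values include
# random (default), most-pods, least-waste, price, priority
- --expander=least-waste
```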

Comment 10 Michael McCune 2021-11-02 14:46:56 UTC
i have been researching this issue and i believe i have a solution after talking with the upstream community and examining the code. i think the issue here is that openshift adds some labels to machines which indicate the zone, and these labels make the autoscaler think the machines are different. the autoscaler is smart enough to ignore the default zone label added by kubernetes, but not the openshift-specific label. there is an option to the autoscaler which allows it to ignore specific labels; i think if we add the openshift labels to this, then we could get the default functionality back.

i am running some tests on this today, hopefully it will provide fruitful results.
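
for reference, the ignore option in question is the autoscaler's `--balancing-ignore-label` flag (it comes up again in the later comments); repeated once per label key, it looks like this:

```
- --balancing-ignore-label=<label-key-to-ignore>
```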

Comment 11 Michael McCune 2021-11-02 21:29:26 UTC
well, it turns out i was wrong about which label was causing the issue. it turns out that /something/ (i have not determined what yet) is applying the `topology.ebs.csi.aws.com/zone` label to nodes. this label carries the availability zone as its value, which in turn causes the autoscaler to consider the node groups as different.

this is related to the AWS CSI driver, and you can see it mentioned here[0]. i would like to do a little more research to determine whether other CSI drivers have similar labels. the patch i have up currently will not fix this issue, but once i have done some research we should be able to have a fix that will work.


[0] https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/34f6146bc0353f01442739c6a019379b164bcb17/docs/README.md#features-1
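
to check whether nodes carry this label and what value it holds, something like this should show it as a column:

```
oc get nodes -L topology.ebs.csi.aws.com/zone
```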

Comment 13 Michael McCune 2021-11-10 14:53:12 UTC
i have talked with the upstream sig autoscaling community and it seems like the best fix for this will be to add a custom nodegroupset processor to the autoscaler. i think this is a good way to approach the solution as it will provide benefit for the upstream cluster-api users as well. i have closed the previous patch and am working on an upstream solution now. once we have the PR merged into the upstream autoscaler, i will cherry pick it back into our fork.

Comment 14 Michael McCune 2021-11-10 22:11:56 UTC
i have created a pull request[0] in the upstream that will address this balancing issue. once it has merged there we will pick it up in the next rebase we do for the autoscaler, so this will hit the 4.10 release. i will update here as it progresses.

[0] https://github.com/kubernetes/autoscaler/pull/4458

Comment 15 Joel Speed 2022-01-14 13:03:23 UTC
The upstream PR has merged now, so this will be brought in by the upstream rebase which is currently in progress

Comment 16 Joel Speed 2022-01-21 10:13:34 UTC
The rebase has been merged and is in the latest nightly, the fix for this should be there as well

Comment 17 sunzhaohua 2022-01-25 09:00:29 UTC
This doesn't work on CCM-enabled clusters.
I tested on aws and azure clusters with CCM enabled; balanceSimilarNodeGroups doesn't work as expected, as it couldn't split the scale-up between similar node groups.
Without CCM, it works as expected.

With CCM:
1. create clusterautoscaler
---------
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  resourceLimits:
    maxNodesTotal: 20
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
2. create 3 machineautoscaler
$ oc get machineautoscaler                                                                                        [16:55:19]
NAME                 REF KIND     REF NAME                              MIN   MAX   AGE
machineautoscaler1   MachineSet   zhsunaws252-krx2n-worker-us-east-2a   1     10    11m
machineautoscaler2   MachineSet   zhsunaws252-krx2n-worker-us-east-2b   1     10    10m
machineautoscaler3   MachineSet   zhsunaws252-krx2n-worker-us-east-2c   1     10    9m54s

3. create workload
4. check autoscaler logs and machineset
$ oc get machineset                                                                                               [16:55:26]
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws252-krx2n-worker-us-east-2a   3         3         3       3           65m
zhsunaws252-krx2n-worker-us-east-2b   10        10        10      10          65m
zhsunaws252-krx2n-worker-us-east-2c   1         1         1       1           65m

I0125 08:46:06.919232       1 klogx.go:86] 44 other pods are also unschedulable
I0125 08:46:09.321940       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2b
I0125 08:46:09.321968       1 scale_up.go:472] Estimated 11 nodes needed in MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2b
I0125 08:46:10.115286       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2b 1->10 (max: 10)}]
I0125 08:46:10.115315       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2b size to 10
W0125 08:46:25.331226       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-2rrlc" has no providerID
W0125 08:46:25.331257       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-5l45r" has no providerID
W0125 08:46:25.331263       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-qmdpk" has no providerID
W0125 08:46:25.331267       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-k5dxk" has no providerID
W0125 08:46:25.331273       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-vg4dc" has no providerID
W0125 08:46:25.331278       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-j8nzk" has no providerID
W0125 08:46:25.331282       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-77r74" has no providerID
I0125 08:46:27.730758       1 static_autoscaler.go:334] 2 unregistered nodes present
I0125 08:46:29.535165       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-lvhvn is unschedulable
I0125 08:46:29.535185       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-mrg6h is unschedulable
I0125 08:46:29.535191       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-k2t4x is unschedulable
I0125 08:46:29.535197       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-w24gt is unschedulable
I0125 08:46:29.535213       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-bngbx is unschedulable
I0125 08:46:29.535219       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-8j5m8 is unschedulable
I0125 08:46:29.535225       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-zmqt7 is unschedulable
I0125 08:46:29.535231       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-bkx9w is unschedulable
I0125 08:46:29.535236       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-vbb49 is unschedulable
I0125 08:46:29.535242       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-k9d2h is unschedulable
I0125 08:46:31.935526       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2a
I0125 08:46:31.935556       1 scale_up.go:472] Estimated 2 nodes needed in MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2a
I0125 08:46:32.733145       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2a 1->3 (max: 10)}]
I0125 08:46:32.733172       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2a size to 3


Without CCM, same steps as above, it works as expected.
I0125 07:35:42.888412       1 klogx.go:86] 44 other pods are also unschedulable
I0125 07:35:42.903206       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2
I0125 07:35:42.903230       1 scale_up.go:472] Estimated 11 nodes needed in MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2
I0125 07:35:42.903525       1 scale_up.go:585] Splitting scale-up between 3 similar node groups: {MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2, MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus, MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus1}
I0125 07:35:42.903553       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2 1->5 (max: 10)} {MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus 1->5 (max: 10)} {MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus1 1->4 (max: 10)}]
I0125 07:35:42.903574       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2 size to 5
I0125 07:35:42.962970       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus size to 5
I0125 07:35:42.979979       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus1 size to 4

Comment 18 Michael McCune 2022-01-25 20:05:49 UTC
thanks Zhaohua, that's really interesting. i would not have expected the CCM to make a difference, but perhaps it is adding labels to the Node objects that are causing the autoscaler to think the node groups are not similar. is there a must-gather that you generated for that test run?

Comment 19 sunzhaohua 2022-01-26 08:28:26 UTC
must-gather for CCM enabled clusters on aws: https://file.rdu.redhat.com/~zhsun/must-gather.local.772286811227018419.zip

Comment 20 Michael McCune 2022-01-26 20:18:29 UTC
@sunzhaohua for some reason i am not able to download from that link, would it be possible to drop in another location, perhaps gdrive?

Comment 23 Michael McCune 2022-01-28 22:31:17 UTC
i have learned much more about this and i believe that the csi topology label associated with the csi host path driver is causing the issue for us.

as implemented by the csi topology enhancement[0], the csi-driver-host-path controller adds its topology label[1] to the nodes. similar to the previous patch, this label contains zone specific information and will be different when the node groups (MachineSets in our case) are in different zones.

when asking about this in the upstream community[2], i was informed that this driver is primarily used in testing, and not expected to be used in production. given that, i'm not sure if adding another exception to the autoscaler is appropriate since the autoscaler has a command line flag `--balancing-ignore-label` for these situations. i will raise this issue at the next sig autoscaling meeting.

@Zhaohua, would you mind running this test again with the autoscaler using the command line flag `--balancing-ignore-label=topology.hostpath.csi/node`? i have a feeling this is something we should expose to our users, just in case they add labels to their nodes which could differ.


[0] https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/557-csi-topology
[1] https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/pkg/hostpath/nodeserver.go#L34
[2] https://kubernetes.slack.com/archives/C09QZFCE5/p1643400155928539

Comment 24 sunzhaohua 2022-01-29 09:55:01 UTC
Michael, thank you for the detailed info. I tried again; it doesn't work.

$ oc edit deploy cluster-autoscaler-default
    spec:
      containers:
      - args:
        - --logtostderr
        - --v=1
        - --cloud-provider=clusterapi
        - --namespace=openshift-machine-api
        - --max-nodes-total=20
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10s
        - --scale-down-delay-after-delete=10s
        - --scale-down-delay-after-failure=10s
        - --scale-down-unneeded-time=10s
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.hostpath.csi/node

$ oc get machineset                                                                                                              
NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
windows                             2         2         2       2           7h36m
zhsunaz29-4pkxv-worker-centralus1   3         3         1       1           8h
zhsunaz29-4pkxv-worker-centralus2   10        10        1       1           8h
zhsunaz29-4pkxv-worker-centralus3   1         1         1       1           8h

I0129 09:50:14.228083       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus2
I0129 09:50:14.228108       1 scale_up.go:472] Estimated 11 nodes needed in MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus2
I0129 09:50:15.013538       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus2 1->10 (max: 10)}]
I0129 09:50:15.013574       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus2 size to 10
I0129 09:50:32.638305       1 static_autoscaler.go:334] 2 unregistered nodes present
I0129 09:50:34.445017       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-vgbht is unschedulable
...
I0129 09:50:36.839669       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus1
I0129 09:50:36.839710       1 scale_up.go:472] Estimated 2 nodes needed in MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus1
I0129 09:50:37.638862       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus1 1->3 (max: 10)}]
I0129 09:50:37.638902       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus1 size to 3
I0129 09:50:55.257906       1 static_autoscaler.go:334] 11 unregistered nodes present

Comment 25 Michael McCune 2022-01-29 17:32:25 UTC
thanks Zhaohua, i will keep investigating =)

Comment 26 Joel Speed 2022-02-23 11:22:00 UTC
@zhsun Am I correct in thinking that the rebase resolved this issue? I saw your comment in https://issues.redhat.com/browse/OCPCLOUD-1360

Perhaps we can work out which fix in the upstream fixed this and backport it to 4.9 and 4.8 if it is suitable for backport

Comment 27 sunzhaohua 2022-02-24 09:51:26 UTC
(In reply to Joel Speed from comment #26)
> @zhsun Am I correct in thinking that the rebase resolved this
> issue? I saw your comment in https://issues.redhat.com/browse/OCPCLOUD-1360

Yes, I think so; for the in-tree cloud provider, this issue was resolved after the rebase.
But for the out-of-tree cloud provider, this issue still exists.
Tested again in clusterversion: 4.11.0-0.nightly-2022-02-23-185405

Comment 28 Michael McCune 2022-04-22 13:08:59 UTC
just leaving an update, this still requires more investigation. i do not have a good handle on the root cause yet.

Comment 29 Joel Speed 2022-05-26 13:44:57 UTC
This needs further investigation still, we need to compare the inputs to the balance logic when running both in tree and out of tree. As out of tree is not scheduled to GA until at least 4.12/4.13, this is not high priority right now as in-tree is still working

Comment 30 sunzhaohua 2022-06-08 06:44:47 UTC
This doesn't work on Alicloud and IBMcloud either.

On Alicloud, it only works when the machinesets are in the same zone. I created 3 machineautoscalers; balancing happened only across the 2 groups that are in the same zone. If all 3 groups are in the same zone, it works as expected.
$ oc get clusterautoscaler default -o yaml                                                              
...
spec:
  balanceSimilarNodeGroups: true
  resourceLimits:
    maxNodesTotal: 20
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true
    unneededTime: 10s
$ oc get machineautoscaler 
NAME                 REF KIND     REF NAME                            MIN   MAX   AGE
machineautoscaler1   MachineSet   zhsunali-d6gzp-worker-us-east-1a    1     10    98m
machineautoscaler2   MachineSet   zhsunali-d6gzp-worker-us-east-1b    1     10    98m
machineautoscaler3   MachineSet   zhsunali-d6gzp-worker-us-east-1bb   1     10    98m
$ oc get machineset                                       
NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunali-d6gzp-worker-us-east-1a    1         1         1       1           150m
zhsunali-d6gzp-worker-us-east-1b    8         8         1       1           150m
zhsunali-d6gzp-worker-us-east-1bb   8         8         1       1           150m

I0608 05:41:32.924807       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb
I0608 05:41:32.924826       1 scale_up.go:472] Estimated 79 nodes needed in MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb
I0608 05:41:32.924835       1 scale_up.go:477] Capping size to max cluster total size (20)
I0608 05:41:33.474409       1 scale_up.go:585] Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb, MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1b}
I0608 05:41:33.875130       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb 1->8 (max: 10)} {MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1b 1->8 (max: 10)}]
I0608 05:41:33.875178       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb size to 8
I0608 05:41:34.481987       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1b size to 8

On IBMcloud, I created 3 machineautoscalers; balancing doesn't work at all, even when all machinesets are in the same zone.
$ oc get clusterautoscaler default -o yaml                                                               
...
spec:
  balanceSimilarNodeGroups: true
  resourceLimits:
    maxNodesTotal: 20
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true
    unneededTime: 10s
$  oc get machineautoscaler                                                          
NAME                 REF KIND     REF NAME                       MIN   MAX   AGE
machineautoscaler1   MachineSet   prubenda-ibm1-g6v8c-worker-1   1     10    78m
machineautoscaler2   MachineSet   prubenda-ibm1-g6v8c-worker-2   1     10    78m
machineautoscaler3   MachineSet   prubenda-ibm1-g6v8c-worker-3   1     10    78m

$ oc get machineset                                              
NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
prubenda-ibm1-g6v8c-worker-1   10        10        4       4           132m
prubenda-ibm1-g6v8c-worker-2   6         6         1       1           132m
prubenda-ibm1-g6v8c-worker-3   1         1         1       1           132m

I0607 15:55:45.556064       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-1
I0607 15:55:45.556090       1 scale_up.go:472] Estimated 25 nodes needed in MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-1
I0607 15:55:45.556098       1 scale_up.go:477] Capping size to max cluster total size (20)
I0607 15:55:46.339631       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-1 1->10 (max: 10)}]
I0607 15:55:46.339666       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-1 size to 10
...
I0607 15:56:08.169344       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-2
I0607 15:56:08.169371       1 scale_up.go:472] Estimated 16 nodes needed in MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-2
I0607 15:56:08.169380       1 scale_up.go:477] Capping size to max cluster total size (20)
I0607 15:56:08.961706       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-2 1->6 (max: 10)}]
I0607 15:56:08.961741       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-2 size to 6

Comment 31 Michael McCune 2022-06-08 19:11:33 UTC
@zhsun do you know if Alicloud or IBMcloud are adding custom topology labels to the nodes that are created?

Comment 32 sunzhaohua 2022-06-09 08:52:20 UTC
(In reply to Michael McCune from comment #31)
> @zhsun do you know if Alicloud or IBMcloud are adding custom
> topology labels to the nodes that are created?

Alicloud has topology label "topology.diskplugin.csi.alibabacloud.com/zone"
$ oc get node --show-labels
qili-ali-ldjvs-worker-us-east-1b-zmlvc   Ready    worker   153m    v1.24.0+bb9c2f1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=ecs.g6.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1b,kubernetes.io/arch=amd64,kubernetes.io/hostname=qili-ali-ldjvs-worker-us-east-1b-zmlvc,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=ecs.g6.xlarge,node.openshift.io/os_id=rhcos,topology.diskplugin.csi.alibabacloud.com/zone=us-east-1b,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1b

IBMcloud has extra labels compared to other platforms "ibm-cloud.kubernetes.io/worker-id" and "vpc-block-csi-driver-labels"
$ oc get node --show-labels 
jitli0609ibm-f5vzl-worker-3-8jh6p   Ready    worker   3h35m   v1.24.0+bb9c2f1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2-4x16,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-gb,failure-domain.beta.kubernetes.io/zone=eu-gb-3,ibm-cloud.kubernetes.io/worker-id=07a7_a91f15e9-b528-459f-92c3-646ad67bc396,kubernetes.io/arch=amd64,kubernetes.io/hostname=jitli0609ibm-f5vzl-worker-3-8jh6p,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2-4x16,node.openshift.io/os_id=rhcos,topology.kubernetes.io/region=eu-gb,topology.kubernetes.io/zone=eu-gb-3,vpc-block-csi-driver-labels=true

Comment 33 Joel Speed 2022-06-09 08:59:45 UTC
That will be the issue then: cloud-provider-specific labels which the autoscaler is unaware of. Do we need platform-specific logic to ignore labels within the CAO? I guess that's the only way we can make this work reliably.

Comment 34 Michael McCune 2022-06-09 16:57:19 UTC
thanks Zhaohua, i agree with Joel we might need to add a patch to the autoscaler to ignore these labels.

given that clusterapi is meant to work on many platforms, we could propose an addition to the upstream to add these labels. see https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/clusterapi_nodegroups.go#L26

what concerns me is that while this might fix the new Alicloud and IBMcloud issues, we still have the previous error on AWS (where we should already be accounting for all the labels). i wonder if there are other labels being added by the CCM on AWS that we are missing?

Comment 35 Michael McCune 2022-06-09 17:45:41 UTC
i'm not sure if this would be helpful for testing, but i've created a branch with a patch to fix the labels for Alibaba and IBM. see https://github.com/elmiko/kubernetes-autoscaler/tree/bz2001027

Comment 36 Joel Speed 2022-07-18 15:24:04 UTC
This is going to need a little bit more investigation to nail down exactly which labels are causing these issues. We will try to schedule time for this, though it's likely this won't be prioritised this sprint. As this is only an issue currently with CCMs, this isn't urgent right now

Comment 38 Michael McCune 2022-08-18 19:38:00 UTC
PR proposed to upstream, https://github.com/kubernetes/autoscaler/pull/5110

i'm not 100% sure that this will completely solve the problem, but it's a step in the right direction. i will cherry-pick this PR back to our fork once it is merged.

Comment 39 Michael McCune 2022-08-19 14:24:41 UTC
just leaving some thoughts here: i'm doing more deep diving into this issue, and i'm wondering whether, since new nodes coming up in a ccm environment carry the "uninitialized" taint until the ccm removes it, the autoscaler might be detecting a difference between the type of nodes that will be created and the type of nodes that already exist in the node group.

i don't think this is the root cause, but another path of investigation.

Comment 40 Michael McCune 2022-08-19 18:06:50 UTC
after looking through the code, i am less convinced about my previous theory. i am continuing to explore...

Comment 41 Michael McCune 2022-08-24 19:48:36 UTC
i've done some more testing here and i am able to reproduce the problem on AWS with the CCMs enabled. i am adding some debug information to the autoscaler in hopes that i can see why it thinks these node groups are not similar.

Comment 42 Michael McCune 2022-08-25 20:46:40 UTC
i have tried running this several times and for me it appears that the "topology.hostpath.csi/node" label is causing the issues.

i created a special branch of the autoscaler that will print which parts of the balance algorithm are failing.

test 1

1. run autoscaler with "./cluster-autoscaler-amd64 --cloud-provider=clusterapi --v=4 --balance-similar-node-groups"
2. create workload from this definition:
```
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 6
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      containers:
      - name: sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            cpu: 3
        command:
          - sleep
          - "3600"
```
3. examine results

$  oc get machinesets -n openshift-machine-api
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-dmwpvgb-76ef8-84wjg-worker-us-east-1b   6         6         1       1           81m
ci-ln-dmwpvgb-76ef8-84wjg-worker-us-east-1d   2         2         2       2           81m

logs show this:
I0825 16:29:46.231219  900054 compare_nodegroups.go:89] COMPARATOR -- labels not matching: [topology.hostpath.csi/node] [ip-10-0-193-36.ec2.internal ip-10-0-176-210.ec2.internal]                                        
I0825 16:29:46.231243  900054 compare_nodegroups.go:168] COMPARATOR -- labels do not match                                                                                                                                


test 2

1. run autoscaler with "./cluster-autoscaler-amd64 --cloud-provider=clusterapi --v=4 --balance-similar-node-groups --balancing-ignore-label=topology.hostpath.csi/node"
2. create same workload as test 1
3. examine results

$  oc get machinesets -n openshift-machine-api
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-dmwpvgb-76ef8-84wjg-worker-us-east-1b   4         4         1       1           92m
ci-ln-dmwpvgb-76ef8-84wjg-worker-us-east-1d   5         5         2       2           92m

logs show no messages of failed comparison.

@zhsun i'm not sure why this test failed for you before; i think you were running on AWS as well. i'm fairly confident that the hostpath label is causing the issue here, but i'm not sure about the best solution. the "topology.hostpath.csi/node" label comes from the hostpath CSI driver, which is a non-production storage option. we have a few options:

1. propose a change in the upstream to the cluster-api nodegroupset processor to exclude hostpath labels. upstream probably won't mind if we change our own processor for this, but should we include the labels for a testing driver?
2. propose a change to the cluster-autoscaler-operator to add the "--balancing-ignore-label=topology.hostpath.csi/node" to our deployment of the autoscaler. this would be a relatively quick change that we could make, but shares the same issue about adding the testing driver.

Comment 43 Michael McCune 2022-08-25 21:07:05 UTC
i've added a PR to solve the IBM and Alibaba issues. we will need to figure out what to do about the hostpath label.

another option i thought of is that we could expose the "--balancing-ignore-label" through our ClusterAutoscaler resource and then the CI clusters could use modified manifests to ensure we aren't processing the hostpath csi label. i think this is probably the best option.

Comment 45 sunzhaohua 2022-08-26 16:04:27 UTC
@mimccune Sorry for the confusion; before, I tested this on azure using the flag `--balancing-ignore-label=topology.hostpath.csi/node`, and it didn't work.
Now the label is "topology.csidriver.csi/node". I tested again on aws, gcp and azure with the flag "--balancing-ignore-label=topology.csidriver.csi/node"; it only works on aws.
Will test alicloud and ibmcloud next week.

Clusterversion: 4.12.0-0.nightly-2022-08-24-053339
1. create clusterautoscaler
2. create 3 machineautoscaler
3. $ oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
4. $ oc scale deployment cluster-autoscaler-operator  --replicas=0
5. $ oc edit deploy cluster-autoscaler-default
      - args:
        - --logtostderr
        - --cloud-provider=clusterapi
        - --namespace=openshift-machine-api
        - --leader-elect-lease-duration=137s
        - --leader-elect-renew-deadline=107s
        - --leader-elect-retry-period=26s
        - --max-nodes-total=20
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10s
        - --scale-down-delay-after-delete=10s
        - --scale-down-delay-after-failure=10s
        - --scale-down-unneeded-time=10s
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.csidriver.csi/node
        - --v=1
6. Create workload.

aws:
$ oc get node --show-labels
ip-10-0-197-214.us-east-2.compute.internal   Ready    worker                 50m   v1.24.0+ed93380   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m6i.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-197-214.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m6i.xlarge,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=ip-10-0-197-214.us-east-2.compute.internal,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c      
$ oc get machinesets.machine                                                                  
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws8262-gzkgz-worker-us-east-2a   6         6         1       1           69m
zhsunaws8262-gzkgz-worker-us-east-2b   5         5         1       1           69m
zhsunaws8262-gzkgz-worker-us-east-2c   6         6         1       1           69m

gcp:
$ oc get node --show-labels
zhsungcp826-rrq2m-worker-c-cpkl7.c.openshift-qe.internal   Ready    worker                 49m   v1.24.0+ed93380   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n2-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsungcp826-rrq2m-worker-c-cpkl7.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=n2-standard-4,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=zhsungcp826-rrq2m-worker-c-cpkl7.c.openshift-qe.internal,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c
$ oc get machinesets.machine                                                   
NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp-412-6xfr6-worker-a   6         6         1       1           71m
zhsungcp-412-6xfr6-worker-b   10        10        1       1           71m
zhsungcp-412-6xfr6-worker-c   1         1         1       1           71m

azure:
$ oc get node --show-labels
zhsunazure-412-mssd4-worker-eastus3-b52j5   Ready    worker                 11m   v1.24.0+ed93380   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eastus,failure-domain.beta.kubernetes.io/zone=eastus-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure-412-mssd4-worker-eastus3-b52j5,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=zhsunazure-412-mssd4-worker-eastus3-b52j5,topology.disk.csi.azure.com/zone=eastus-3,topology.kubernetes.io/region=eastus,topology.kubernetes.io/zone=eastus-3 
$ oc get machinesets.machine                               
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunazure-412-mssd4-worker-eastus1   6         6         1       1           56m
zhsunazure-412-mssd4-worker-eastus2   10        10        1       1           56m
zhsunazure-412-mssd4-worker-eastus3   1         1         1       1           56m

Comment 46 sunzhaohua 2022-08-29 10:06:31 UTC
@mimccune Ignore Comment 45; I summarized the results for the different platforms, PTAL. We need to ignore the label "ibm-cloud.kubernetes.io/vpc-instance-id" on IBMCloud, "topology.gke.io/zone" on GCP, and "topology.disk.csi.azure.com/zone" on Azure. The same label "topology.csidriver.csi/node" appears on aws/gcp/azure if CCM is enabled.

1. Alicloud
Verified on Alicloud
 $ oc get machineset                                                                                 
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunali829-9hvkw-worker-us-east-1a    5         5         2       2           68m
zhsunali829-9hvkw-worker-us-east-1b    6         6         1       1           68m
zhsunali829-9hvkw-worker-us-east-1bb   6         6         1       1           11m
------------------------
2. IBMCloud
Failed on IBMCloud; there is a new label "ibm-cloud.kubernetes.io/vpc-instance-id" compared to before. If we ignore this label, it works well.
$ oc get node --show-labels
zhsunibm829-zp6sh-worker-3-llr7t   Ready      worker                 114s    v1.24.0+a097e26   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2-4x16,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-gb,failure-domain.beta.kubernetes.io/zone=eu-gb-3,ibm-cloud.kubernetes.io/vpc-instance-id=07a7_9a6c1a0a-1435-40cb-a4af-f6f721b74863,ibm-cloud.kubernetes.io/worker-id=07a7_9a6c1a0a-1435-40cb-a4af-f6f721b74863,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunibm829-zp6sh-worker-3-llr7t,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2-4x16,node.openshift.io/os_id=rhcos,topology.kubernetes.io/region=eu-gb,topology.kubernetes.io/zone=eu-gb-3,vpc-block-csi-driver-labels=true

Didn't ignore label "ibm-cloud.kubernetes.io/vpc-instance-id" 
$ oc get machineset                                              
NAME                         DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunibm829-zp6sh-worker-1   6         6         1       1           90m
zhsunibm829-zp6sh-worker-2   1         1         1       1           90m
zhsunibm829-zp6sh-worker-3   10        10        1       1           90m

Ignore label "ibm-cloud.kubernetes.io/vpc-instance-id" 
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=ibm-cloud.kubernetes.io/vpc-instance-id
$ oc get machineset                         
NAME                         DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunibm829-zp6sh-worker-1   6         6         1       1           110m
zhsunibm829-zp6sh-worker-2   5         5         1       1           110m
zhsunibm829-zp6sh-worker-3   6         6         3       3           110m
------------------------
3. AWS
3.1 AWS without CCM works well
4.12.0-0.nightly-2022-08-27-164831
$ oc get node --show-labels
ip-10-0-201-211.us-east-2.compute.internal   Ready    worker                 12m     v1.24.0+a097e26   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m6i.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-201-211.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m6i.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

$ oc get machineset                                                                                                                         
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws829-zm7h7-worker-us-east-2a   6         6         1       1           6h53m
zhsunaws829-zm7h7-worker-us-east-2b   5         5         1       1           6h53m
zhsunaws829-zm7h7-worker-us-east-2c   6         6         1       1           6h53m

3.2 Failed on AWS with CCM enabled; by default it doesn't work, but with "--balancing-ignore-label=topology.csidriver.csi/node" it works well.
$ oc get node --show-labels
ip-10-0-197-214.us-east-2.compute.internal   Ready    worker                 50m   v1.24.0+ed93380   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m6i.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-197-214.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m6i.xlarge,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=ip-10-0-197-214.us-east-2.compute.internal,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c   
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.csidriver.csi/node
$ oc get machinesets.machine                                                                  
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws8262-gzkgz-worker-us-east-2a   6         6         1       1           69m
zhsunaws8262-gzkgz-worker-us-east-2b   5         5         1       1           69m
zhsunaws8262-gzkgz-worker-us-east-2c   6         6         1       1           69m
------------------------
4. GCP
4.1 Failed on GCP without CCM; by default it doesn't work, but with "--balancing-ignore-label=topology.gke.io/zone" it works well.
$ oc get node --show-labels
evakhoni-23262-ln6dj-worker-c-xgfq5.c.openshift-qe.internal   Ready    worker                 115m   v1.24.0+c83b5d0   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n2-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=evakhoni-23262-ln6dj-worker-c-xgfq5.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=n2-standard-4,node.openshift.io/os_id=rhcos,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.gke.io/zone
$ oc get machineset                                                                                             
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
shudi-412gcpkd99-ffnwr-worker-a   6         6         1       1           7h26m
shudi-412gcpkd99-ffnwr-worker-b   5         5         1       1           7h26m
shudi-412gcpkd99-ffnwr-worker-c   6         6         1       1           7h26m
shudi-412gcpkd99-ffnwr-worker-f   0         0                             7h26m

4.2 Failed on GCP with CCM enabled; by default it doesn't work, but with both "--balancing-ignore-label=topology.gke.io/zone" and "--balancing-ignore-label=topology.csidriver.csi/node" it works well.
$ oc get node --show-labels
zhsungcp-412-829-td9hs-worker-c-74rg2.c.openshift-qe.internal   Ready    worker                 80m   v1.24.0+a097e26   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n2-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsungcp-412-829-td9hs-worker-c-74rg2.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=n2-standard-4,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=zhsungcp-412-829-td9hs-worker-c-74rg2.c.openshift-qe.internal,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.csidriver.csi/node
        - --balancing-ignore-label=topology.gke.io/zone
$ oc get machinesets.machine                                                                                                                
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp-412-829-td9hs-worker-a   5         5         1       1           95m
zhsungcp-412-829-td9hs-worker-b   6         6         1       1           95m
zhsungcp-412-829-td9hs-worker-c   6         6         1       1           95m
zhsungcp-412-829-td9hs-worker-f   0         0                             95m
-------------
5. Azure
5.1 Failed on Azure without CCM; by default it doesn't work, but with "--balancing-ignore-label=topology.disk.csi.azure.com/zone" it works well.
$ oc get node --show-labels
mihuangazf0829-rtc72-worker-eastus3-t4zcb   Ready      worker                 5h43m   v1.24.0+a097e26   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eastus,failure-domain.beta.kubernetes.io/zone=eastus-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=mihuangazf0829-rtc72-worker-eastus3-t4zcb,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=eastus-3,topology.kubernetes.io/region=eastus,topology.kubernetes.io/zone=eastus-3
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.disk.csi.azure.com/zone
 $ oc get machineset                                                                                            
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
mihuangazf0829-rtc72-worker-eastus1   6         6         1       1           6h12m
mihuangazf0829-rtc72-worker-eastus2   6         6         1       1           6h12m
mihuangazf0829-rtc72-worker-eastus3   5         5         1       1           6h12m

5.2 Failed on Azure with CCM enabled; by default it doesn't work, but with both "--balancing-ignore-label=topology.disk.csi.azure.com/zone" and "--balancing-ignore-label=topology.csidriver.csi/node" it works well.
$ oc get node --show-labels
zhsunazure826-rgcjq-worker-eastus2-nl9t7   Ready    worker                 41m   v1.24.0+ed93380   
beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eastus,failure-domain.beta.kubernetes.io/zone=eastus-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure826-rgcjq-worker-eastus2-nl9t7,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=zhsunazure826-rgcjq-worker-eastus2-nl9t7,topology.disk.csi.azure.com/zone=eastus-2,topology.kubernetes.io/region=eastus,topology.kubernetes.io/zone=eastus-2
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.csidriver.csi/node
        - --balancing-ignore-label=topology.disk.csi.azure.com/zone
$ oc get machinesets.machine                                                                                                 
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunazure-412-4gf2c-worker-eastus1   6         6         1       1           47m
zhsunazure-412-4gf2c-worker-eastus2   6         6         1       1           47m
zhsunazure-412-4gf2c-worker-eastus3   5         5         1       1           47m

Comment 47 Michael McCune 2022-08-29 19:18:19 UTC
thanks @zhsun , i will create a new patch upstream that adds these labels to the ignore list as well:

"ibm-cloud.kubernetes.io/vpc-instance-id"
"topology.gke.io/zone"
"topology.disk.csi.azure.com/zone"
"topology.csidriver.csi/node"

thanks again for taking the time to enumerate them for me =)
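
in the interim, the labels can be excluded manually by editing the cluster-autoscaler-default deployment, the same way as in the tests above. a sketch covering all four labels (illustrative only; a given platform only needs the labels actually present on its nodes):

$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=ibm-cloud.kubernetes.io/vpc-instance-id
        - --balancing-ignore-label=topology.gke.io/zone
        - --balancing-ignore-label=topology.disk.csi.azure.com/zone
        - --balancing-ignore-label=topology.csidriver.csi/node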

Comment 48 Michael McCune 2022-08-30 13:59:07 UTC
i have created a patch for upstream, https://github.com/kubernetes/autoscaler/pull/5148; once it has merged i will cherry-pick it into our autoscaler.

this PR will cover everything except the "topology.csidriver.csi/node" label. after some discussions i have learned that this label is used exclusively by our shared storage driver, and that its use might not be proper. there is a PR open to remove this label from 4.11 and future releases; see https://github.com/openshift/csi-driver-shared-resource/pull/111

@zhsun , we will have to manually exclude the "topology.csidriver.csi/node" label for the time being. we also have this issue, https://issues.redhat.com/browse/OCPCLOUD-1427, which should make it easier to use the ignore labels in the future.

i will update again here once i am able to cherry-pick the change from upstream.

Comment 49 Michael McCune 2022-09-12 17:43:18 UTC
i have some new information which will make this bug more complicated for us to fix, apologies in advance ;)

we had a meeting with the cluster-api community today (see https://www.youtube.com/watch?v=jbhca_9oPuQ) to talk about the balancing feature. the community would prefer that we not encode these ignored labels into the autoscaler, and instead provide good documentation and deployment artifacts to help users know when to add labels to the ignore list.

what this means for our bug is that we will first need to finish the work to expose the balancing ignore labels through the ClusterAutoscaler CRD (see https://issues.redhat.com/browse/OCPCLOUD-1427), then update the CAO to deploy the proper labels when the autoscaler is deployed, and then revisit this bug and test to ensure the issue is fixed.

this might take some time to get all the pieces in place, but i will update this bug as we make progress.

Comment 50 Michael McCune 2022-09-14 17:43:59 UTC
i've created a couple jira cards to track the work associated with this bug:

https://issues.redhat.com/browse/OCPCLOUD-1669
https://issues.redhat.com/browse/OCPCLOUD-1670

OCPCLOUD-1670 will be required to make this work on OpenShift

Comment 51 sunzhaohua 2022-10-10 14:49:53 UTC
balancingIgnoredLabels works well with PR https://github.com/openshift/cluster-autoscaler-operator/pull/251 (part of https://issues.redhat.com/browse/OCPCLOUD-1427).
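
For reference, with that PR the ignored labels can be set through the ClusterAutoscaler resource instead of editing the deployment. A sketch of the new field, reusing the GCP labels from the earlier test:

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  balanceSimilarNodeGroups: true
  balancingIgnoredLabels:
  - topology.gke.io/zone
  - topology.csidriver.csi/node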

Found another issue: if there are machinesets that scale up from 0, the autoscaler will first balance across those machinesets; only after they are full does it balance the other node groups.

For example, in testing on GCP:
$ oc get machineautoscaler                                                                                                                                                     
NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     10    3m41s
machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   1     10    3m55s
machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     10    4m15s
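
machineautoscaler-f above is the scale-from-zero group; reconstructed from the table (the exact manifest was not captured), it looks roughly like this:

apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: machineautoscaler-f
  namespace: openshift-machine-api
spec:
  minReplicas: 0
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsungcp10-lmfbm-worker-f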

Add workload:
Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 10
Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-a}
I1010 07:46:11.862566       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 1->3 (max: 10)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-a 1->3 (max: 10)}]

--------
$ oc get machineautoscaler                                                                                          
NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     10    22m
machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   0     9     22m
machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     9     22m

Add workload:
 Capping size to max cluster total size (20)
I1010 08:32:19.557095       1 scale_up.go:591] Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f}
I1010 08:32:19.557132       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 0->8 (max: 9)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f 0->7 (max: 9)}]
I1010 08:32:19.557149       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b size to 8
I1010 08:32:20.161422       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 7

$ oc get machineset                                                                   
NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp10-lmfbm-worker-a   1         1         1       1           123m
zhsungcp10-lmfbm-worker-b   8         8                             123m
zhsungcp10-lmfbm-worker-c   1         1         1       1           123m
zhsungcp10-lmfbm-worker-f   7         7                             123m

------------
$ oc get machineset                                                                                                      
NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp10-lmfbm-worker-a   1         1         1       1           128m
zhsungcp10-lmfbm-worker-b   0         0                             128m
zhsungcp10-lmfbm-worker-c   1         1         1       1           128m
zhsungcp10-lmfbm-worker-f   0         0                             128m

$ oc get machineautoscaler                                                                                                 
NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     20    39m
machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   0     19    39m
machineautoscaler-c   MachineSet   zhsungcp10-lmfbm-worker-c   1     20    13s
machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     19    39m

Add workload:
I1010 08:39:48.865639       1 scale_up.go:481] Estimated 26 nodes needed in MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b
I1010 08:39:48.865645       1 scale_up.go:486] Capping size to max cluster total size (30)
I1010 08:39:49.605437       1 scale_up.go:591] Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f}
I1010 08:39:49.605472       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 0->13 (max: 19)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f 0->12 (max: 19)}]
I1010 08:39:49.605492       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b size to 13
I1010 08:39:50.209449       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 12

$ oc get machineset                                                                                        
NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp10-lmfbm-worker-a   1         1         1       1           130m
zhsungcp10-lmfbm-worker-b   13        13                            130m
zhsungcp10-lmfbm-worker-c   1         1         1       1           130m
zhsungcp10-lmfbm-worker-f   12        12                            130m

Comment 52 Michael McCune 2022-10-11 21:23:30 UTC
thanks for the update Zhaohua, i am in awe of your ability to find new bugs with this issue XD

i talked with the team today and i'm thinking that maybe we should open a new bug for the scale-from-zero balancing issue. my concern is that we are finding so many errors in this balancing bug that we are overloading the content here. my hope is that by scoping a bug more narrowly to the scale-from-zero activity, we can make these issues more discoverable in the future.

what do you think about completing the original bug here and then opening a new one?

Comment 53 sunzhaohua 2022-10-12 08:00:23 UTC
I agree, Michael. I am closing this one and have opened a new one to track the scale-from-zero balancing issue: https://issues.redhat.com/browse/OCPBUGS-2257

Comment 56 errata-xmlrpc 2023-01-17 19:46:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399