Description of problem:

> $ oc get clusterautoscaler default -o json | jq '.spec'
> {
>   "balanceSimilarNodeGroups": true,
>   "podPriorityThreshold": -10,
>   "resourceLimits": {
>     "cores": {
>       "max": 1024,
>       "min": 8
>     },
>     "maxNodesTotal": 50,
>     "memory": {
>       "max": 8196,
>       "min": 4
>     }
>   },
>   "scaleDown": {
>     "delayAfterAdd": "5m",
>     "delayAfterDelete": "3m",
>     "delayAfterFailure": "30s",
>     "enabled": true,
>     "unneededTime": "60s"
>   }
> }

The ClusterAutoscaler is configured as above with `balanceSimilarNodeGroups` set to `true`. In addition, there are two MachineAutoscaler objects, called A and B, representing Availability Zone A and B. The respective `MachineSet` for A and B each have 3 OpenShift Container Platform Node(s) running.

When scaling a deployment in a way that two additional OpenShift Container Platform Node(s) are required, we can see that the ClusterAutoscaler adds 2 OpenShift Container Platform Node(s) to `MachineSet` A. When triggering the next scale-up, again requiring 2 additional OpenShift Container Platform Node(s), we can see that these two OpenShift Container Platform Node(s) are added to `MachineSet` B to balance the node groups again (so this is probably working as expected with `balanceSimilarNodeGroups` set to `true`, as the balancing decision seems to be made when scaling is required). The above behaviour can be validated as described.

When scaling a deployment in a way that it requires 16 additional OpenShift Container Platform Node(s), it behaves differently and in a way that is not clearly transparent. The OpenShift Container Platform Node(s) are not just scaled in `MachineSet` A or B, but are actually spread across the two, although rather unevenly.

> $ oc get machine -n openshift-machine-api
> NAME                                           PHASE         TYPE         REGION      ZONE         AGE
> cluster1234567-bkk47-master-0                  Running       m5.xlarge    us-west-1   us-west-1a   8d
> cluster1234567-bkk47-master-1                  Running       m5.xlarge    us-west-1   us-west-1c   8d
> cluster1234567-bkk47-master-2                  Running       m5.xlarge    us-west-1   us-west-1a   8d
> cluster1234567-bkk47-worker-us-west-1a-22vdg   Provisioned   m5.4xlarge   us-west-1   us-west-1a   92s
> cluster1234567-bkk47-worker-us-west-1a-czb5d   Running       m5.4xlarge   us-west-1   us-west-1a   4d8h
> cluster1234567-bkk47-worker-us-west-1a-kxz7w   Provisioned   m5.4xlarge   us-west-1   us-west-1a   92s
> cluster1234567-bkk47-worker-us-west-1a-pl49d   Provisioned   m5.4xlarge   us-west-1   us-west-1a   92s
> cluster1234567-bkk47-worker-us-west-1a-qcksj   Running       m5.4xlarge   us-west-1   us-west-1a   5h20m
> cluster1234567-bkk47-worker-us-west-1a-r7gd5   Running       m5.4xlarge   us-west-1   us-west-1a   2d1h
> cluster1234567-bkk47-worker-us-west-1a-w9l9w   Provisioned   m5.4xlarge   us-west-1   us-west-1a   92s
> cluster1234567-bkk47-worker-us-west-1c-278wl   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-767rz   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-7fkvs   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-9cnfd   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-fl9g6   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-hjpj6   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-jf9kg   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-pr2tf   Running       m5.4xlarge   us-west-1   us-west-1c   2d2h
> cluster1234567-bkk47-worker-us-west-1c-qhgbd   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-sqxd5   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-svv57   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-twfpp   Running       m5.4xlarge   us-west-1   us-west-1c   2d
> cluster1234567-bkk47-worker-us-west-1c-v6sdk   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-z57dk   Provisioned   m5.4xlarge   us-west-1   us-west-1c   110s
> cluster1234567-bkk47-worker-us-west-1c-zqs6v   Running       m5.4xlarge   us-west-1   us-west-1c   4d8h

The above shows how the OpenShift Container Platform Node(s) are distributed among the two `MachineSet`: 4 Nodes in A and 12 Nodes in B, which appears rather uneven. Thus we are wondering why the distribution is that uneven and whether there is a way to scale large chunks more evenly across the available `MachineSet`, especially when `balanceSimilarNodeGroups` is set to `true`.

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.8.5

How reproducible:
- Always

Steps to Reproduce:
1. Install OpenShift Container Platform 4.8.5 in `us-west-1` using IPI
2. Configure the ClusterAutoscaler with `balanceSimilarNodeGroups` set to `true`
3. Use the below Deployment to scale pods according to the scenario

> $ oc describe deployment strings
> Name:                   strings
> Namespace:              project-10
> CreationTimestamp:      Mon, 30 Aug 2021 09:37:57 +0200
> Labels:                 app=random
>                         component=strings
> Annotations:            deployment.kubernetes.io/revision: 3
> Selector:               app=random,component=strings
> Replicas:               0 desired | 0 updated | 0 total | 0 available | 0 unavailable
> StrategyType:           RollingUpdate
> MinReadySeconds:        0
> RollingUpdateStrategy:  25% max unavailable, 25% max surge
> Pod Template:
>   Labels:  app=random
>            component=strings
>   Containers:
>    strings:
>     Image:        quay.io/rhn_support_sreber/random@sha256:46da5bbc9d994036f98565ab8a3165d7fd9fd4fd2710751985d831c02c8f782a
>     Port:         8080/TCP
>     Host Port:    0/TCP
>     Limits:
>       cpu:     4
>       memory:  4Gi
>     Requests:
>       cpu:     4
>       memory:  4Gi
>     Environment:  <none>
>     Mounts:       <none>
>   Volumes:        <none>
> Conditions:
>   Type           Status  Reason
>   ----           ------  ------
>   Available      True    MinimumReplicasAvailable
>   Progressing    True    NewReplicaSetAvailable
> OldReplicaSets:  <none>
> NewReplicaSet:   strings-7bcbff8f67 (0/0 replicas created)
> Events:
>   Type    Reason             Age   From                   Message
>   ----    ------             ----  ----                   -------
>   Normal  ScalingReplicaSet  94m   deployment-controller  Scaled down replica set strings-7bcbff8f67 to 0

Actual results:
When doing a small scale-up (requiring a small number of OpenShift Container Platform Node(s)), the scale-up is always done in one `MachineSet`, and the following scale-up then brings the node groups back into balance. With a large scale-up, however, the Nodes are distributed unevenly between the available `MachineSet`.

Expected results:
With `balanceSimilarNodeGroups` set to `true`, one would expect that scaling is always balanced as evenly as possible, especially when a large number of OpenShift Container Platform Node(s) is required.

Additional info:
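For reference, a minimal sketch of the two MachineAutoscaler objects described above. The MachineSet names are inferred from the machine names in the listing, and the min/max replica values are purely illustrative, not taken from the affected cluster:

```
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: machineautoscaler-zone-a   # hypothetical name
  namespace: openshift-machine-api
spec:
  minReplicas: 3                   # illustrative values
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: cluster1234567-bkk47-worker-us-west-1a
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: machineautoscaler-zone-b   # hypothetical name
  namespace: openshift-machine-api
spec:
  minReplicas: 3
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: cluster1234567-bkk47-worker-us-west-1c
```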
assigning this to myself as Simon and i discussed this in chat.
just wanted to leave a comment, i am able to reproduce this and am now digging deeper to understand what is happening.
@Simon, i've gotten a chance to dig in deeper and it is looking like the nodes will only balance on the same pass if the AWS zones are the same on the machinesets. I am not sure if this behavior is the same on other infrastructure providers but it seems consistent on AWS. my recommendation in the short term is for users who need this balancing on single passes to use machinesets in the same zone. i realize this is an imposition for users who wish to have their workloads spread across multiple zones but it seems like a limitation we have to live with currently. in the longer term, i plan to investigate the issue further in the autoscaler source code and with the kubernetes autoscaling SIG. i have a feeling this is a bug that we could overcome in our implementation of the autoscaler infrastructure provider (clusterapi), but i want to make sure this isn't intended first. i will leave this bug open until i can determine the proper fix.
(In reply to Michael McCune from comment #5)
> @Simon, i've gotten a chance to dig in deeper and it is looking like the
> nodes will only balance on the same pass if the AWS zones are the same on
> the machinesets. I am not sure if this behavior is the same on other
> infrastructure providers but it seems consistent on AWS.

Do you know how OpenShift Container Platform Node(s) are being distributed across Availability Zones when doing a large scale-up? Meaning, how does it decide where and how many OpenShift Container Platform Node(s) to bring up if we are deploying something that will require 20 additional OpenShift Container Platform Node(s)? As we have seen, a small scale-up will always be done in one Availability Zone, which is explained now (the why). But with a large number of OpenShift Container Platform Node(s) required, it will still create Nodes in multiple Availability Zones (just not evenly distributed).
(In reply to Simon Reber from comment #6)
> Do you know how OpenShift Container Platform Node(s) are being distributed
> across Availability Zones when doing a large scale-up? Meaning, how does it
> decide where and how many OpenShift Container Platform Node(s) to bring up
> if we are deploying something that will require 20 additional OpenShift
> Container Platform Node(s)?

in these cases, the autoscaler uses an expander[0] to determine which node group it should choose for scaling. by default it uses the "random" expander and we do not expose an option to change this. so essentially it will choose a random node group from the ones that could support expansion. i could definitely see value in exposing more of those expander options, but i don't think we would be able to support the "price" expander without much more work in the machine controllers.

[0] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-expanders
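for background, this is roughly what selecting a different expander looks like on the upstream cluster-autoscaler command line; a sketch only, since the cluster-autoscaler-operator does not surface this flag through the ClusterAutoscaler resource at the time of this bug:

```
# upstream cluster-autoscaler flag (not exposed by the cluster-autoscaler-operator today);
# valid values include random (the default), most-pods, least-waste, priority and price
./cluster-autoscaler --cloud-provider=clusterapi --balance-similar-node-groups=true --expander=least-waste
```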
i have been researching this issue and i believe i have a solution. after talking with the upstream community and examining the code, i think the issue here is that openshift adds some labels to machines which indicate the zone, and these labels make the autoscaler think the machines are different. the autoscaler is smart enough to ignore the default zone label added by kubernetes, but not the openshift specific label. there is an option to the autoscaler which allows it to ignore specific labels; i think if we add the openshift labels to this then we could get the default functionality back. i am running some tests on this today, hopefully it will provide fruitful results.
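for reference, the option referred to here is the autoscaler's `--balancing-ignore-label` command line flag (it comes up again later in this bug). a rough sketch of how it would be passed; the label name below is a placeholder, since the exact openshift-specific label had not been identified at this point:

```
# sketch only: "example.openshift.io/zone" is a placeholder for whichever
# openshift-specific zone label turns out to differ between the machinesets
./cluster-autoscaler --balance-similar-node-groups=true \
  --balancing-ignore-label=example.openshift.io/zone
```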
well, it turns out i was wrong about which label was causing the issue. it turns out that /something/ (i have not determined what yet) is applying the `topology.ebs.csi.aws.com/zone` label to nodes. this label carries the availability zone as its value, which in turn causes the autoscaler to consider the node groups as different. this is related to the AWS CSI driver, and you can see it mentioned here[0]. i would like to do a little more research to determine if other CSI drivers have similar labels. the patch i have up currently will not fix this issue, but once i have done some research we should be able to have a fix that will work.

[0] https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/34f6146bc0353f01442739c6a019379b164bcb17/docs/README.md#features-1
i have talked with the upstream sig autoscaling community and it seems like the best fix for this will be to add a custom nodegroupset processor to the autoscaler. i think this is a good way to approach the solution as it will provide benefit for the upstream cluster-api users as well. i have closed the previous patch and am working on an upstream solution now. once we have the PR merged into the upstream autoscaler, i will cherry pick it back into our fork.
i have created a pull request[0] in the upstream that will address this balancing issue. once it has merged there we will pick it up in the next rebase we do for the autoscaler, so this will hit the 4.10 release. i will update here as it progresses. [0] https://github.com/kubernetes/autoscaler/pull/4458
The upstream PR has merged now, so this will be brought in by the upstream rebase which is currently in progress
The rebase has been merged and is in the latest nightly, the fix for this should be there as well
This doesn't work on CCM enabled clusters. I tested on aws and azure clusters with CCM enabled; balanceSimilarNodeGroups doesn't work as expected, it couldn't split the scale-up between similar node groups. Without CCM, it works as expected.

With CCM:

1. create clusterautoscaler
---------
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  resourceLimits:
    maxNodesTotal: 20
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

2. create 3 machineautoscaler
$ oc get machineautoscaler
NAME                 REF KIND     REF NAME                              MIN   MAX   AGE
machineautoscaler1   MachineSet   zhsunaws252-krx2n-worker-us-east-2a   1     10    11m
machineautoscaler2   MachineSet   zhsunaws252-krx2n-worker-us-east-2b   1     10    10m
machineautoscaler3   MachineSet   zhsunaws252-krx2n-worker-us-east-2c   1     10    9m54s

3. create workload

4. check autoscaler logs and machineset
$ oc get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws252-krx2n-worker-us-east-2a   3         3         3       3           65m
zhsunaws252-krx2n-worker-us-east-2b   10        10        10      10          65m
zhsunaws252-krx2n-worker-us-east-2c   1         1         1       1           65m

I0125 08:46:06.919232       1 klogx.go:86] 44 other pods are also unschedulable
I0125 08:46:09.321940       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2b
I0125 08:46:09.321968       1 scale_up.go:472] Estimated 11 nodes needed in MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2b
I0125 08:46:10.115286       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2b 1->10 (max: 10)}]
I0125 08:46:10.115315       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2b size to 10
W0125 08:46:25.331226       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-2rrlc" has no providerID
W0125 08:46:25.331257       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-5l45r" has no providerID
W0125 08:46:25.331263       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-qmdpk" has no providerID
W0125 08:46:25.331267       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-k5dxk" has no providerID
W0125 08:46:25.331273       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-vg4dc" has no providerID
W0125 08:46:25.331278       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-j8nzk" has no providerID
W0125 08:46:25.331282       1 clusterapi_controller.go:452] Machine "zhsunaws252-krx2n-worker-us-east-2b-77r74" has no providerID
I0125 08:46:27.730758       1 static_autoscaler.go:334] 2 unregistered nodes present
I0125 08:46:29.535165       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-lvhvn is unschedulable
I0125 08:46:29.535185       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-mrg6h is unschedulable
I0125 08:46:29.535191       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-k2t4x is unschedulable
I0125 08:46:29.535197       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-w24gt is unschedulable
I0125 08:46:29.535213       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-bngbx is unschedulable
I0125 08:46:29.535219       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-8j5m8 is unschedulable
I0125 08:46:29.535225       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-zmqt7 is unschedulable
I0125 08:46:29.535231       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-bkx9w is unschedulable
I0125 08:46:29.535236       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-vbb49 is unschedulable
I0125 08:46:29.535242       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-k9d2h is unschedulable
I0125 08:46:31.935526       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2a
I0125 08:46:31.935556       1 scale_up.go:472] Estimated 2 nodes needed in MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2a
I0125 08:46:32.733145       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2a 1->3 (max: 10)}]
I0125 08:46:32.733172       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaws252-krx2n-worker-us-east-2a size to 3

Without CCM, same steps as above, it works as expected.

I0125 07:35:42.888412       1 klogx.go:86] 44 other pods are also unschedulable
I0125 07:35:42.903206       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2
I0125 07:35:42.903230       1 scale_up.go:472] Estimated 11 nodes needed in MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2
I0125 07:35:42.903525       1 scale_up.go:585] Splitting scale-up between 3 similar node groups: {MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2, MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus, MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus1}
I0125 07:35:42.903553       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2 1->5 (max: 10)} {MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus 1->5 (max: 10)} {MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus1 1->4 (max: 10)}]
I0125 07:35:42.903574       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus2 size to 5
I0125 07:35:42.962970       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus size to 5
I0125 07:35:42.979979       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz251-krfxt-worker-northcentralus1 size to 4
thanks Zhaohua, that's really interesting. i would not have expected the CCM to make a difference, but perhaps it is adding labels to the Node objects that are causing the autoscaler to think the node groups are not similar. is there a must-gather that you generated for that test run?
must-gather for CCM enabled clusters on aws: https://file.rdu.redhat.com/~zhsun/must-gather.local.772286811227018419.zip
@sunzhaohua for some reason i am not able to download from that link, would it be possible to drop in another location, perhaps gdrive?
must-gather: https://drive.google.com/file/d/1astWG_KpQprTTr1rsAplCHBUo8ddZiL-/view?usp=sharing
i have learned much more about this and i believe that the csi topology label associated with the csi host path driver is causing the issue for us. as implemented by the csi topology enhancement[0], the csi-driver-host-path controller adds its topology label[1] to the nodes. similar to the previous patch, this label contains zone specific information and will be different when the node groups (MachineSets in our case) are in different zones. when asking about this in the upstream community[2], i was informed that this driver is primarily used in testing, and not expected to be used in production. given that, i'm not sure if adding another exception to the autoscaler is appropriate since the autoscaler has a command line flag `--balancing-ignore-label` for these situations. i will raise this issue at the next sig autoscaling meeting.

@Zhaohua, would you mind running this test again, but enabling the autoscaler to use the command line flag `--balancing-ignore-label=topology.hostpath.csi/node`? i have a feeling this is something we should expose to our users, just in case they add labels to their nodes which could differ.

[0] https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/557-csi-topology
[1] https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/pkg/hostpath/nodeserver.go#L34
[2] https://kubernetes.slack.com/archives/C09QZFCE5/p1643400155928539
Michael, thank you for the detailed info. I tried again, it doesn't work.

$ oc edit deploy cluster-autoscaler-default
    spec:
      containers:
      - args:
        - --logtostderr
        - --v=1
        - --cloud-provider=clusterapi
        - --namespace=openshift-machine-api
        - --max-nodes-total=20
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10s
        - --scale-down-delay-after-delete=10s
        - --scale-down-delay-after-failure=10s
        - --scale-down-unneeded-time=10s
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.hostpath.csi/node

$ oc get machineset
NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
windows                             2         2         2       2           7h36m
zhsunaz29-4pkxv-worker-centralus1   3         3         1       1           8h
zhsunaz29-4pkxv-worker-centralus2   10        10        1       1           8h
zhsunaz29-4pkxv-worker-centralus3   1         1         1       1           8h

I0129 09:50:14.228083       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus2
I0129 09:50:14.228108       1 scale_up.go:472] Estimated 11 nodes needed in MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus2
I0129 09:50:15.013538       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus2 1->10 (max: 10)}]
I0129 09:50:15.013574       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus2 size to 10
I0129 09:50:32.638305       1 static_autoscaler.go:334] 2 unregistered nodes present
I0129 09:50:34.445017       1 klogx.go:86] Pod openshift-machine-api/scale-up-5b44697b8f-vgbht is unschedulable
...
I0129 09:50:36.839669       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus1
I0129 09:50:36.839710       1 scale_up.go:472] Estimated 2 nodes needed in MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus1
I0129 09:50:37.638862       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus1 1->3 (max: 10)}]
I0129 09:50:37.638902       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunaz29-4pkxv-worker-centralus1 size to 3
I0129 09:50:55.257906       1 static_autoscaler.go:334] 11 unregistered nodes present
thanks Zhaohua, i will keep investigating =)
@zhsun Am I correct in thinking that the rebase resolved this issue? I saw your comment in https://issues.redhat.com/browse/OCPCLOUD-1360 Perhaps we can work out which fix in the upstream fixed this and backport it to 4.9 and 4.8 if it is suitable for backport
(In reply to Joel Speed from comment #26)
> @zhsun Am I correct in thinking that the rebase resolved this
> issue? I saw your comment in https://issues.redhat.com/browse/OCPCLOUD-1360

Yes, I think so. For the in-tree cloud provider, after the rebase, this issue was resolved. But for the out-of-tree cloud provider, this issue still exists. Tested again in clusterversion: 4.11.0-0.nightly-2022-02-23-185405
just leaving an update, this still requires more investigation. i do not have a good handle on the root cause yet.
This needs further investigation still; we need to compare the inputs to the balance logic when running both in-tree and out-of-tree. As out-of-tree is not scheduled to GA until at least 4.12/4.13, this is not high priority right now, since in-tree is still working.
This doesn't work on Alicloud and IBMcloud either.

On Alicloud, it only works when the machinesets are in the same zone. I created 3 machineautoscaler; balancing only happened between the 2 groups that are in the same zone. If all 3 groups are in the same zone, it works as expected.

$ oc get clusterautoscaler default -o yaml
...
spec:
  balanceSimilarNodeGroups: true
  resourceLimits:
    maxNodesTotal: 20
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true
    unneededTime: 10s

$ oc get machineautoscaler
NAME                 REF KIND     REF NAME                            MIN   MAX   AGE
machineautoscaler1   MachineSet   zhsunali-d6gzp-worker-us-east-1a    1     10    98m
machineautoscaler2   MachineSet   zhsunali-d6gzp-worker-us-east-1b    1     10    98m
machineautoscaler3   MachineSet   zhsunali-d6gzp-worker-us-east-1bb   1     10    98m

$ oc get machineset
NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunali-d6gzp-worker-us-east-1a    1         1         1       1           150m
zhsunali-d6gzp-worker-us-east-1b    8         8         1       1           150m
zhsunali-d6gzp-worker-us-east-1bb   8         8         1       1           150m

I0608 05:41:32.924807       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb
I0608 05:41:32.924826       1 scale_up.go:472] Estimated 79 nodes needed in MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb
I0608 05:41:32.924835       1 scale_up.go:477] Capping size to max cluster total size (20)
I0608 05:41:33.474409       1 scale_up.go:585] Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb, MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1b}
I0608 05:41:33.875130       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb 1->8 (max: 10)} {MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1b 1->8 (max: 10)}]
I0608 05:41:33.875178       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1bb size to 8
I0608 05:41:34.481987       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/zhsunali-d6gzp-worker-us-east-1b size to 8

On IBMcloud, I created 3 machineautoscaler; balancing doesn't work at all, even though all machinesets are in the same zone.

$ oc get clusterautoscaler default -o yaml
...
spec:
  balanceSimilarNodeGroups: true
  resourceLimits:
    maxNodesTotal: 20
  scaleDown:
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    enabled: true
    unneededTime: 10s

$ oc get machineautoscaler
NAME                 REF KIND     REF NAME                       MIN   MAX   AGE
machineautoscaler1   MachineSet   prubenda-ibm1-g6v8c-worker-1   1     10    78m
machineautoscaler2   MachineSet   prubenda-ibm1-g6v8c-worker-2   1     10    78m
machineautoscaler3   MachineSet   prubenda-ibm1-g6v8c-worker-3   1     10    78m

$ oc get machineset
NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
prubenda-ibm1-g6v8c-worker-1   10        10        4       4           132m
prubenda-ibm1-g6v8c-worker-2   6         6         1       1           132m
prubenda-ibm1-g6v8c-worker-3   1         1         1       1           132m

I0607 15:55:45.556064       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-1
I0607 15:55:45.556090       1 scale_up.go:472] Estimated 25 nodes needed in MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-1
I0607 15:55:45.556098       1 scale_up.go:477] Capping size to max cluster total size (20)
I0607 15:55:46.339631       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-1 1->10 (max: 10)}]
I0607 15:55:46.339666       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-1 size to 10
...
I0607 15:56:08.169344       1 scale_up.go:468] Best option to resize: MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-2
I0607 15:56:08.169371       1 scale_up.go:472] Estimated 16 nodes needed in MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-2
I0607 15:56:08.169380       1 scale_up.go:477] Capping size to max cluster total size (20)
I0607 15:56:08.961706       1 scale_up.go:595] Final scale-up plan: [{MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-2 1->6 (max: 10)}]
I0607 15:56:08.961741       1 scale_up.go:691] Scale-up: setting group MachineSet/openshift-machine-api/prubenda-ibm1-g6v8c-worker-2 size to 6
@zhsun do you know if Alicloud or IBMcloud are adding custom topology labels to the nodes that are created?
(In reply to Michael McCune from comment #31)
> @zhsun do you know if Alicloud or IBMcloud are adding custom
> topology labels to the nodes that are created?

Alicloud has the topology label "topology.diskplugin.csi.alibabacloud.com/zone":

$ oc get node --show-labels
qili-ali-ldjvs-worker-us-east-1b-zmlvc   Ready   worker   153m   v1.24.0+bb9c2f1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=ecs.g6.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1b,kubernetes.io/arch=amd64,kubernetes.io/hostname=qili-ali-ldjvs-worker-us-east-1b-zmlvc,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=ecs.g6.xlarge,node.openshift.io/os_id=rhcos,topology.diskplugin.csi.alibabacloud.com/zone=us-east-1b,topology.kubernetes.io/region=us-east-1,topology.kubernetes.io/zone=us-east-1b

IBMcloud has extra labels compared to other platforms: "ibm-cloud.kubernetes.io/worker-id" and "vpc-block-csi-driver-labels":

$ oc get node --show-labels
jitli0609ibm-f5vzl-worker-3-8jh6p   Ready   worker   3h35m   v1.24.0+bb9c2f1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2-4x16,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-gb,failure-domain.beta.kubernetes.io/zone=eu-gb-3,ibm-cloud.kubernetes.io/worker-id=07a7_a91f15e9-b528-459f-92c3-646ad67bc396,kubernetes.io/arch=amd64,kubernetes.io/hostname=jitli0609ibm-f5vzl-worker-3-8jh6p,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2-4x16,node.openshift.io/os_id=rhcos,topology.kubernetes.io/region=eu-gb,topology.kubernetes.io/zone=eu-gb-3,vpc-block-csi-driver-labels=true
That will be the issue then: cloud provider specific labels which the autoscaler is unaware of. Do we need to have platform-specific logic to ignore labels within the CAO? I guess that's the only way we can make this work reliably.
thanks Zhaohua, i agree with Joel, we might need to add a patch to the autoscaler to ignore these labels. given that clusterapi is meant to work on many platforms, we could propose an addition to the upstream to add these labels. see https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/clusterapi_nodegroups.go#L26

what concerns me is that while this might fix the new Alicloud and IBMcloud issues, we still have the previous error on AWS (where we should already be accounting for all the labels). i wonder if there are other labels being added by the CCM on AWS that we are missing?
i'm not sure if this would be helpful for testing, but i've created a branch with a patch to fix the labels for Alibaba and IBM. see https://github.com/elmiko/kubernetes-autoscaler/tree/bz2001027
This is going to need a little bit more investigation to nail down exactly which labels are causing these issues. We will try to schedule time for this, though it's likely this won't be prioritised this sprint. As this is only an issue currently with CCMs, this isn't urgent right now
PR proposed to upstream, https://github.com/kubernetes/autoscaler/pull/5110 i'm not 100% sure that this will completely solve the problem, but it's a step in the right direction. i will cherry-pick this PR back to our fork once it is merged.
just leaving some thoughts here, i'm doing more deep diving into this issue and i'm wondering whether, since new nodes coming up in a ccm environment carry the "uninitialized" taint until the ccm removes it, this is causing the autoscaler to detect a difference between the type of nodes that will be created and the type of nodes that already exist in the node group. i don't think this is the root cause, but it is another path of investigation.
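a quick way to check whether freshly provisioned nodes are still carrying that taint, assuming the standard external cloud provider taint key `node.cloudprovider.kubernetes.io/uninitialized`:

```
# list nodes whose taints still include the cloud-provider "uninitialized" taint
oc get nodes -o json | jq -r '.items[]
  | select(.spec.taints[]?.key == "node.cloudprovider.kubernetes.io/uninitialized")
  | .metadata.name'
```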
after looking through the code, i am less convinced about my previous theory. i am continuing to explore...
i've done some more testing here and i am able to reproduce the problem on AWS with the CCMs enabled. i am adding some debug information to the autoscaler in hopes that i can see why it thinks these node groups are not similar.
i have tried running this several times and for me it appears that the "topology.hostpath.csi/node" label is causing the issues. i created a special branch of the autoscaler that will print which parts of the balance algorithm are failing.

test 1

1. run autoscaler with "./cluster-autoscaler-amd64 --cloud-provider=clusterapi --v=4 --balance-similar-node-groups"
2. create workload from this definition:
```
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 6
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      containers:
      - name: sleep
        image: quay.io/elmiko/busybox
        resources:
          limits:
            cpu: 3
        command:
        - sleep
        - "3600"
```
3. examine results

$ oc get machinesets -n openshift-machine-api
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-dmwpvgb-76ef8-84wjg-worker-us-east-1b   6         6         1       1           81m
ci-ln-dmwpvgb-76ef8-84wjg-worker-us-east-1d   2         2         2       2           81m

logs show this:

I0825 16:29:46.231219  900054 compare_nodegroups.go:89] COMPARATOR -- labels not matching: [topology.hostpath.csi/node] [ip-10-0-193-36.ec2.internal ip-10-0-176-210.ec2.internal]
I0825 16:29:46.231243  900054 compare_nodegroups.go:168] COMPARATOR -- labels do not match

test 2

1. run autoscaler with "./cluster-autoscaler-amd64 --cloud-provider=clusterapi --v=4 --balance-similar-node-groups --balancing-ignore-label=topology.hostpath.csi/node"
2. create same workload as test 1
3. examine results

$ oc get machinesets -n openshift-machine-api
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-dmwpvgb-76ef8-84wjg-worker-us-east-1b   4         4         1       1           92m
ci-ln-dmwpvgb-76ef8-84wjg-worker-us-east-1d   5         5         2       2           92m

logs show no messages of failed comparison.

@zhsun i'm not sure why this test failed for you before, i think you were running on AWS as well. i'm fairly confident that the hostpath label is causing the issue here but i'm not sure about the best solution. the "topology.hostpath.csi/node" label is coming from the hostpath CSI driver, which is a non-production storage option. we have a few options:

1. propose a change in the upstream to the cluster-api nodegroupset processor to exclude hostpath labels. upstream probably won't mind if we change our own processor for this, but should we include the labels for a testing driver?

2. propose a change to the cluster-autoscaler-operator to add "--balancing-ignore-label=topology.hostpath.csi/node" to our deployment of the autoscaler. this would be a relatively quick change that we could make, but it shares the same concern about adding the testing driver.
i've added a PR to solve the IBM and Alibaba issues. we will need to figure out what to do about the hostpath label. another option i thought of is that we could expose the "--balancing-ignore-label" through our ClusterAutoscaler resource and then the CI clusters could use modified manifests to ensure we aren't processing the hostpath csi label. i think this is probably the best option.
@mimccune Sorry for the confusion. Before, I tested this on azure using the flag `--balancing-ignore-label=topology.hostpath.csi/node` and it didn't work. Now the label is "topology.csidriver.csi/node". I tested again on aws, gcp and azure with the flag "--balancing-ignore-label=topology.csidriver.csi/node"; it only works on aws. Will test alicloud and ibmcloud next week.

Clusterversion: 4.12.0-0.nightly-2022-08-24-053339

1. create clusterautoscaler
2. create 3 machineautoscaler
3. $ oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
4. $ oc scale deployment cluster-autoscaler-operator --replicas=0
5. $ oc edit deploy cluster-autoscaler-default
      - args:
        - --logtostderr
        - --cloud-provider=clusterapi
        - --namespace=openshift-machine-api
        - --leader-elect-lease-duration=137s
        - --leader-elect-renew-deadline=107s
        - --leader-elect-retry-period=26s
        - --max-nodes-total=20
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10s
        - --scale-down-delay-after-delete=10s
        - --scale-down-delay-after-failure=10s
        - --scale-down-unneeded-time=10s
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.csidriver.csi/node
        - --v=1
6. Create workload.

aws:
$ oc get node --show-labels
ip-10-0-197-214.us-east-2.compute.internal   Ready   worker   50m   v1.24.0+ed93380   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m6i.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-197-214.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m6i.xlarge,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=ip-10-0-197-214.us-east-2.compute.internal,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

$ oc get machinesets.machine
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws8262-gzkgz-worker-us-east-2a   6         6         1       1           69m
zhsunaws8262-gzkgz-worker-us-east-2b   5         5         1       1           69m
zhsunaws8262-gzkgz-worker-us-east-2c   6         6         1       1           69m

gcp:
$ oc get node --show-labels
zhsungcp826-rrq2m-worker-c-cpkl7.c.openshift-qe.internal   Ready   worker   49m   v1.24.0+ed93380   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n2-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsungcp826-rrq2m-worker-c-cpkl7.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=n2-standard-4,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=zhsungcp826-rrq2m-worker-c-cpkl7.c.openshift-qe.internal,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c

$ oc get machinesets.machine
NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp-412-6xfr6-worker-a   6         6         1       1           71m
zhsungcp-412-6xfr6-worker-b   10        10        1       1           71m
zhsungcp-412-6xfr6-worker-c   1         1         1       1           71m

azure:
$ oc get node --show-labels
zhsunazure-412-mssd4-worker-eastus3-b52j5   Ready   worker   11m   v1.24.0+ed93380   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eastus,failure-domain.beta.kubernetes.io/zone=eastus-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure-412-mssd4-worker-eastus3-b52j5,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=zhsunazure-412-mssd4-worker-eastus3-b52j5,topology.disk.csi.azure.com/zone=eastus-3,topology.kubernetes.io/region=eastus,topology.kubernetes.io/zone=eastus-3

$ oc get machinesets.machine
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunazure-412-mssd4-worker-eastus1   6         6         1       1           56m
zhsunazure-412-mssd4-worker-eastus2   10        10        1       1           56m
zhsunazure-412-mssd4-worker-eastus3   1         1         1       1           56m
@mimccune Ignore Comment 45. I summarized the results for the different platforms, PTAL. We need to ignore the label "ibm-cloud.kubernetes.io/vpc-instance-id" on IBMCloud, "topology.gke.io/zone" on GCP and "topology.disk.csi.azure.com/zone" on Azure, plus the common label "topology.csidriver.csi/node" on aws/gcp/azure if CCM is enabled.

1. Alicloud
Verified on Alicloud.
$ oc get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunali829-9hvkw-worker-us-east-1a   5         5         2       2           68m
zhsunali829-9hvkw-worker-us-east-1b   6         6         1       1           68m
zhsunali829-9hvkw-worker-us-east-1bb  6         6         1       1           11m
------------------------
2. IBMCloud
Failed on IBMCloud. There is a new label "ibm-cloud.kubernetes.io/vpc-instance-id" compared to before; if we ignore this label, it works well.
$ oc get node --show-labels
zhsunibm829-zp6sh-worker-3-llr7t   Ready   worker   114s   v1.24.0+a097e26   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=bx2-4x16,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eu-gb,failure-domain.beta.kubernetes.io/zone=eu-gb-3,ibm-cloud.kubernetes.io/vpc-instance-id=07a7_9a6c1a0a-1435-40cb-a4af-f6f721b74863,ibm-cloud.kubernetes.io/worker-id=07a7_9a6c1a0a-1435-40cb-a4af-f6f721b74863,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunibm829-zp6sh-worker-3-llr7t,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=bx2-4x16,node.openshift.io/os_id=rhcos,topology.kubernetes.io/region=eu-gb,topology.kubernetes.io/zone=eu-gb-3,vpc-block-csi-driver-labels=true

Without ignoring the label "ibm-cloud.kubernetes.io/vpc-instance-id":
$ oc get machineset
NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunibm829-zp6sh-worker-1     6         6         1       1           90m
zhsunibm829-zp6sh-worker-2     1         1         1       1           90m
zhsunibm829-zp6sh-worker-3     10        10        1       1           90m

Ignoring the label "ibm-cloud.kubernetes.io/vpc-instance-id":
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=ibm-cloud.kubernetes.io/vpc-instance-id
$ oc get machineset
NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunibm829-zp6sh-worker-1     6         6         1       1           110m
zhsunibm829-zp6sh-worker-2     5         5         1       1           110m
zhsunibm829-zp6sh-worker-3     6         6         3       3           110m
------------------------
3. AWS
3.1 AWS without CCM works well. 4.12.0-0.nightly-2022-08-27-164831
$ oc get node --show-labels
ip-10-0-201-211.us-east-2.compute.internal   Ready   worker   12m   v1.24.0+a097e26   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m6i.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-201-211.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m6i.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
$ oc get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws829-zm7h7-worker-us-east-2a   6         6         1       1           6h53m
zhsunaws829-zm7h7-worker-us-east-2b   5         5         1       1           6h53m
zhsunaws829-zm7h7-worker-us-east-2c   6         6         1       1           6h53m

3.2 Failed on AWS with CCM enabled; by default it doesn't work, but with "- --balancing-ignore-label=topology.csidriver.csi/node" it works well.
$ oc get node --show-labels
ip-10-0-197-214.us-east-2.compute.internal   Ready   worker   50m   v1.24.0+ed93380   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m6i.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-197-214.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m6i.xlarge,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=ip-10-0-197-214.us-east-2.compute.internal,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.csidriver.csi/node
$ oc get machinesets.machine
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws8262-gzkgz-worker-us-east-2a   6         6         1       1           69m
zhsunaws8262-gzkgz-worker-us-east-2b   5         5         1       1           69m
zhsunaws8262-gzkgz-worker-us-east-2c   6         6         1       1           69m
------------------------
4. GCP
4.1 Failed on GCP without CCM; by default it doesn't work, but with "- --balancing-ignore-label=topology.gke.io/zone" it works well.
$ oc get node --show-labels
evakhoni-23262-ln6dj-worker-c-xgfq5.c.openshift-qe.internal   Ready   worker   115m   v1.24.0+c83b5d0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n2-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=evakhoni-23262-ln6dj-worker-c-xgfq5.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=n2-standard-4,node.openshift.io/os_id=rhcos,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.gke.io/zone
$ oc get machineset
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
shudi-412gcpkd99-ffnwr-worker-a   6         6         1       1           7h26m
shudi-412gcpkd99-ffnwr-worker-b   5         5         1       1           7h26m
shudi-412gcpkd99-ffnwr-worker-c   6         6         1       1           7h26m
shudi-412gcpkd99-ffnwr-worker-f   0         0                             7h26m

4.2 Failed on GCP with CCM enabled; by default it doesn't work, but with "- --balancing-ignore-label=topology.gke.io/zone" and "- --balancing-ignore-label=topology.csidriver.csi/node" it works well.
$ oc get node --show-labels
zhsungcp-412-829-td9hs-worker-c-74rg2.c.openshift-qe.internal   Ready   worker   80m   v1.24.0+a097e26   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n2-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsungcp-412-829-td9hs-worker-c-74rg2.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=n2-standard-4,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=zhsungcp-412-829-td9hs-worker-c-74rg2.c.openshift-qe.internal,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.csidriver.csi/node
        - --balancing-ignore-label=topology.gke.io/zone
$ oc get machinesets.machine
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp-412-829-td9hs-worker-a   5         5         1       1           95m
zhsungcp-412-829-td9hs-worker-b   6         6         1       1           95m
zhsungcp-412-829-td9hs-worker-c   6         6         1       1           95m
zhsungcp-412-829-td9hs-worker-f   0         0                             95m
-------------
5. Azure
5.1 Failed on Azure without CCM; by default it doesn't work, but with "- --balancing-ignore-label=topology.disk.csi.azure.com/zone" it works well.
$ oc get node --show-labels
mihuangazf0829-rtc72-worker-eastus3-t4zcb   Ready   worker   5h43m   v1.24.0+a097e26   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eastus,failure-domain.beta.kubernetes.io/zone=eastus-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=mihuangazf0829-rtc72-worker-eastus3-t4zcb,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=eastus-3,topology.kubernetes.io/region=eastus,topology.kubernetes.io/zone=eastus-3
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.disk.csi.azure.com/zone
$ oc get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
mihuangazf0829-rtc72-worker-eastus1   6         6         1       1           6h12m
mihuangazf0829-rtc72-worker-eastus2   6         6         1       1           6h12m
mihuangazf0829-rtc72-worker-eastus3   5         5         1       1           6h12m

5.2 Failed on Azure with CCM enabled; by default it doesn't work, but with "- --balancing-ignore-label=topology.disk.csi.azure.com/zone" and "- --balancing-ignore-label=topology.csidriver.csi/node" it works well.
$ oc get node --show-labels
zhsunazure826-rgcjq-worker-eastus2-nl9t7   Ready   worker   41m   v1.24.0+ed93380   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eastus,failure-domain.beta.kubernetes.io/zone=eastus-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=zhsunazure826-rgcjq-worker-eastus2-nl9t7,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.csidriver.csi/node=zhsunazure826-rgcjq-worker-eastus2-nl9t7,topology.disk.csi.azure.com/zone=eastus-2,topology.kubernetes.io/region=eastus,topology.kubernetes.io/zone=eastus-2
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=topology.csidriver.csi/node
        - --balancing-ignore-label=topology.disk.csi.azure.com/zone
$ oc get machinesets.machine
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunazure-412-4gf2c-worker-eastus1   6         6         1       1           47m
zhsunazure-412-4gf2c-worker-eastus2   6         6         1       1           47m
zhsunazure-412-4gf2c-worker-eastus3   5         5         1       1           47m
thanks @zhsun, i will create a new patch for upstream with these values to ignore as well:
"ibm-cloud.kubernetes.io/vpc-instance-id"
"topology.gke.io/zone"
"topology.disk.csi.azure.com/zone"
"topology.csidriver.csi/node"

thanks again for taking the time to enumerate them for me =)
i have created a patch for upstream, https://github.com/kubernetes/autoscaler/pull/5148; once it has merged i will cherry pick it into our autoscaler. this PR will cover everything except the "topology.csidriver.csi/node" label. after some discussions i have learned that this label is used exclusively by our shared storage driver, and also that its use might not be proper. there is a PR open to remove this label from 4.11 and future releases, see https://github.com/openshift/csi-driver-shared-resource/pull/111

@zhsun, we will have to manually exclude the "topology.csidriver.csi/node" label for the time being. we also have this issue, https://issues.redhat.com/browse/OCPCLOUD-1427, which should make it easier to use the ignore labels in the future. i will update again here once i am able to cherry-pick the change from upstream.
i have some new information which will make this bug more complicated for us to fix, apologies in advance ;) we had a meeting with the cluster-api community today (see https://www.youtube.com/watch?v=jbhca_9oPuQ) to talk about the balancing feature. the community would like to see us not encoding these ignored labels into the autoscaler and instead would prefer to see great documentation and deployment artifacts to help users know when to add labels to the ignore list. what this means for our bug is that we will need to first finish the work to expose the balancing ignore labels through the ClusterAutoscaler CRD (see https://issues.redhat.com/browse/OCPCLOUD-1427). then we need to update the CAO to deploy the proper labels when the autoscaler is deployed. then we can revisit this bug and test to ensure we have fixed the issue. this might take some time to get all the pieces in place, but i will update this bug as we make progress.
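to make the plan above concrete, here is a rough sketch of what the ClusterAutoscaler resource could look like once the balancing ignore labels work (OCPCLOUD-1427) lands. the field name follows the proposed API and the label values are the ones identified earlier in this bug, so treat this as illustrative rather than final:

```
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  balanceSimilarNodeGroups: true
  # labels the autoscaler should ignore when deciding whether node groups are similar;
  # the exact set needed depends on the platform and which CSI drivers / CCM are in use
  balancingIgnoredLabels:
  - topology.disk.csi.azure.com/zone
  - topology.gke.io/zone
  - ibm-cloud.kubernetes.io/vpc-instance-id
  - topology.csidriver.csi/node
```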
i've created a couple jira cards to track the work associated with this bug: https://issues.redhat.com/browse/OCPCLOUD-1669 https://issues.redhat.com/browse/OCPCLOUD-1670 1670 will be required to make this work on openshift
balancingIgnoredLabels works well for PR https://github.com/openshift/cluster-autoscaler-operator/pull/251 in https://issues.redhat.com/browse/OCPCLOUD-1427

Found another issue: if there are machinesets which scale up from 0, the autoscaler will first balance across those machinesets; only after they are full does it balance across the other node groups. For example, testing on gcp:

$ oc get machineautoscaler
NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     10    3m41s
machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   1     10    3m55s
machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     10    4m15s

Add workload:
Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 10
Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-a}
I1010 07:46:11.862566       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 1->3 (max: 10)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-a 1->3 (max: 10)}]
--------
$ oc get machineautoscaler
NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     10    22m
machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   0     9     22m
machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     9     22m

Add workload:
Capping size to max cluster total size (20)
I1010 08:32:19.557095       1 scale_up.go:591] Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f}
I1010 08:32:19.557132       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 0->8 (max: 9)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f 0->7 (max: 9)}]
I1010 08:32:19.557149       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b size to 8
I1010 08:32:20.161422       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 7

$ oc get machineset
NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp10-lmfbm-worker-a   1         1         1       1           123m
zhsungcp10-lmfbm-worker-b   8         8                             123m
zhsungcp10-lmfbm-worker-c   1         1         1       1           123m
zhsungcp10-lmfbm-worker-f   7         7                             123m
------------
$ oc get machineset
NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp10-lmfbm-worker-a   1         1         1       1           128m
zhsungcp10-lmfbm-worker-b   0         0                             128m
zhsungcp10-lmfbm-worker-c   1         1         1       1           128m
zhsungcp10-lmfbm-worker-f   0         0                             128m

$ oc get machineautoscaler
NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     20    39m
machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   0     19    39m
machineautoscaler-c   MachineSet   zhsungcp10-lmfbm-worker-c   1     20    13s
machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     19    39m

Add workload:
I1010 08:39:48.865639       1 scale_up.go:481] Estimated 26 nodes needed in MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b
I1010 08:39:48.865645       1 scale_up.go:486] Capping size to max cluster total size (30)
I1010 08:39:49.605437       1 scale_up.go:591] Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f}
I1010 08:39:49.605472       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 0->13 (max: 19)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f 0->12 (max: 19)}]
I1010 08:39:49.605492       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b size to 13
I1010 08:39:50.209449       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 12

$ oc get machineset
NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsungcp10-lmfbm-worker-a   1         1         1       1           130m
zhsungcp10-lmfbm-worker-b   13        13                            130m
zhsungcp10-lmfbm-worker-c   1         1         1       1           130m
zhsungcp10-lmfbm-worker-f   12        12                            130m
thanks for the update Zhaohua, i am in awe of your ability to find new bugs with this issue XD i talked with the team today and i'm thinking that maybe we should open a new bug for the scale from zero balancing issue. my concern here is that we are finding so many errors in this balancing bug that we are overloading the content here. my hope is that by scoping a bug more narrowly on the scale from zero activity we could make these more discoverable in the future. what do you think about completing the original bug here and then opening a new one?
I agree Michael. I will close this one and have opened a new one to track the scale from zero balancing issue: https://issues.redhat.com/browse/OCPBUGS-2257
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399