Description of problem:
The "--balance-similar-node-groups" option doesn't work reliably. With 3 similar node groups, the scale-up is sometimes balanced across only 1 group, sometimes 2, and sometimes all 3.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-15-231921

How reproducible:
Sometimes

Steps to Reproduce:
1. Create a clusterautoscaler, set "balanceSimilarNodeGroups: true"
2. Create 3 machineautoscalers
$ oc get machineautoscaler
NAME                                 REF KIND     REF NAME                             MIN   MAX   AGE
zhsun-0716-wmwvm-worker-us-east-2a   MachineSet   zhsun-0716-wmwvm-worker-us-east-2a   1     30    31m
zhsun-0716-wmwvm-worker-us-east-2b   MachineSet   zhsun-0716-wmwvm-worker-us-east-2b   1     30    31m
zhsun-0716-wmwvm-worker-us-east-2c   MachineSet   zhsun-0716-wmwvm-worker-us-east-2c   1     30    31m
3. Add workload
$ oc create -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  generateName: work-queue-
spec:
  template:
    spec:
      containers:
      - name: work
        image: busybox
        command: ["sleep", "86400"]
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
      restartPolicy: Never
      nodeSelector:
        node-role.kubernetes.io/worker: ""
  backoffLimit: 4
  completions: 100
  parallelism: 100
EOF

Actual results:
Scale-up is balanced across only 1 group or 2 groups.

1 group:
I0718 04:50:30.779752 1 scale_up.go:426] Estimated 35 nodes needed in openshift-machine-api/zhsun-0715-wljql-worker-us-east-2b
I0718 04:50:30.779838 1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun-0715-wljql-worker-us-east-2b 1->30 (max: 30)}]
I0718 04:50:30.779861 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun-0715-wljql-worker-us-east-2b size to 30
I0718 04:50:40.819094 1 scale_up.go:262] Pod openshift-machine-config-operator/etcd-quorum-guard-65994dbd87-qf4d5 is unschedulable
I0718 04:50:40.819127 1 scale_up.go:262] Pod openshift-machine-api/work-queue-7mprl-mr57n is unschedulable
...
I0718 04:50:40.819298 1 scale_up.go:265] 75 other pods are also unschedulable
I0718 04:50:40.851359 1 scale_up.go:422] Best option to resize: openshift-machine-api/zhsun-0715-wljql-worker-us-east-2c
I0718 04:50:40.851389 1 scale_up.go:426] Estimated 18 nodes needed in openshift-machine-api/zhsun-0715-wljql-worker-us-east-2c
I0718 04:50:40.851504 1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun-0715-wljql-worker-us-east-2c 1->19 (max: 30)}]
I0718 04:50:40.851539 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun-0715-wljql-worker-us-east-2c size to 19

2 groups:
I0718 05:39:04.171776 1 scale_up.go:422] Best option to resize: openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c
I0718 05:39:04.171813 1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c
I0718 05:39:04.171957 1 scale_up.go:521] Splitting scale-up between 2 similar node groups: {openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c, openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2a}
I0718 05:39:04.172318 1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c 1->26 (max: 30)} {openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2a 1->25 (max: 30)}]
I0718 05:39:04.172358 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c size to 26
I0718 05:39:04.196763 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2a size to 25

Expected results:
Scale-up is balanced across all 3 groups.

Additional info:
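For reference, a ClusterAutoscaler/MachineAutoscaler pair matching steps 1 and 2 might look like the following. This is only an illustrative sketch: the apiVersions and field layout are my assumption based on the OpenShift 4.2 autoscaling API, and the names simply mirror the machinesets listed above (one MachineAutoscaler per machineset).

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  balanceSimilarNodeGroups: true
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: zhsun-0716-wmwvm-worker-us-east-2a
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 30
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsun-0716-wmwvm-worker-us-east-2a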
Before you run the autoscaler test please could you:
- scale down the replicas in the '2a' machineset to 0
- scale it up to 1 again
- wait for the new machine/node to become Ready
- and then start the autoscaler test
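For example, using the machineset names from the bug description (substitute your cluster's own names; these commands are illustrative, not part of the original comment):

$ oc scale machineset zhsun-0716-wmwvm-worker-us-east-2a -n openshift-machine-api --replicas=0
$ oc scale machineset zhsun-0716-wmwvm-worker-us-east-2a -n openshift-machine-api --replicas=1
$ oc get machines -n openshift-machine-api -w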
This occurs because the memory capacity of one node is different from the capacity of the other nodes. See: https://bugzilla.redhat.com/show_bug.cgi?id=1733235

After installation I see that one of the worker nodes reports a different amount of RAM compared to the other two. If you scale the machineset down to 0, then back to 1, _AND_ after it becomes "Ready" _AND_ it reports the same amount of RAM as the other nodes, then I do see the autoscaler report that it is "Splitting scale-up between 3 nodegroups".

The default cluster autoscaler logic for determining whether a nodegroup is similar to other nodegroups requires that capacity is identical: https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79
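One quick way to confirm the mismatch (an illustrative command, not from the original comment) is to list the memory capacity each worker node reports:

$ oc get nodes -l node-role.kubernetes.io/worker= \
    -o custom-columns=NAME:.metadata.name,MEMORY:.status.capacity.memory

If one node's value differs from the others, the comparator linked above treats its nodegroup as not similar and leaves it out of the balancing.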
@andrew I tried following the steps, the result is as expected.
- scale down the replicas in the '2a' machineset to 0
- scale it up to 1 again
- wait for the new machine/node to become Ready
- create clusterautoscaler, machineautoscaler
- add workload

I0725 14:49:14.483020 1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b
I0725 14:49:14.483183 1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b, openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c, openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a}
I0725 14:49:14.483211 1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b 1->18 (max: 30)} {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c 1->17 (max: 30)} {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a 1->17 (max: 30)}]
I0725 14:49:14.483251 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b size to 18
I0725 14:49:14.497758 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c size to 17
I0725 14:49:14.511498 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a size to 17
(In reply to sunzhaohua from comment #3)
> @andrew I tried following the steps, the result is as expected.
> - scale down the replicas in the '2a' machineset to 0
> - scale it up to 1 again
> - wait for the new machine/node to become Ready
> - create clusterautoscaler, machineautoscaler
> - add workload
>
> I0725 14:49:14.483020 1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b
> I0725 14:49:14.483183 1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b, openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c, openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a}
> I0725 14:49:14.483211 1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b 1->18 (max: 30)} {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c 1->17 (max: 30)} {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a 1->17 (max: 30)}]
> I0725 14:49:14.483251 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b size to 18
> I0725 14:49:14.497758 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c size to 17
> I0725 14:49:14.511498 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a size to 17

The problem still remains that we are getting machines/nodes with varying amounts of RAM - which to me is unexpected, see: https://bugzilla.redhat.com/show_bug.cgi?id=1733235

As far as this bug goes I think the behaviour is correct, and thus this is not a bug. If the cluster autoscaler requires memory capacity to be identical when balancing amongst nodes in nodegroups, then it can only split the workload up amongst those that are equal. We're seeing the correct behaviour from the cluster autoscaler here.
I looked at adding a 3-5% toleration on the capacity but am now wary of making this change based on the warning here:

https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79

This says:

// If this is ever changed, enforcing MaxCoresTotal and MaxMemoryTotal limits
// as it is now may no longer work.

from the following function:

func IsNodeInfoSimilar(n1, n2 *schedulernodeinfo.NodeInfo) bool {
	capacity := make(map[apiv1.ResourceName][]resource.Quantity)
	allocatable := make(map[apiv1.ResourceName][]resource.Quantity)
	free := make(map[apiv1.ResourceName][]resource.Quantity)
	nodes := []*schedulernodeinfo.NodeInfo{n1, n2}
	for _, node := range nodes {
		for res, quantity := range node.Node().Status.Capacity {
			capacity[res] = append(capacity[res], quantity)
		}
		for res, quantity := range node.Node().Status.Allocatable {
			allocatable[res] = append(allocatable[res], quantity)
		}
		requested := node.RequestedResource()
		for res, quantity := range (&requested).ResourceList() {
			freeRes := node.Node().Status.Allocatable[res].DeepCopy()
			freeRes.Sub(quantity)
			free[res] = append(free[res], freeRes)
		}
	}
	// For capacity we require exact match.
	// If this is ever changed, enforcing MaxCoresTotal and MaxMemoryTotal limits
	// as it is now may no longer work.
	for _, qtyList := range capacity {
		if len(qtyList) != 2 || qtyList[0].Cmp(qtyList[1]) != 0 {
			return false
		}
	}
	...
Note that mcelog can offline memory pages based on error detection: http://www.mcelog.org/badpageofflining.html
We are still investigating the issue. The next step is to try provisioning machines in AWS with publicly available images for various OSes through the launch config, to rule out (or confirm) that the memory capacity differences are related to the RHCOS image. One way to cope with the small difference in memory capacity is to tolerate up to a 1% difference when comparing node groups.
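To illustrate the tolerance idea, here is a minimal sketch of a helper that could replace the exact-match comparison for memory capacity in IsNodeInfoSimilar. This is only a sketch of the approach, not necessarily what the eventual fix looks like; the function name and the 1% constant are assumptions.

package nodegroupset

import (
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// maxCapacityMemoryDifferenceRatio is a hypothetical tolerance: node groups whose
// memory capacity differs by no more than 1% are still considered similar.
const maxCapacityMemoryDifferenceRatio = 0.01

// capacityWithinTolerance reports whether two quantities differ by no more than
// the given fraction of the larger value.
func capacityWithinTolerance(q1, q2 resource.Quantity, tolerance float64) bool {
	larger := math.Max(float64(q1.Value()), float64(q2.Value()))
	smaller := math.Min(float64(q1.Value()), float64(q2.Value()))
	if larger == 0 {
		return true
	}
	return (larger-smaller)/larger <= tolerance
}

Inside IsNodeInfoSimilar, the memory entry of the capacity map would then be checked with capacityWithinTolerance(qtyList[0], qtyList[1], maxCapacityMemoryDifferenceRatio) instead of requiring qtyList[0].Cmp(qtyList[1]) == 0, while keeping the exact match for other resources. Whether that interacts badly with the MaxCoresTotal/MaxMemoryTotal enforcement mentioned in the warning above is exactly the open question.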
I would venture that the same thing might be happening in https://bugzilla.redhat.com/show_bug.cgi?id=1633944
https://github.com/openshift/kubernetes-autoscaler/pull/113
https://github.com/openshift/kubernetes-autoscaler/pull/113 merged
Verified

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-26-202352   True        False         23m     Cluster version is 4.2.0-0.nightly-2019-08-26-202352

$ oc logs -f cluster-autoscaler-default-6c445d886b-xprqt
I0827 08:28:01.466131 1 scale_up.go:265] 78 other pods are also unschedulable
I0827 08:28:01.511036 1 scale_up.go:422] Best option to resize: openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b
I0827 08:28:01.511080 1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b
I0827 08:28:01.511236 1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b, openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a, openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c}
I0827 08:28:01.511266 1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b 1->18 (max: 30)} {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a 1->17 (max: 30)} {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c 1->17 (max: 30)}]
I0827 08:28:01.511311 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b size to 18
I0827 08:28:01.523715 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a size to 17
I0827 08:28:01.538897 1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c size to 17
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922