Bug 1731011
Summary: | [CA] Sometimes "--balance-similar-node-groups" option doesn't work well | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | sunzhaohua <zhsun> |
Component: | Cloud Compute | Assignee: | Andrew McDermott <amcdermo> |
Status: | CLOSED ERRATA | QA Contact: | sunzhaohua <zhsun> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.2.0 | CC: | agarcial, amcdermo, bperkins, clasohm, jchaloup, jhou, mdhanve, rkrawitz |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 4.2.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-10-16 06:29:52 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1733235 | ||
Bug Blocks: |
Description
sunzhaohua
2019-07-18 06:53:18 UTC
Before you run the autoscaler test please could you:

- scale down the replicas in the '2a' machineset to 0
- scale it up to 1 again
- wait for the new machine/node to become Ready
- and then start the autoscaler test

This occurs because the memory capacity of one node differs from the capacity of the other nodes. See: https://bugzilla.redhat.com/show_bug.cgi?id=1733235

After installation I see that one of the worker nodes reports a different amount of RAM compared to the other two. If you scale the machineset down to 0, then back to 1, _AND_ the new node becomes "Ready", _AND_ it reports the same amount of RAM as the other nodes, then I do see the autoscaler report that it is "Splitting scale-up between 3 nodegroups".

The cluster autoscaler's default logic for determining whether one node group is similar to another requires that capacity be identical: https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79

@andrew I tried following the steps, the result is as expected.
- scale down the replicas in the '2a' machineset to 0
- scale it up to 1 again
- wait for the new machine/node to become Ready
- create clusterautoscaler, machineautoscaler
- add workload

```
I0725 14:49:14.483020       1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b
I0725 14:49:14.483183       1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b, openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c, openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a}
I0725 14:49:14.483211       1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b 1->18 (max: 30)} {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c 1->17 (max: 30)} {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a 1->17 (max: 30)}]
I0725 14:49:14.483251       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b size to 18
I0725 14:49:14.497758       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c size to 17
I0725 14:49:14.511498       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a size to 17
```

(In reply to sunzhaohua from comment #3)
> @andrew I tried following the steps, the result is as expected.
The problem still remains that we are getting machines/nodes with varying amounts of RAM, which to me is unexpected; see: https://bugzilla.redhat.com/show_bug.cgi?id=1733235

As far as this bug goes I think the behaviour is correct, so this is not a bug. If the cluster autoscaler requires memory capacity to be identical when balancing among node groups, then it can only split the workload among those groups whose capacities are equal. We're seeing the correct behaviour from the cluster autoscaler here.
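To make the exact-match requirement concrete, here is a minimal, dependency-free sketch of the capacity rule described above. `nodesSimilar` is a hypothetical helper, not the autoscaler's actual API: the real code compares `resource.Quantity` values with `Cmp`, while plain `int64` Ki values stand in here.

```go
package main

import "fmt"

// nodesSimilar applies the autoscaler's capacity rule to two nodes'
// capacity maps: every resource must be reported by both nodes and be
// exactly equal, otherwise the node groups are not "similar" and are
// excluded from balanced scale-up.
func nodesSimilar(n1, n2 map[string]int64) bool {
	capacity := map[string][]int64{}
	for res, qty := range n1 {
		capacity[res] = append(capacity[res], qty)
	}
	for res, qty := range n2 {
		capacity[res] = append(capacity[res], qty)
	}
	// "For capacity we require exact match."
	for _, qtyList := range capacity {
		if len(qtyList) != 2 || qtyList[0] != qtyList[1] {
			return false
		}
	}
	return true
}

func main() {
	workerB := map[string]int64{"cpu": 4, "memory": 16424928} // memory in Ki
	workerA := map[string]int64{"cpu": 4, "memory": 16424152} // slightly less RAM reported

	fmt.Println(nodesSimilar(workerB, workerB)) // true
	fmt.Println(nodesSimilar(workerB, workerA)) // false: group is left out of balancing
}
```

Even a difference of a few hundred Ki, as seen on the '2a' worker, is enough to make the comparison fail.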
I looked at adding a 3-5% toleration on the capacity but am now wary of making this change based on the warning here: https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79

This says:

> // If this is ever changed, enforcing MaxCoresTotal and MaxMemoryTotal limits
> // as it is now may no longer work.

from the following function:

```go
func IsNodeInfoSimilar(n1, n2 *schedulernodeinfo.NodeInfo) bool {
	capacity := make(map[apiv1.ResourceName][]resource.Quantity)
	allocatable := make(map[apiv1.ResourceName][]resource.Quantity)
	free := make(map[apiv1.ResourceName][]resource.Quantity)
	nodes := []*schedulernodeinfo.NodeInfo{n1, n2}
	for _, node := range nodes {
		for res, quantity := range node.Node().Status.Capacity {
			capacity[res] = append(capacity[res], quantity)
		}
		for res, quantity := range node.Node().Status.Allocatable {
			allocatable[res] = append(allocatable[res], quantity)
		}
		requested := node.RequestedResource()
		for res, quantity := range (&requested).ResourceList() {
			freeRes := node.Node().Status.Allocatable[res].DeepCopy()
			freeRes.Sub(quantity)
			free[res] = append(free[res], freeRes)
		}
	}
	// For capacity we require exact match.
	// If this is ever changed, enforcing MaxCoresTotal and MaxMemoryTotal limits
	// as it is now may no longer work.
	for _, qtyList := range capacity {
		if len(qtyList) != 2 || qtyList[0].Cmp(qtyList[1]) != 0 {
			return false
		}
	}
	...
```

Note that mcelog can offline memory pages based on error detection: http://www.mcelog.org/badpageofflining.html

We are still investigating the issue. The next step is to see whether machines provisioned in AWS from publicly available images with various OSes (through a launch config) show the same capacity differences, to rule out whether this is RHCOS-image related.

One way to cope with the small difference in memory capacity is to tolerate up to a 1% difference.
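A sketch of what such a relaxed comparison might look like, assuming a ratio-based rule. `capacityWithinTolerance` is hypothetical: the real change would operate on `resource.Quantity` values and would have to address the MaxCoresTotal/MaxMemoryTotal warning quoted above.

```go
package main

import "fmt"

// capacityWithinTolerance treats two memory capacities (in Ki) as
// similar when they differ by at most `ratio` of the larger value,
// e.g. 0.01 for the 1% tolerance suggested in the comment above.
func capacityWithinTolerance(aKi, bKi int64, ratio float64) bool {
	larger, smaller := aKi, bKi
	if smaller > larger {
		larger, smaller = smaller, larger
	}
	return float64(larger-smaller) <= ratio*float64(larger)
}

func main() {
	// Two workers a few hundred Ki apart: well under 1%, so similar.
	fmt.Println(capacityWithinTolerance(16424928, 16424152, 0.01)) // true
	// A worker with half the RAM: clearly not similar.
	fmt.Println(capacityWithinTolerance(16424928, 8212464, 0.01)) // false
}
```

With ratio set to 0 this degenerates to the current exact-match behaviour, which is one way to keep the tolerance opt-in.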
I dare to say the same might be happening in https://bugzilla.redhat.com/show_bug.cgi?id=1633944

Verified

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-26-202352   True        False         23m     Cluster version is 4.2.0-0.nightly-2019-08-26-202352

$ oc logs -f cluster-autoscaler-default-6c445d886b-xprqt
I0827 08:28:01.466131       1 scale_up.go:265] 78 other pods are also unschedulable
I0827 08:28:01.511036       1 scale_up.go:422] Best option to resize: openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b
I0827 08:28:01.511080       1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b
I0827 08:28:01.511236       1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b, openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a, openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c}
I0827 08:28:01.511266       1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b 1->18 (max: 30)} {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a 1->17 (max: 30)} {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c 1->17 (max: 30)}]
I0827 08:28:01.511311       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b size to 18
I0827 08:28:01.523715       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a size to 17
I0827 08:28:01.538897       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c size to 17
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
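As a side note on the balancing arithmetic in the verification logs: the 1->18 / 1->17 / 1->17 plan is the 49 new nodes distributed so the final group sizes stay as even as possible. A dependency-free sketch of that distribution (`balanceScaleUp` is hypothetical and ignores the per-group max, which the 30-node limit never hits here):

```go
package main

import "fmt"

// balanceScaleUp spreads newNodes across similar node groups so the
// resulting group sizes stay as even as possible: each new node goes
// to whichever group is currently smallest.
func balanceScaleUp(current []int, newNodes int) []int {
	target := append([]int(nil), current...)
	for i := 0; i < newNodes; i++ {
		smallest := 0
		for j, size := range target {
			if size < target[smallest] {
				smallest = j
			}
		}
		target[smallest]++
	}
	return target
}

func main() {
	// Three similar groups at size 1, 49 new nodes needed.
	fmt.Println(balanceScaleUp([]int{1, 1, 1}, 49)) // [18 17 17]
}
```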