Bug 1731011

Summary: [CA] Sometimes "--balance-similar-node-groups" option doesn't work well

Product: OpenShift Container Platform
Component: Cloud Compute
Version: 4.2.0
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: Reopened
Reporter: sunzhaohua <zhsun>
Assignee: Andrew McDermott <amcdermo>
QA Contact: sunzhaohua <zhsun>
CC: agarcial, amcdermo, bperkins, clasohm, jchaloup, jhou, mdhanve, rkrawitz
Type: Bug
Last Closed: 2019-10-16 06:29:52 UTC
Bug Depends On: 1733235

Description sunzhaohua 2019-07-18 06:53:18 UTC
Description of problem:
 "--balance-similar-node-groups" option doesn't work well. If I have 3 groups, sometimes balanced in 1 group, sometimes 2 group, sometimes 3. 

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-15-231921

How reproducible:
sometimes

Steps to Reproduce:
1. Create a clusterautoscaler with "balanceSimilarNodeGroups: true"
2. Create 3 machineautoscalers, one per MachineSet (example manifests for steps 1-2 are sketched after step 3)
$ oc get machineautoscaler
NAME                                 REF KIND     REF NAME                             MIN   MAX   AGE
zhsun-0716-wmwvm-worker-us-east-2a   MachineSet   zhsun-0716-wmwvm-worker-us-east-2a   1     30    31m
zhsun-0716-wmwvm-worker-us-east-2b   MachineSet   zhsun-0716-wmwvm-worker-us-east-2b   1     30    31m
zhsun-0716-wmwvm-worker-us-east-2c   MachineSet   zhsun-0716-wmwvm-worker-us-east-2c   1     30    31m
3. Add a workload:
$ oc create -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  generateName: work-queue-
spec:
  template:
    spec:
      containers:
      - name: work
        image: busybox
        command: ["sleep",  "86400"]
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
      restartPolicy: Never
      nodeSelector:
        node-role.kubernetes.io/worker: ""
  backoffLimit: 4
  completions: 100
  parallelism: 100
EOF 
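
For reference, the ClusterAutoscaler and MachineAutoscaler objects from steps 1-2 can be created along these lines. This is a minimal sketch of the manifests implied by those steps: the MachineSet name and min/max values are taken from the "oc get machineautoscaler" output above, and the MachineAutoscaler is repeated for each of the three MachineSets.

$ oc create -f - <<EOF
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  balanceSimilarNodeGroups: true
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: zhsun-0716-wmwvm-worker-us-east-2a
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 30
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsun-0716-wmwvm-worker-us-east-2a
EOF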

Actual results:
The scale-up is balanced across only 1 group or 2 groups.

1 group
I0718 04:50:30.779752       1 scale_up.go:426] Estimated 35 nodes needed in openshift-machine-api/zhsun-0715-wljql-worker-us-east-2b
I0718 04:50:30.779838       1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun-0715-wljql-worker-us-east-2b 1->30 (max: 30)}]
I0718 04:50:30.779861       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun-0715-wljql-worker-us-east-2b size to 30
I0718 04:50:40.819094       1 scale_up.go:262] Pod openshift-machine-config-operator/etcd-quorum-guard-65994dbd87-qf4d5 is unschedulable
I0718 04:50:40.819127       1 scale_up.go:262] Pod openshift-machine-api/work-queue-7mprl-mr57n is unschedulable
...
I0718 04:50:40.819298       1 scale_up.go:265] 75 other pods are also unschedulable
I0718 04:50:40.851359       1 scale_up.go:422] Best option to resize: openshift-machine-api/zhsun-0715-wljql-worker-us-east-2c
I0718 04:50:40.851389       1 scale_up.go:426] Estimated 18 nodes needed in openshift-machine-api/zhsun-0715-wljql-worker-us-east-2c
I0718 04:50:40.851504       1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun-0715-wljql-worker-us-east-2c 1->19 (max: 30)}]
I0718 04:50:40.851539       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun-0715-wljql-worker-us-east-2c size to 19

2 groups
I0718 05:39:04.171776       1 scale_up.go:422] Best option to resize: openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c
I0718 05:39:04.171813       1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c
I0718 05:39:04.171957       1 scale_up.go:521] Splitting scale-up between 2 similar node groups: {openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c, openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2a}
I0718 05:39:04.172318       1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c 1->26 (max: 30)} {openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2a 1->25 (max: 30)}]
I0718 05:39:04.172358       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2c size to 26
I0718 05:39:04.196763       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun-0716-wmwvm-worker-us-east-2a size to 25

Expected results:
The scale-up is balanced across all 3 groups.

Additional info:

Comment 1 Andrew McDermott 2019-07-25 09:04:33 UTC
Before you run the autoscaler test, could you please:

- scale down the replicas in the '2a' machineset to 0
- scale it up to 1 again
- wait for the new machine/node to become Ready
- and then start the autoscaler test

Comment 2 Andrew McDermott 2019-07-25 13:47:36 UTC
This occurs because the memory capacity of one node is different from the capacity of the other nodes.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1733235

After installation I see that one of the worker nodes reports a different amount of RAM compared to the other two.

If you scale the machineset down to 0, then back to 1, _AND_ it becomes "Ready", _AND_ it reports the same amount of RAM as the other nodes, then I do see the autoscaler report that it is "Splitting scale-up between 3 similar node groups".
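
For example, the difference can be seen by comparing the memory capacity reported on the Node objects (one possible command; any equivalent inspection of "oc get nodes -o yaml" works):

$ oc get nodes -l node-role.kubernetes.io/worker= \
    -o custom-columns=NAME:.metadata.name,MEMORY:.status.capacity.memory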

The cluster autoscaler's default logic for determining whether one node group is similar to another requires that capacity match exactly:

https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79
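
To illustrate why a small difference matters, capacity quantities are compared with resource.Quantity.Cmp, so even a few MiB of difference means the groups are never treated as similar. The values below are made-up example quantities, not readings from this cluster:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Hypothetical memory capacity reported by two worker nodes,
	// differing by ~4 MiB.
	memA := resource.MustParse("16423940Ki")
	memB := resource.MustParse("16419844Ki")

	// Mirrors the exact-match check linked above: any non-zero Cmp
	// result means the node groups are not considered "similar".
	fmt.Println("identical capacity:", memA.Cmp(memB) == 0) // prints: false
}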

Comment 3 sunzhaohua 2019-07-25 14:54:32 UTC
@andrew I tried following the steps, and the result is as expected.
- scale down the replicas in the '2a' machineset to 0
- scale it up to 1 again
- wait for the new machine/node to become Ready
- create clusterautoscaler, machineautoscaler
- add workload


I0725 14:49:14.483020       1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b
I0725 14:49:14.483183       1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b, openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c, openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a}
I0725 14:49:14.483211       1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b 1->18 (max: 30)} {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c 1->17 (max: 30)} {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a 1->17 (max: 30)}]
I0725 14:49:14.483251       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b size to 18
I0725 14:49:14.497758       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c size to 17
I0725 14:49:14.511498       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a size to 17

Comment 4 Andrew McDermott 2019-07-25 15:38:42 UTC
(In reply to sunzhaohua from comment #3)
> @andrew I tried following the steps, the result is as expected.
> - scale down the replicas in the '2a' machineset to 0
> - scale it up to 1 again
> - wait for the new machine/node to become Ready
> - create clusterautoscaler, machineautoscaler
> - add workload
> 
> 
> I0725 14:49:14.483020       1 scale_up.go:426] Estimated 49 nodes needed in
> openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b
> I0725 14:49:14.483183       1 scale_up.go:521] Splitting scale-up between 3
> similar node groups: {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b,
> openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c,
> openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a}
> I0725 14:49:14.483211       1 scale_up.go:529] Final scale-up plan:
> [{openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b 1->18 (max: 30)}
> {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c 1->17 (max: 30)}
> {openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a 1->17 (max: 30)}]
> I0725 14:49:14.483251       1 scale_up.go:686] Scale-up: setting group
> openshift-machine-api/zhsun1-lpwf6-worker-us-east-2b size to 18
> I0725 14:49:14.497758       1 scale_up.go:686] Scale-up: setting group
> openshift-machine-api/zhsun1-lpwf6-worker-us-east-2c size to 17
> I0725 14:49:14.511498       1 scale_up.go:686] Scale-up: setting group
> openshift-machine-api/zhsun1-lpwf6-worker-us-east-2a size to 17

The problem remains that we are getting machines/nodes with varying amounts of RAM, which to me is unexpected; see: https://bugzilla.redhat.com/show_bug.cgi?id=1733235

As far as this bug goes, I think the behaviour is correct, so this is not a bug.

If the cluster autoscaler requires memory capacity to be identical when balancing across node groups, then it can only split the workload among the groups whose capacity is equal. We're seeing the correct behaviour from the cluster autoscaler here.

Comment 5 Andrew McDermott 2019-07-26 13:09:20 UTC
I looked at adding a 3-5% tolerance on the capacity comparison, but am now wary of making this change because of the warning here:

https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L79

This says:

// If this is ever changed, enforcing MaxCoresTotal and MaxMemoryTotal limits
// as it is now may no longer work.

from the following function:

func IsNodeInfoSimilar(n1, n2 *schedulernodeinfo.NodeInfo) bool {
	capacity := make(map[apiv1.ResourceName][]resource.Quantity)
	allocatable := make(map[apiv1.ResourceName][]resource.Quantity)
	free := make(map[apiv1.ResourceName][]resource.Quantity)
	nodes := []*schedulernodeinfo.NodeInfo{n1, n2}
	for _, node := range nodes {
		for res, quantity := range node.Node().Status.Capacity {
			capacity[res] = append(capacity[res], quantity)
		}
		for res, quantity := range node.Node().Status.Allocatable {
			allocatable[res] = append(allocatable[res], quantity)
		}
		requested := node.RequestedResource()
		for res, quantity := range (&requested).ResourceList() {
			freeRes := node.Node().Status.Allocatable[res].DeepCopy()
			freeRes.Sub(quantity)
			free[res] = append(free[res], freeRes)
		}
	}
	// For capacity we require exact match.
	// If this is ever changed, enforcing MaxCoresTotal and MaxMemoryTotal limits
	// as it is now may no longer work.
	for _, qtyList := range capacity {
		if len(qtyList) != 2 || qtyList[0].Cmp(qtyList[1]) != 0 {
			return false
		}
	}
...
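
For context, the kind of 3-5% tolerance discussed above would look roughly like the sketch below. This is purely illustrative (the function and constant names here are made up) and is not necessarily what was eventually merged in PR 113:

package nodegroupset // placement shown for illustration only

import (
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// maxCapacityMemoryDifferenceRatio is a hypothetical knob: tolerate up to
// a 3% relative difference in reported memory capacity.
const maxCapacityMemoryDifferenceRatio = 0.03

// memoryCapacityWithinTolerance compares two memory capacity quantities,
// allowing a small relative difference instead of requiring an exact match.
func memoryCapacityWithinTolerance(a, b resource.Quantity) bool {
	larger := math.Max(float64(a.Value()), float64(b.Value()))
	smaller := math.Min(float64(a.Value()), float64(b.Value()))
	if larger == 0 {
		return true
	}
	return (larger-smaller)/larger <= maxCapacityMemoryDifferenceRatio
}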

Comment 6 Robert Krawitz 2019-07-26 14:53:05 UTC
Note that mcelog can offline memory pages based on error detection: http://www.mcelog.org/badpageofflining.html

Comment 7 Jan Chaloupka 2019-08-13 12:42:10 UTC
We are still investigating the issue. The next step is to see whether we can provision machines in AWS through a launch configuration using publicly available images with various OSes, to check whether the capacity differences still appear and so rule out (or confirm) that this is RHCOS-image-related. One way to cope with the small difference in memory capacity is to tolerate a difference of up to 1%.

Comment 8 Jan Chaloupka 2019-08-19 11:04:56 UTC
I dare say the same thing might be happening in https://bugzilla.redhat.com/show_bug.cgi?id=1633944

Comment 10 Jan Chaloupka 2019-08-26 10:58:06 UTC
https://github.com/openshift/kubernetes-autoscaler/pull/113 merged

Comment 12 sunzhaohua 2019-08-27 08:43:00 UTC
Verified

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-26-202352   True        False         23m     Cluster version is 4.2.0-0.nightly-2019-08-26-202352

$ oc logs -f cluster-autoscaler-default-6c445d886b-xprqt
I0827 08:28:01.466131       1 scale_up.go:265] 78 other pods are also unschedulable
I0827 08:28:01.511036       1 scale_up.go:422] Best option to resize: openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b
I0827 08:28:01.511080       1 scale_up.go:426] Estimated 49 nodes needed in openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b
I0827 08:28:01.511236       1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b, openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a, openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c}
I0827 08:28:01.511266       1 scale_up.go:529] Final scale-up plan: [{openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b 1->18 (max: 30)} {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a 1->17 (max: 30)} {openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c 1->17 (max: 30)}]
I0827 08:28:01.511311       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2b size to 18
I0827 08:28:01.523715       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2a size to 17
I0827 08:28:01.538897       1 scale_up.go:686] Scale-up: setting group openshift-machine-api/zhsun5-5b6hm-worker-us-east-2c size to 17

Comment 13 errata-xmlrpc 2019-10-16 06:29:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922