Bug 1824215 - Worker nodes have different amounts of memory [NEEDINFO]
Summary: Worker nodes have different amounts of memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.z
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 1846967
 
Reported: 2020-04-15 14:38 UTC by manisha
Modified: 2020-10-27 15:58 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The memory capacity of instances of the same type may not be exactly the same across different failure domains.
Consequence: The autoscaler determines that the node groups are different and does not balance workloads across failure domains.
Fix: Allow a 256MB tolerance on memory capacity when comparing node groups/failure domains.
Result: The autoscaler is more likely to balance workloads across failure domains when balanceSimilarNodeGroups is enabled.
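The balancing behavior described in the Doc Text can be sketched in Go. This is a minimal illustration with hypothetical helper names, not the autoscaler's actual API; the capacities are the Ki values reported later in this bug, converted to bytes.

```go
package main

import "fmt"

// tolerance is the memory-capacity tolerance from the fix: 256MB, in bytes.
const tolerance = int64(256) * 1024 * 1024

// balancedGroups returns the names of node groups whose memory capacity
// (in bytes) is within tolerance of the chosen group, mimicking what
// balanceSimilarNodeGroups needs in order to split a scale-up.
func balancedGroups(chosen string, capacities map[string]int64) []string {
	var out []string
	for name, c := range capacities {
		diff := capacities[chosen] - c
		if diff < 0 {
			diff = -diff
		}
		if diff <= tolerance {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// Capacities based on this report's m5.xlarge nodes (Ki values in bytes);
	// us-east-2c is 172032Ki lower, which is within the 256MB tolerance.
	caps := map[string]int64{
		"us-east-2a": 16416940 * 1024,
		"us-east-2b": 16416940 * 1024,
		"us-east-2c": 16244908 * 1024,
	}
	fmt.Println(len(balancedGroups("us-east-2c", caps))) // 3: all groups balance
}
```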
Clone Of:
Environment:
Last Closed: 2020-10-27 15:57:47 UTC
Target Upstream Version:
mfuruta: needinfo? (agarcial)




Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes-autoscaler pull 144 0 None closed BUG 1824215: Raise maximum memory capacity difference 2021-02-20 19:37:02 UTC
Github openshift kubernetes-autoscaler pull 152 0 None closed BUG 1824215: Allow small tolerance on memory capacity when comparing nodegroups 2021-02-20 19:37:02 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:58:06 UTC

Description manisha 2020-04-15 14:38:14 UTC
Description of problem:

The balanceSimilarNodeGroups option of the ClusterAutoscaler doesn't work when the memory discrepancy between nodes exceeds 128KB.

Cluster version:

 $ oc get clusterversion
 NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
 version   4.3.8     True        False         6h6m    Cluster version is 4.3.8


Steps to Reproduce:

The customer captured the memory discrepancy as follows:

Created 3 machinesets and repeated the following steps.

Step 1: Scale out the machinesets' replicas to 1
Step 2: Check the nodes' memory capacity
Step 3: Scale in the machinesets' replicas to 0

References:

1. Bug 1733235 - Installed worker nodes/machines have different amounts of memory
2. Bug 1731011 - [CA] Sometimes "--balance-similar-node-groups" option doesn't work well


Actual results:

Memory discrepancy reported up to 172016KB 


Additional info:

Comment 2 Joel Speed 2020-04-24 15:24:01 UTC
This is fixed upstream by this PR https://github.com/kubernetes/autoscaler/pull/2462 which changes the maximum memory difference to 256KB.

This change is already present in our 4.4.z branch (https://github.com/openshift/kubernetes-autoscaler/blob/release-4.4/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L36)

We could also backport the limit into the 4.3 branch, though I'm not sure whether there are any implications of doing this; I will need to investigate that further.
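For context on why a KB-scale limit turned out to be insufficient here: the discrepancy QA later observed between m5.xlarge nodes was 172032Ki, which exceeds a 256KB limit but fits comfortably under a 256MB one. A quick Go check (illustrative values and helper name only):

```go
package main

import "fmt"

// withinLimit reports whether a discrepancy given in KiB fits under a
// limit given in bytes.
func withinLimit(discrepancyKi, limitBytes int64) bool {
	return discrepancyKi*1024 <= limitBytes
}

func main() {
	const observedKi = 172032 // discrepancy between m5.xlarge nodes, in KiB

	fmt.Println(withinLimit(observedKi, 256*1024))      // false: exceeds a 256KB limit
	fmt.Println(withinLimit(observedKi, 256*1024*1024)) // true: fits under a 256MB limit
}
```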

Comment 6 sunzhaohua 2020-05-06 10:49:40 UTC
Failed QA

Test env: 4.3.0-0.nightly-2020-05-04-051714 on aws

1. Created 3 new machinesets with m5.xlarge; memory discrepancy reported up to 172032Ki
$ oc get node -o yaml | grep "memory"
      memory: 7008428Ki
      memory: 8159404Ki
      message: kubelet has sufficient memory available
      memory: 15265964Ki
      memory: 16416940Ki
      message: kubelet has sufficient memory available
      memory: 14793144Ki
      memory: 15944120Ki
      message: kubelet has sufficient memory available
      memory: 15265964Ki
      memory: 16416940Ki
      message: kubelet has sufficient memory available
      memory: 14793144Ki
      memory: 15944120Ki
      message: kubelet has sufficient memory available
      memory: 7008436Ki
      memory: 8159412Ki
      message: kubelet has sufficient memory available
      memory: 14965176Ki
      memory: 16116152Ki
      message: kubelet has sufficient memory available
      memory: 7008428Ki
      memory: 8159404Ki
      message: kubelet has sufficient memory available
      memory: 15265964Ki
      memory: 16416940Ki
      message: kubelet has sufficient memory available

2. Create clusterautoscaler with "balanceSimilarNodeGroups: true"
3. Create 3 machineautoscalers using the newly created machinesets
$ oc get machineautoscaler
NAME       REF KIND     REF NAME                                 MIN   MAX   AGE
worker-a   MachineSet   zhsun-0506432-wt796-worker-us-east-2aa   1     10    65m
worker-b   MachineSet   zhsun-0506432-wt796-worker-us-east-2bb   1     10    64m
worker-c   MachineSet   zhsun-0506432-wt796-worker-us-east-2cc   1     10    42m
4. Create a workload to scale up the cluster.
5. Check machines, nodes, and logs; balancing happens in only 2 groups.
I0506 10:27:44.492413       1 scale_up.go:273] 10 other pods are also unschedulable
I0506 10:27:44.500886       1 scale_up.go:430] Best option to resize: openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc
I0506 10:27:44.500911       1 scale_up.go:434] Estimated 10 nodes needed in openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc
I0506 10:27:44.501043       1 scale_up.go:539] Final scale-up plan: [{openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc 1->10 (max: 10)}]
I0506 10:27:44.501080       1 scale_up.go:700] Scale-up: setting group openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc size to 10
I0506 10:27:54.530600       1 scale_up.go:270] Pod openshift-machine-api/scale-up-5d784b79fd-t5jm8 is unschedulable
I0506 10:27:54.530626       1 scale_up.go:270] Pod openshift-machine-api/scale-up-5d784b79fd-hcb8h is unschedulable
I0506 10:27:54.530633       1 scale_up.go:270] Pod openshift-machine-api/scale-up-5d784b79fd-9g7gz is unschedulable
I0506 10:27:54.532116       1 scale_up.go:430] Best option to resize: openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb
I0506 10:27:54.532141       1 scale_up.go:434] Estimated 1 nodes needed in openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb
I0506 10:27:54.532250       1 scale_up.go:531] Splitting scale-up between 2 similar node groups: {openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb, openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2aa}
I0506 10:27:54.532280       1 scale_up.go:539] Final scale-up plan: [{openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb 1->2 (max: 10)}]
I0506 10:27:54.532300       1 scale_up.go:700] Scale-up: setting group openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb size to 2
I0506 10:28:04.588905       1 static_autoscaler.go:334] No unschedulable pods

If the memory discrepancy is small, it will balance across 3 groups:
      memory: 14793128Ki
      memory: 15944104Ki

      memory: 14793144Ki
      memory: 15944120Ki

      memory: 14793144Ki
      memory: 15944120Ki

I0506 10:19:23.150104       1 scale_up.go:430] Best option to resize: openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc
I0506 10:19:23.150127       1 scale_up.go:434] Estimated 10 nodes needed in openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc
I0506 10:19:23.150247       1 scale_up.go:531] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc, openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2aa, openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb}
I0506 10:19:23.150280       1 scale_up.go:539] Final scale-up plan: [{openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc 1->5 (max: 10)} {openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2aa 1->4 (max: 10)} {openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb 1->4 (max: 10)}]

Comment 7 Joel Speed 2020-05-06 11:12:13 UTC
@sunzhaohua Hey, is there a Polarion test case linked to this that I can take a look at? I'd like to see how the test case was set up so I can investigate why this failed.

Did you happen to do a must-gather for the cluster when you tested? I'm finding it hard to work out from the memory lists posted above which nodes are in which groups, so I can't compare the actual differences. Some of those machines have large differences between them, so it would be good to clarify.

Comment 11 sunzhaohua 2020-05-07 09:56:38 UTC
@Joel Speed 
Test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-20108
clusterversion: 4.5.0-0.nightly-2020-05-05-205255

Test steps:
1. update machineset setting "instanceType: m5.xlarge"
$ oc get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws506-4ghhm-worker-us-east-2a   1         1         1       1           31h
zhsunaws506-4ghhm-worker-us-east-2b   1         1         1       1           31h
zhsunaws506-4ghhm-worker-us-east-2c   1         1         1       1           31h
$ oc get machine
NAME                                        PHASE     TYPE        REGION      ZONE         AGE
zhsunaws506-4ghhm-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   31h
zhsunaws506-4ghhm-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   31h
zhsunaws506-4ghhm-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   31h
zhsunaws506-4ghhm-worker-us-east-2a-zzd7q   Running   m5.xlarge   us-east-2   us-east-2a   15m
zhsunaws506-4ghhm-worker-us-east-2b-zj974   Running   m5.xlarge   us-east-2   us-east-2b   15m
zhsunaws506-4ghhm-worker-us-east-2c-tsxxl   Running   m5.xlarge   us-east-2   us-east-2c   42m
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-134-184.us-east-2.compute.internal   Ready    worker   11m   v1.18.0-rc.1
ip-10-0-140-2.us-east-2.compute.internal     Ready    master   31h   v1.18.0-rc.1
ip-10-0-157-233.us-east-2.compute.internal   Ready    worker   11m   v1.18.0-rc.1
ip-10-0-158-45.us-east-2.compute.internal    Ready    master   31h   v1.18.0-rc.1
ip-10-0-164-148.us-east-2.compute.internal   Ready    master   31h   v1.18.0-rc.1
ip-10-0-171-149.us-east-2.compute.internal   Ready    worker   38m   v1.18.0-rc.1
$ oc get node | grep worker
ip-10-0-134-184.us-east-2.compute.internal   Ready    worker   11m   v1.18.0-rc.1
ip-10-0-157-233.us-east-2.compute.internal   Ready    worker   11m   v1.18.0-rc.1
ip-10-0-171-149.us-east-2.compute.internal   Ready    worker   38m   v1.18.0-rc.1
$ oc get node ip-10-0-134-184.us-east-2.compute.internal ip-10-0-157-233.us-east-2.compute.internal ip-10-0-171-149.us-east-2.compute.internal  -o yaml | grep "memory"
      memory: 14793144Ki
      memory: 15944120Ki

      memory: 14793128Ki
      memory: 15944104Ki

      memory: 14965176Ki
      memory: 16116152Ki

16116152Ki - 15944120Ki = 172032Ki

2. Create clusterautoscaler with "balanceSimilarNodeGroups: true"
---
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
3. Create 3 machineautoscalers
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-c"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsunaws506-4ghhm-worker-us-east-2c

$ oc get machineautoscalers
NAME       REF KIND     REF NAME                              MIN   MAX   AGE
worker-a   MachineSet   zhsunaws506-4ghhm-worker-us-east-2a   1     10    48m
worker-b   MachineSet   zhsunaws506-4ghhm-worker-us-east-2b   1     10    48m
worker-c   MachineSet   zhsunaws506-4ghhm-worker-us-east-2c   1     10    47m
4. Create workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 40
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: docker.io/library/busybox
        resources:
          requests:
            memory: 4Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0
5. Check logs and machineset

I0507 09:38:53.409088       1 scale_up.go:324] 13 other pods are also unschedulable
I0507 09:38:55.819066       1 scale_up.go:452] Best option to resize: openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2a
I0507 09:38:55.819104       1 scale_up.go:456] Estimated 11 nodes needed in openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2a
I0507 09:38:56.406750       1 scale_up.go:562] Splitting scale-up between 2 similar node groups: {openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2a, openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2b}
I0507 09:38:56.807804       1 scale_up.go:570] Final scale-up plan: [{openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2a 1->7 (max: 10)} {openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2b 1->6 (max: 10)}]


$ oc get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws506-4ghhm-worker-us-east-2a   7         7         7       7           31h
zhsunaws506-4ghhm-worker-us-east-2b   6         6         6       6           31h
zhsunaws506-4ghhm-worker-us-east-2c   1         1         1       1           31h

will attach must-gather

Comment 12 Joel Speed 2020-05-07 17:06:40 UTC
I've spent some time looking at this again and have determined that this bug is in fact still present in 4.5.
I've changed this BZ to point to 4.5 and will introduce the fix into that version and then backport.

Units for the resources coming from real nodes were not matching up with the units set in the check for a difference.
Instead of allowing a 256MB delta, it only allowed a 256KB delta, which is much smaller than would be expected as a difference across real nodes on cloud providers.
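The unit mix-up can be sketched as follows. This is a hypothetical helper, not the autoscaler's actual comparator (which lives in the nodegroupset package); it only shows how the same tolerance number behaves when interpreted as MB versus KB against kubelet-reported Ki capacities.

```go
package main

import "fmt"

const (
	kib = int64(1024)
	mib = 1024 * kib
)

// similarWithTolerance reports whether two byte capacities differ by at
// most tol bytes.
func similarWithTolerance(a, b, tol int64) bool {
	diff := a - b
	if diff < 0 {
		diff = -diff
	}
	return diff <= tol
}

func main() {
	// kubelet reports capacities in Ki; convert to bytes before comparing.
	a := 16116152 * kib
	b := 15944120 * kib

	fmt.Println(similarWithTolerance(a, b, 256*mib)) // true: the intended 256MB delta accepts the nodes
	fmt.Println(similarWithTolerance(a, b, 256*kib)) // false: a 256KB delta (the unit bug) rejects them
}
```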

Comment 15 Joel Speed 2020-05-18 10:06:42 UTC
Deferring to 4.6 while trying to agree on a strategy for fixing this upstream. Will backport to 4.5.z once we have agreed on an approach to fix this issue.

Comment 16 Alberto 2020-05-29 11:08:14 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1824215#c15 still applies. Tagging with upcomingSprint.

Comment 21 sunzhaohua 2020-06-15 02:13:41 UTC
Verified
clusterversion: 4.6.0-0.nightly-2020-06-12-084204

Test steps:
1. update machineset setting "instanceType: m5.xlarge"
$ oc get node | grep worker
ip-10-0-131-245.us-east-2.compute.internal   Ready    worker   27m   v1.18.3+2164959
ip-10-0-190-142.us-east-2.compute.internal   Ready    worker   27m   v1.18.3+2164959
ip-10-0-222-150.us-east-2.compute.internal   Ready    worker   23m   v1.18.3+2164959
$ oc get node ip-10-0-131-245.us-east-2.compute.internal ip-10-0-190-142.us-east-2.compute.internal ip-10-0-222-150.us-east-2.compute.internal -o yaml| grep "memory"
      memory: 14784824Ki
      memory: 15935800Ki

      memory: 14956872Ki
      memory: 16107848Ki

      memory: 14784824Ki
      memory: 15935800Ki

16107848Ki - 15935800Ki = 172048Ki

2. Create clusterautoscaler with "balanceSimilarNodeGroups: true"
---
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s
3. Create 3 machineautoscalers
$ oc get machineautoscaler
NAME       REF KIND     REF NAME                              MIN   MAX   AGE
worker-a   MachineSet   zhsun615aws-wrjpw-worker-us-east-2a   1     10    45s
worker-b   MachineSet   zhsun615aws-wrjpw-worker-us-east-2b   1     10    29s
worker-c   MachineSet   zhsun615aws-wrjpw-worker-us-east-2c   1     10    9s

4. Create workload

5. Check logs and machineset

I0615 02:09:08.056685       1 scale_up.go:324] 13 other pods are also unschedulable
I0615 02:09:10.431324       1 scale_up.go:452] Best option to resize: openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c
I0615 02:09:10.431372       1 scale_up.go:456] Estimated 11 nodes needed in openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c
I0615 02:09:11.018039       1 scale_up.go:562] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c, openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2a, openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2b}
I0615 02:09:11.617622       1 scale_up.go:570] Final scale-up plan: [{openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c 1->5 (max: 10)} {openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2a 1->5 (max: 10)} {openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2b 1->4 (max: 10)}]
I0615 02:09:11.617669       1 scale_up.go:659] Scale-up: setting group openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c size to 5
I0615 02:09:12.228972       1 scale_up.go:659] Scale-up: setting group openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2a size to 5
I0615 02:09:12.823869       1 scale_up.go:659] Scale-up: setting group openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2b size to 4

Comment 25 Masaki Furuta ( RH ) 2020-06-18 08:46:30 UTC
# I think this BZ has to be public because it links to upstream GitHub, so I copied this comment from my update in comment #24 for visibility, without the customer's name.

(In reply to Alberto from comment #23)

Hello Alberto Garcia Lamela,

I am sorry for jumping in; I am the RH OCP TAM for my customer.

As Red Hat already knows, this affects the customer's project, which uses v4.4+; therefore this is a big problem for my TAM customer.
(This has a really huge impact on my customer, as we already have a Support Exception for this, based on a request from the customer's executive-level discussions with RH PMs.)

Since we have no workaround or mitigation for this yet, please backport the fix as soon as possible.

I am grateful for your help.

Thank you,

BR,
Masaki

Comment 26 manisha 2020-07-01 09:39:04 UTC
Hi Team,

The customer would like to know the current status of the backports to 4.5 and 4.4.
Please let me know if we have any information regarding this that we could pass on to the customer.

Many thanks,
Manisha

Comment 27 Joel Speed 2020-07-01 10:30:50 UTC
Since this issue is not being considered a release blocker, we will have to wait for 4.5 to release before we can merge it into a 4.5.z stream. This also blocks the backport to 4.4, so the fix won't be available there until a few weeks after 4.5 is released.

Comment 29 errata-xmlrpc 2020-10-27 15:57:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

