Bug 1733235

Summary: Installed worker nodes/machines have different amounts of memory
Product: OpenShift Container Platform    Reporter: Andrew McDermott <amcdermo>
Component: Cloud Compute    Assignee: Andrew McDermott <amcdermo>
Status: CLOSED NOTABUG    QA Contact: Jianwei Hou <jhou>
Severity: low    Docs Contact:
Priority: unspecified    
Version: 4.2.0    CC: agarcial, dhardie, gblomqui, jchaloup, mdhanve, rkrawitz
Target Milestone: ---    Keywords: Reopened
Target Release: 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-11-06 13:05:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1731011    

Description Andrew McDermott 2019-07-25 13:40:58 UTC
Installing on either Azure or AWS, I see that worker nodes report different amounts of RAM in the node 'capacity' field.

If I scale up the available machinesets:

oc patch -n openshift-machine-api machineset/amcdermo-st695-worker-us-east-2a -p '{"spec":{"replicas":10}}' --type=merge;
oc patch -n openshift-machine-api machineset/amcdermo-st695-worker-us-east-2b -p '{"spec":{"replicas":10}}' --type=merge;
oc patch -n openshift-machine-api machineset/amcdermo-st695-worker-us-east-2c -p '{"spec":{"replicas":10}}' --type=merge;

and wait for the nodes to become "Ready", the reported memory capacity differs from node to node. I was expecting the worker nodes to report identical values.

$ oc get nodes -o json | jq '.items[].status.capacity["memory"]' | sort | uniq -c
      3 "16420436Ki" # Masters
     16 "8162892Ki"
     14 "8162900Ki"
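
The spread between the two worker values above is only 8 Ki. A quick way to quantify it from a list of capacity numbers (a minimal sketch using the sample values reported above; `mem_spread` is a hypothetical helper, not part of oc):

```shell
# Hypothetical helper: report the spread between the smallest and largest
# capacity values (in Ki) read from stdin.
mem_spread() {
    sort -n | awk 'NR==1 {min=$1} {max=$1} END {print (max - min) "Ki"}'
}

# Feed it the two worker values seen on AWS above.
printf '%s\n' 8162892 8162900 | mem_spread
```

In practice the input would come from the `oc get nodes ... | jq` pipeline above with the quotes and `Ki` suffix stripped.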

Doing the same on Azure I see:

$ oc patch -n openshift-machine-api machineset/amcdermo-190725-t9l84-worker-centralus1 -p '{"spec":{"replicas":10}}' --type=merge;
machineset.machine.openshift.io/amcdermo-190725-t9l84-worker-centralus1 patched
$ oc patch -n openshift-machine-api machineset/amcdermo-190725-t9l84-worker-centralus2 -p '{"spec":{"replicas":10}}' --type=merge;
machineset.machine.openshift.io/amcdermo-190725-t9l84-worker-centralus2 patched
$ oc patch -n openshift-machine-api machineset/amcdermo-190725-t9l84-worker-centralus3 -p '{"spec":{"replicas":10}}' --type=merge;
machineset.machine.openshift.io/amcdermo-190725-t9l84-worker-centralus3 patched

$ oc get nodes -o json | jq '.items[].status.capacity["memory"]' | sort | uniq  -c
      3 "16399092Ki" # Masters
      2 "8141552Ki"
     27 "8141560Ki"
      1 "8141752Ki"


Sometimes I see identical values reported after installation completes, but that tends to happen far more often on Azure than AWS right now. I tripped over the discrepancy, post initial installation, whilst investigating:

https://bugzilla.redhat.com/show_bug.cgi?id=1731011

$ ./bin/openshift-install version
./bin/openshift-install unreleased-master-1404-ge090d19dda2ba9d80dba652e168fcc8ed54f1b55
built from commit e090d19dda2ba9d80dba652e168fcc8ed54f1b55
release image registry.svc.ci.openshift.org/origin/release:4.2


(Apologies if this proves to be filed against the wrong domain/component.)

Comment 1 Andrew McDermott 2019-07-25 14:30:22 UTC
Note: when nodes/machines report differing amounts of memory, the value reported is derived from /proc/meminfo. If you ssh to each worker node you can see that machines in the same machineset report different values in /proc/meminfo.
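
For reference, the MemTotal extraction can be sketched like this (the sample /proc/meminfo content stands in for `ssh <node> cat /proc/meminfo`; a minimal illustration, not a cluster tool):

```shell
# Sample stand-in for `cat /proc/meminfo` output on a worker node.
meminfo='MemTotal:        8162900 kB
MemFree:         4470312 kB
MemAvailable:    7003856 kB'

# Extract the MemTotal value (in kB) that the kubelet's capacity figure
# is derived from.
printf '%s\n' "$meminfo" | awk '/^MemTotal:/ {print $2}'
```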

Comment 2 Robert Krawitz 2019-07-25 15:32:19 UTC
Andrew, if you do 'oc adm node-logs' (or dmesg on the node), take a look for the BIOS-provided memory map.  Is the memory map from the BIOS identical across nodes reporting different amounts of memory?

Comment 3 Andrew McDermott 2019-07-25 16:04:46 UTC
(In reply to Robert Krawitz from comment #2)
> Andrew, if you do 'oc adm node-logs' (or dmesg on the node), take a look for
> the BIOS-provided memory map.  Is the memory map from the BIOS identical
> across nodes reporting different amount of memory?

Right now I have:

$ oc get nodes -o json | jq '.items[].status.capacity["memory"]' | sort | uniq -c
      3 "16420436Ki"
      1 "8162892Ki"
      2 "8162900Ki"

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-139-28.us-east-2.compute.internal    Ready    master   6h26m   v1.14.0+bd34733a7
ip-10-0-143-189.us-east-2.compute.internal   Ready    worker   109m    v1.14.0+bd34733a7
ip-10-0-155-36.us-east-2.compute.internal    Ready    worker   109m    v1.14.0+bd34733a7
ip-10-0-157-15.us-east-2.compute.internal    Ready    master   6h26m   v1.14.0+bd34733a7
ip-10-0-167-148.us-east-2.compute.internal   Ready    worker   7m2s    v1.14.0+bd34733a7
ip-10-0-170-40.us-east-2.compute.internal    Ready    master   6h26m   v1.14.0+bd34733a7


$ ssh ip-10-0-143-189.us-east-2.compute.internal dmesg |grep Mem
[    0.000000] Memory: 3937920K/8388212K available (12292K kernel code, 2101K rwdata, 3816K rodata, 2356K init, 3320K bss, 289296K reserved, 0K cma-reserved)
[    0.201067] x86/mm: Memory block size: 128MB

$ ssh ip-10-0-143-189.us-east-2.compute.internal cat /proc/meminfo | egrep 'Mem(Total|Free|Available)'
MemTotal:        8162900 kB
MemFree:         4470312 kB
MemAvailable:    7003856 kB




$ ssh ip-10-0-155-36.us-east-2.compute.internal dmesg |grep Mem
[    0.000000] Memory: 3937920K/8388212K available (12292K kernel code, 2101K rwdata, 3816K rodata, 2356K init, 3320K bss, 289296K reserved, 0K cma-reserved)
[    0.237069] x86/mm: Memory block size: 128MB

$ ssh ip-10-0-155-36.us-east-2.compute.internal cat /proc/meminfo | egrep 'Mem(Total|Free|Available)'
MemTotal:        8162900 kB
MemFree:         3576548 kB
MemAvailable:    6200228 kB




$ ssh ip-10-0-167-148.us-east-2.compute.internal dmesg |grep Mem
[    0.000000] Memory: 3908240K/8388212K available (12292K kernel code, 2101K rwdata, 3816K rodata, 2356K init, 3320K bss, 289304K reserved, 0K cma-reserved)
[    0.206061] x86/mm: Memory block size: 128MB

$ ssh ip-10-0-167-148.us-east-2.compute.internal cat /proc/meminfo | egrep 'Mem(Total|Free|Available)'
MemTotal:        8162892 kB
MemFree:         4883292 kB
MemAvailable:    6526712 kB

Note the last node has 8162892 whereas the other worker nodes both have 8162900. And the output from dmesg shows:

- Memory: 3937920K/8388212K
- Memory: 3937920K/8388212K
- Memory: 3908240K/8388212K

Also if you look at https://bugzilla.redhat.com/show_bug.cgi?id=1731011#c3 you can get to a point where they are all equal, irrespective (it seems) of the availability zone.

Comment 4 Robert Krawitz 2019-07-25 17:25:02 UTC
I want more detailed data than that: the entire memory map.  Something like this:

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009cfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009d000-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000030f42fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000030f43000-0x0000000043de3fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000043de4000-0x0000000043de4fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000043de5000-0x000000004ff29fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000004ff2a000-0x000000004ffbefff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000004ffbf000-0x000000004fffefff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000004ffff000-0x0000000057ffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000058600000-0x000000005e7fffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed10000-0x00000000fed19fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed84000-0x00000000fed84fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ff800000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000109f7fffff] usable

If they're not identical, there's probably not much you can do.

Comment 5 Andrew McDermott 2019-07-26 08:13:56 UTC
Gathering via:

$ echo $workers
ip-10-0-143-189.us-east-2.compute.internal ip-10-0-155-36.us-east-2.compute.internal ip-10-0-167-148.us-east-2.compute.internal

$ type g
g is a function
g () 
{ 
    for i in $workers;
    do
        echo "Memory map for: $i";
        ssh -o StrictHostKeyChecking=no $i dmesg -t | egrep --color=auto 'BIOS' | tee /tmp/$i.meminfo;
        echo;
    done
}

$ diff3 ip-10-0-143-189.us-east-2.compute.internal ip-10-0-155-36.us-east-2.compute.internal ip-10-0-167-148.us-east-2.compute.internal
$ echo $?
0

And the output:


Memory map for: ip-10-0-143-189.us-east-2.compute.internal
BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable
SMBIOS 2.7 present.
DMI: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
intel_idle: Please enable MWAIT in BIOS SETUP
piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr


Memory map for: ip-10-0-155-36.us-east-2.compute.internal
BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable
SMBIOS 2.7 present.
DMI: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
intel_idle: Please enable MWAIT in BIOS SETUP
piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr


Memory map for: ip-10-0-167-148.us-east-2.compute.internal
BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable
SMBIOS 2.7 present.
DMI: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
intel_idle: Please enable MWAIT in BIOS SETUP
piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr

Comment 6 Andrew McDermott 2019-07-26 08:15:44 UTC
(In reply to Andrew McDermott from comment #5)
> $ type g
> g is a function
> g () 
> { 
>     for i in $workers;
>     do
>         echo "Memory map for: $i";
>         ssh -o StrictHostKeyChecking=no $i dmesg -t | egrep --color=auto
> 'BIOS' | tee /tmp/$i.meminfo;
>         echo;
>     done
> }
> [...]
Just correcting/confirming: the function used for gathering was actually:

[core@ssh-bastion-65fd55cb7f-bff4x tmp]$ type g
g is a function
g () 
{ 
    for i in $workers;
    do
        echo;
        echo "Memory map for: $i";
        ssh -o StrictHostKeyChecking=no $i dmesg -t | egrep --color=auto --color=auto 'BIOS' | tee /tmp/$i;
        echo;
    done
}

Comment 11 Andrew McDermott 2019-07-26 08:40:13 UTC
Trying to narrow this down a bit more, grepping for either 'BIOS' or 'mem' gives:

[core@ssh-bastion-65fd55cb7f-bff4x ~]$ type g
g is a function
g () 
{ 
    for i in $workers;
    do
        echo;
        echo "Memory map for: $i";
        ssh -o StrictHostKeyChecking=no $i dmesg -t | egrep --color=auto --color=auto 'BIOS|mem' | tee /tmp/$i;
        echo;
    done
}

[core@ssh-bastion-65fd55cb7f-bff4x tmp]$ diff3 $workers
====3
1:16c
2:16c
  NODE_DATA(0) allocated [mem 0x20ffd6000-0x20fffffff]
3:16c
  NODE_DATA(0) allocated [mem 0x20ffd5000-0x20fffefff]
====3
1:57c
2:57c
  [TTM] Zone  kernel: Available graphics memory: 4081450 kiB
3:57c
  [TTM] Zone  kernel: Available graphics memory: 4081446 kiB

So perhaps this ^^ is the difference. Hmm.

Comment 12 Robert Krawitz 2019-07-26 14:53:59 UTC
Note that mcelog can offline individual pages based on soft error rates: http://www.mcelog.org/badpageofflining.html

Comment 15 Andrew McDermott 2019-08-05 14:40:33 UTC
I also see differences on GCP instances:

$ kubectl get nodes
NAME                                                    STATUS   ROLES    AGE     VERSION
amcder-9s2hb-m-0.c.openshift-gce-devel.internal         Ready    master   5h10m   v1.14.0+0261aa0df
amcder-9s2hb-m-1.c.openshift-gce-devel.internal         Ready    master   5h10m   v1.14.0+0261aa0df
amcder-9s2hb-m-2.c.openshift-gce-devel.internal         Ready    master   5h10m   v1.14.0+0261aa0df
amcder-9s2hb-w-a-g5cb7.c.openshift-gce-devel.internal   Ready    worker   4h56m   v1.14.0+0261aa0df
amcder-9s2hb-w-b-7sr2x.c.openshift-gce-devel.internal   Ready    worker   4h56m   v1.14.0+0261aa0df
amcder-9s2hb-w-c-s4xsq.c.openshift-gce-devel.internal   Ready    worker   4h56m   v1.14.0+0261aa0df

$ oc get nodes -o json | jq '.items[].status.capacity["memory"]' | sort | uniq -c
     11 "15389264Ki"
     21 "15389280Ki"
      1 "15389988Ki"

$ oc get machinesets --all-namespaces
NAMESPACE               NAME               DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   amcder-9s2hb-w-a   10        10        10      10          5h32m
openshift-machine-api   amcder-9s2hb-w-b   10        10        10      10          5h32m
openshift-machine-api   amcder-9s2hb-w-c   10        10        10      10          5h32m

Comment 18 Alberto 2019-11-06 13:05:33 UTC
This memory discrepancy was affecting the autoscaler; it was fixed by https://github.com/kubernetes/autoscaler/commit/e8b3c2a111eb9b3b4fe15b4d4081175267ff6d76#diff-dfab69cae1cc7f7d024593ed57d7371a.
I think we can treat the deviation as inherent to the cloud provider and close this; please reopen if relevant.
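
The linked autoscaler fix makes node-group comparison tolerate small capacity differences rather than requiring exact equality. A rough shell sketch of that idea (the tolerance value and the `similar_memory` helper are illustrative, not the autoscaler's actual code):

```shell
# Treat two memory capacities (in Ki) as "similar" if they differ by at
# most a small fraction of the larger value. The 1.5% tolerance here is
# illustrative only.
similar_memory() {
    awk -v a="$1" -v b="$2" -v tol=0.015 'BEGIN {
        d = (a > b) ? a - b : b - a
        max = (a > b) ? a : b
        print (d <= tol * max) ? "similar" : "different"
    }'
}

# The two AWS worker values from this bug differ by 8 Ki, well inside
# any reasonable tolerance.
similar_memory 8162892 8162900
```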