Bug 1876956

Summary: Reported topology doesn't match actual host topology
Product: Red Hat Enterprise Linux Advanced Virtualization Reporter: Roni <reliezer>
Component: libvirt Assignee: Michal Privoznik <mprivozn>
Status: CLOSED ERRATA QA Contact: Dan Zheng <dzheng>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 8.3 CC: aefrat, ahadas, dholler, dzheng, ehadley, emesika, jdenemar, jmacku, jsuchane, kchamart, khakimi, mavital, mburman, mhou, michal.skrivanek, mperina, mprivozn, mzamazal, virt-maint
Target Milestone: rc Keywords: Automation, AutomationBlocker, Regression, Upstream
Target Release: 8.3   
Hardware: ppc64le   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-6.6.0-5.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-17 17:51:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none

Description Roni 2020-09-08 15:07:45 UTC
Created attachment 1714146 [details]
logs

Description of problem:
hosted-engine setup fails on some environments

Version-Release number of selected component (if applicable):
4.4.3.1-0.7.el8ev

How reproducible:
50%

Steps to Reproduce:
1. Deploy hosted-engine using ansible (ovirt-ansible-hosted-engine-setup)

Actual results:
TASK [ovirt.hosted_engine_setup : Fail with error description]
fatal: [lynx01.lab.eng.tlv2.redhat.com]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, deployment errors:   code 505: Host host_mixed_1 installation failed. Failed to configure management network on the host.,    code 9000: Failed to verify Power Management configuration for Host host_mixed_1.,   fix accordingly and re-deploy."}

Expected results:
hosted-engine should successfully be deployed

Additional info:
It does not reproduce on the environments named GE-6/8/9,
but it reproduces on all others.

Comment 9 Avihai 2020-09-10 08:44:54 UTC
Raising to urgent priority, as this is blocking installation on NUMA nodes and overloading the QE automation regression suites: we use multiple environments with NUMA hosts, which are now broken due to this issue.

Comment 26 Martin Perina 2020-09-10 11:06:51 UTC
Arik, could someone from the virt team take a look? NUMA was originally an SLA team feature, and unfortunately we know almost nothing about it in infra.

Comment 29 Arik 2020-09-10 20:06:50 UTC
due to the reported distances:
'numaNodeDistance': {'0': [10, 40]}

the engine assumes the host has 2 NUMA nodes, but only one node is reported by vdsm:
'numaNodes': {'0': {'totalMemory': '63184', 'hugepages': {'64': {'totalPages': '440509'}, '2048': {'totalPages': '512'}, '1048576': {'totalPages': '2'}}, 'cpus': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}},

that corresponds to what libvirt reports in the 'capabilities':
    <topology>
      <cells num='1'>
        <cell id='0'>
          <memory unit='KiB'>31338304</memory>
          <pages unit='KiB' size='64'>440509</pages>
          <pages unit='KiB' size='2048'>512</pages>
          <pages unit='KiB' size='1048576'>2</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='8' value='40'/>
          </distances>
          <cpus num='64'>
            <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='1' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='2' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='3' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='4' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
            <cpu id='5' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
            <cpu id='6' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
            <cpu id='7' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
            <cpu id='8' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
            <cpu id='9' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
            <cpu id='10' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
            <cpu id='11' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
            <cpu id='12' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
            <cpu id='13' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
            <cpu id='14' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
            <cpu id='15' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
            <cpu id='16' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
            <cpu id='17' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
            <cpu id='18' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
            <cpu id='19' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
            <cpu id='20' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
            <cpu id='21' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
            <cpu id='22' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
            <cpu id='23' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
            <cpu id='24' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
            <cpu id='25' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
            <cpu id='26' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
            <cpu id='27' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
            <cpu id='28' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
            <cpu id='29' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
            <cpu id='30' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
            <cpu id='31' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
            <cpu id='32' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
            <cpu id='33' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
            <cpu id='34' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
            <cpu id='35' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
            <cpu id='36' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
            <cpu id='37' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
            <cpu id='38' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
            <cpu id='39' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
            <cpu id='40' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
            <cpu id='41' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
            <cpu id='42' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
            <cpu id='43' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
            <cpu id='44' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
            <cpu id='45' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
            <cpu id='46' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
            <cpu id='47' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
            <cpu id='48' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
            <cpu id='49' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
            <cpu id='50' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
            <cpu id='51' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
            <cpu id='52' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
            <cpu id='53' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
            <cpu id='54' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
            <cpu id='55' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
            <cpu id='56' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
            <cpu id='57' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
            <cpu id='58' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
            <cpu id='59' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
            <cpu id='60' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
            <cpu id='61' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
            <cpu id='62' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
            <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
          </cpus>
        </cell>
      </cells>
    </topology>

but that topology doesn't reflect the actual topology of the host:

Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2166.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
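
The mismatch described in this comment can be checked mechanically. The following is a minimal Python sketch, not code from vdsm, the engine, or libvirt: it parses a <topology> fragment and lists sibling IDs that appear in a distance table without a matching <cell>. The function name check_topology and the trimmed SAMPLE fragment are illustrative assumptions; real input would be the <topology> element taken from the virsh capabilities output.

```python
import xml.etree.ElementTree as ET


def check_topology(topology_xml: str) -> dict:
    """Compare the cells a <topology> defines with the cells its distances reference."""
    root = ET.fromstring(topology_xml)
    cells = root.find("cells")
    # Number of cells libvirt claims to report (the num attribute).
    declared = int(cells.get("num"))
    # IDs of the <cell> elements that are actually present.
    cell_ids = {int(c.get("id")) for c in cells.findall("cell")}
    # IDs referenced from any <distances>/<sibling> entry.
    sibling_ids = {int(s.get("id")) for s in cells.iter("sibling")}
    return {
        "declared": declared,
        "cell_ids": sorted(cell_ids),
        "missing_cells": sorted(sibling_ids - cell_ids),
    }


# Trimmed-down version of the broken output quoted above (hypothetical sample).
SAMPLE = """
<topology>
  <cells num='1'>
    <cell id='0'>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='8' value='40'/>
      </distances>
      <cpus num='64'/>
    </cell>
  </cells>
</topology>
"""

if __name__ == "__main__":
    print(check_topology(SAMPLE))
```

A non-empty missing_cells list (here it contains node 8) is the same inconsistency the engine ran into: a distance to a node that is never reported as a cell.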

Comment 32 Michal Skrivanek 2020-09-11 07:41:32 UTC
the original report is not on POWER, so this seems like a general problem

Comment 33 Michal Privoznik 2020-09-11 11:47:13 UTC
I've posted patches here:

https://www.redhat.com/archives/libvir-list/2020-September/msg00655.html

Comment 34 Michal Privoznik 2020-09-11 12:01:37 UTC
Pushed upstream:

551fb778f5 virnuma: Use numa_nodes_ptr when checking available NUMA nodes
a2df82b621 virnuma: Assume numa_bitmask_isbitset() exists

v6.7.0-137-g551fb778f5

Comment 36 Dan Zheng 2020-09-15 02:55:59 UTC
Reproduced with
libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        8
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node250 CPU(s): 
NUMA node251 CPU(s): 
NUMA node252 CPU(s): 
NUMA node253 CPU(s): 
NUMA node254 CPU(s): 
NUMA node255 CPU(s): 

But libvirt virsh capabilities shows:
<topology>
      <cells num='1'>
        <cell id='0'>
          <memory unit='KiB'>263893440</memory>
          <pages unit='KiB' size='64'>4123335</pages>
...
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='8' value='40'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='64'>
            <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
...
            <cpu id='6' socket_id='0' die_id='0' core_id='12' siblings='4-7'/>
...
            <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
          </cpus>
        </cell>
      </cells>
    </topology>
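
The comparison done by hand in this comment (lscpu versus virsh capabilities) can be sketched in Python as below. This is only an illustration under stated assumptions: the helper names lscpu_numa_nodes and cells_missing_from_libvirt are hypothetical, the LSCPU sample is trimmed from the output above, and the cell IDs are passed in as a plain list instead of being parsed from XML.

```python
import re

# Matches lines such as "NUMA node8 CPU(s):   64-127"; the summary line
# "NUMA node(s):  8" has no digit right after "node", so it is skipped.
NUMA_LINE = re.compile(r"^NUMA node(\d+) CPU\(s\):\s*(.*)$")


def lscpu_numa_nodes(lscpu_text: str) -> dict:
    """Map each NUMA node ID in lscpu output to its CPU list (may be empty)."""
    nodes = {}
    for line in lscpu_text.splitlines():
        match = NUMA_LINE.match(line.strip())
        if match:
            nodes[int(match.group(1))] = match.group(2).strip()
    return nodes


def cells_missing_from_libvirt(lscpu_text: str, cell_ids) -> list:
    """Return node IDs that lscpu lists but the capabilities cells do not."""
    return sorted(set(lscpu_numa_nodes(lscpu_text)) - set(cell_ids))


# Trimmed sample of the lscpu output quoted above.
LSCPU = """\
NUMA node(s):        8
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node250 CPU(s):
NUMA node251 CPU(s):
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
"""

if __name__ == "__main__":
    # The broken build reported only cell 0.
    print(cells_missing_from_libvirt(LSCPU, [0]))
```

Note that the node IDs on this host are sparse (0, 8, 250-255) and several nodes have no CPUs, so the sketch keys on the IDs themselves and does not assume a contiguous range.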

Comment 37 Dan Zheng 2020-09-15 03:03:43 UTC
Verified with a scratch build on the same host as above:

libvirt-6.6.0-5.el8_rc.a43d3b3261.ppc64le


<topology>
      <cells num='8'>
        <cell id='0'>
          <memory unit='KiB'>263893440</memory>
          <pages unit='KiB' size='64'>4123335</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='64'>
            <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
...
            <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
          </cpus>
        </cell>
        <cell id='8'>
          <memory unit='KiB'>268009856</memory>
          <pages unit='KiB' size='64'>4187654</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='40'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='64'>
            <cpu id='64' socket_id='8' die_id='0' core_id='2056' siblings='64-67'/>
            <cpu id='65' socket_id='8' die_id='0' core_id='2056' siblings='64-67'/>
...
            <cpu id='127' socket_id='8' die_id='0' core_id='2140' siblings='124-127'/>
          </cpus>
        </cell>
        <cell id='250'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='251'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
            <sibling id='8' value='80'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='252'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='253'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='254'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='255'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
            <sibling id='255' value='10'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
      </cells>
    </topology>

Now the capabilities output matches the host topology.

Comment 42 Dan Zheng 2020-09-16 10:05:18 UTC
Package:
libvirt-6.6.0-5.module+el8.3.0+8092+f9e72d7e.ppc64le

# virsh capabilities
...
   <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>31338304</memory>
          <pages unit='KiB' size='64'>489661</pages>
...
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='8' value='40'/>
          </distances>
          <cpus num='64'>
            <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
 ...
            <cpu id='63' socket_id='0' die_id='0' core_id='92' siblings='60-63'/>
          </cpus>
        </cell>
        <cell id='8'>
          <memory unit='KiB'>33363008</memory>
          <pages unit='KiB' size='64'>521297</pages>
...
          <distances>
            <sibling id='0' value='40'/>
            <sibling id='8' value='10'/>
          </distances>
          <cpus num='64'>
            <cpu id='64' socket_id='8' die_id='0' core_id='2048' siblings='64-67'/>
...
            <cpu id='127' socket_id='8' die_id='0' core_id='2140' siblings='124-127'/>
          </cpus>
        </cell>
      </cells>
    </topology>


# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2166.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127

The outputs are consistent, so I am marking this as verified.

Comment 45 errata-xmlrpc 2020-11-17 17:51:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137