Bug 1876956 - Reported topology doesn't match actual host topology
Summary: Reported topology doesn't match actual host topology
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.3
Hardware: ppc64le
OS: Linux
unspecified
urgent
Target Milestone: rc
Target Release: 8.3
Assignee: Michal Privoznik
QA Contact: Dan Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-08 15:07 UTC by Roni
Modified: 2022-05-06 06:48 UTC (History)
19 users

Fixed In Version: libvirt-6.6.0-5.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-17 17:51:44 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (1.38 MB, application/zip)
2020-09-08 15:07 UTC, Roni

Description Roni 2020-09-08 15:07:45 UTC
Created attachment 1714146 [details]
logs

Description of problem:
hosted-engine setup fails in some environments

Version-Release number of selected component (if applicable):
4.4.3.1-0.7.el8ev

How reproducible:
50%

Steps to Reproduce:
1. Deploy hosted-engine using ansible (ovirt-ansible-hosted-engine-setup)
2.
3.

Actual results:
TASK [ovirt.hosted_engine_setup : Fail with error description]
fatal: [lynx01.lab.eng.tlv2.redhat.com]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, deployment errors:   code 505: Host host_mixed_1 installation failed. Failed to configure management network on the host.,    code 9000: Failed to verify Power Management configuration for Host host_mixed_1.,   fix accordingly and re-deploy."}

Expected results:
hosted-engine should successfully be deployed

Additional info:
It does not reproduce on environments named GE-6/8/9,
but reproduces on all others

Comment 9 Avihai 2020-09-10 08:44:54 UTC
Raising to urgent priority as this is blocking installations on NUMA hosts and overloading the QE automation regression suites; we use multiple environments with NUMA hosts, which are now broken due to this issue.

Comment 26 Martin Perina 2020-09-10 11:06:51 UTC
Arik, could someone from the virt team take a look? NUMA was originally an SLA team feature, and unfortunately we know pretty much nothing about it in infra.

Comment 29 Arik 2020-09-10 20:06:50 UTC
due to the reported distances:
'numaNodeDistance': {'0': [10, 40]}

the engine assumes the host has 2 NUMA nodes, but only one node is reported by vdsm:
'numaNodes': {'0': {'totalMemory': '63184', 'hugepages': {'64': {'totalPages': '440509'}, '2048': {'totalPages': '512'}, '1048576': {'totalPages': '2'}}, 'cpus': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}},

that corresponds to what libvirt reports in the 'capabilities':
    <topology>
      <cells num='1'>
        <cell id='0'>
          <memory unit='KiB'>31338304</memory>
          <pages unit='KiB' size='64'>440509</pages>
          <pages unit='KiB' size='2048'>512</pages>
          <pages unit='KiB' size='1048576'>2</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='8' value='40'/>
          </distances>
          <cpus num='64'>
            <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='1' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='2' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='3' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='4' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
            <cpu id='5' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
            <cpu id='6' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
            <cpu id='7' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
            <cpu id='8' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
            <cpu id='9' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
            <cpu id='10' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
            <cpu id='11' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
            <cpu id='12' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
            <cpu id='13' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
            <cpu id='14' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
            <cpu id='15' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
            <cpu id='16' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
            <cpu id='17' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
            <cpu id='18' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
            <cpu id='19' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
            <cpu id='20' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
            <cpu id='21' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
            <cpu id='22' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
            <cpu id='23' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
            <cpu id='24' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
            <cpu id='25' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
            <cpu id='26' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
            <cpu id='27' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
            <cpu id='28' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
            <cpu id='29' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
            <cpu id='30' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
            <cpu id='31' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
            <cpu id='32' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
            <cpu id='33' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
            <cpu id='34' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
            <cpu id='35' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
            <cpu id='36' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
            <cpu id='37' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
            <cpu id='38' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
            <cpu id='39' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
            <cpu id='40' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
            <cpu id='41' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
            <cpu id='42' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
            <cpu id='43' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
            <cpu id='44' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
            <cpu id='45' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
            <cpu id='46' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
            <cpu id='47' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
            <cpu id='48' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
            <cpu id='49' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
            <cpu id='50' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
            <cpu id='51' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
            <cpu id='52' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
            <cpu id='53' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
            <cpu id='54' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
            <cpu id='55' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
            <cpu id='56' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
            <cpu id='57' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
            <cpu id='58' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
            <cpu id='59' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
            <cpu id='60' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
            <cpu id='61' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
            <cpu id='62' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
            <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
          </cpus>
        </cell>
      </cells>
    </topology>

but that topology doesn't reflect the actual topology of the host:

Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2166.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
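The inconsistency described in this comment can be sketched in a few lines (a hypothetical Python sketch, not actual ovirt-engine code; the field names are taken from the vdsm report quoted above): the engine derives the expected node count from the length of the distance vector, while vdsm reports only the cells libvirt enumerated.

```python
# Hypothetical sketch of the engine-side consistency check (not actual
# ovirt-engine code); field names come from the vdsm report quoted above.
vdsm_caps = {
    # distances from node 0 to two nodes -> the engine expects 2 NUMA nodes
    'numaNodeDistance': {'0': [10, 40]},
    # ...but vdsm reports only a single node
    'numaNodes': {'0': {'totalMemory': '63184'}},
}

expected = len(next(iter(vdsm_caps['numaNodeDistance'].values())))
reported = len(vdsm_caps['numaNodes'])
print(expected, reported)  # 2 1 -> the mismatch that breaks the deployment
```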

Comment 32 Michal Skrivanek 2020-09-11 07:41:32 UTC
the original report is not on POWER, so this seems like a general problem

Comment 33 Michal Privoznik 2020-09-11 11:47:13 UTC
I've posted patches here:

https://www.redhat.com/archives/libvir-list/2020-September/msg00655.html

Comment 34 Michal Privoznik 2020-09-11 12:01:37 UTC
Pushed upstream:

551fb778f5 virnuma: Use numa_nodes_ptr when checking available NUMA nodes
a2df82b621 virnuma: Assume numa_bitmask_isbitset() exists

v6.7.0-137-g551fb778f5
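The upstream fix changes which NUMA node mask libvirt consults when enumerating host nodes. Why that matters can be modeled with a minimal Python sketch (an illustration only, not libvirt's actual C code, under the assumption that the failure mode is sparse node IDs as on the POWER9 host above, where the populated nodes are 0 and 8):

```python
# Minimal model (not libvirt's C code) of why sparse NUMA node IDs break
# enumeration that effectively assumes IDs are contiguous starting at 0.
present_nodes = {0, 8}            # populated node IDs on the POWER9 host
node_count = len(present_nodes)   # 2 nodes exist...
max_node = max(present_nodes)     # ...but the highest ID is 8

# Broken: walk IDs 0..node_count-1 -> misses node 8 entirely.
contiguous = [n for n in range(node_count) if n in present_nodes]

# Fixed idea: test every candidate ID against the set of present nodes,
# analogous to numa_bitmask_isbitset(numa_nodes_ptr, node) in libnuma.
via_mask = [n for n in range(max_node + 1) if n in present_nodes]

print(contiguous)  # [0]     -> a single cell reported, as in this bug
print(via_mask)    # [0, 8]  -> both cells reported
```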

Comment 36 Dan Zheng 2020-09-15 02:55:59 UTC
Reproduced with
libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        8
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node250 CPU(s): 
NUMA node251 CPU(s): 
NUMA node252 CPU(s): 
NUMA node253 CPU(s): 
NUMA node254 CPU(s): 
NUMA node255 CPU(s): 

But libvirt virsh capabilities shows:
<topology>
      <cells num='1'>
        <cell id='0'>
          <memory unit='KiB'>263893440</memory>
          <pages unit='KiB' size='64'>4123335</pages>
...
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='8' value='40'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='64'>
            <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
...
            <cpu id='6' socket_id='0' die_id='0' core_id='12' siblings='4-7'/>
...
            <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
          </cpus>
        </cell>
      </cells>
    </topology>

Comment 37 Dan Zheng 2020-09-15 03:03:43 UTC
Verified with a scratch build on the same host as above:

libvirt-6.6.0-5.el8_rc.a43d3b3261.ppc64le


<topology>
      <cells num='8'>
        <cell id='0'>
          <memory unit='KiB'>263893440</memory>
          <pages unit='KiB' size='64'>4123335</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='64'>
            <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
...
            <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
          </cpus>
        </cell>
        <cell id='8'>
          <memory unit='KiB'>268009856</memory>
          <pages unit='KiB' size='64'>4187654</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <pages unit='KiB' size='1048576'>0</pages>
          <distances>
            <sibling id='0' value='40'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='64'>
            <cpu id='64' socket_id='8' die_id='0' core_id='2056' siblings='64-67'/>
            <cpu id='65' socket_id='8' die_id='0' core_id='2056' siblings='64-67'/>
...
            <cpu id='127' socket_id='8' die_id='0' core_id='2140' siblings='124-127'/>
          </cpus>
        </cell>
        <cell id='250'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='251'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
            <sibling id='8' value='80'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='252'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='253'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='254'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
            <sibling id='255' value='80'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
        <cell id='255'>
          <pages unit='KiB' size='64'>0</pages>
          <distances>
            <sibling id='0' value='80'/>
...
            <sibling id='255' value='10'/>
          </distances>
          <cpus num='0'>
          </cpus>
        </cell>
      </cells>
    </topology>

Now the capabilities output aligns with the host topology.
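The manual comparison in this comment (cell ids in virsh capabilities vs. NUMA node ids in lscpu) can also be scripted. A small sketch using Python's standard xml.etree, with the capabilities XML abbreviated to just the cell ids from this comment:

```python
import xml.etree.ElementTree as ET

# Abbreviated capabilities XML from this comment; only the cell ids matter
# for the check, so everything else is elided.
caps_xml = """<topology><cells num='8'>
  <cell id='0'/><cell id='8'/><cell id='250'/><cell id='251'/>
  <cell id='252'/><cell id='253'/><cell id='254'/><cell id='255'/>
</cells></topology>"""

cell_ids = {int(c.get('id')) for c in ET.fromstring(caps_xml).iter('cell')}
lscpu_nodes = {0, 8, 250, 251, 252, 253, 254, 255}  # from the lscpu output
assert cell_ids == lscpu_nodes
print('capabilities cells match host NUMA nodes')
```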

Comment 42 Dan Zheng 2020-09-16 10:05:18 UTC
Package:
libvirt-6.6.0-5.module+el8.3.0+8092+f9e72d7e.ppc64le

# virsh capabilities
...
   <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>31338304</memory>
          <pages unit='KiB' size='64'>489661</pages>
...
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='8' value='40'/>
          </distances>
          <cpus num='64'>
            <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
 ...
            <cpu id='63' socket_id='0' die_id='0' core_id='92' siblings='60-63'/>
          </cpus>
        </cell>
        <cell id='8'>
          <memory unit='KiB'>33363008</memory>
          <pages unit='KiB' size='64'>521297</pages>
...
          <distances>
            <sibling id='0' value='40'/>
            <sibling id='8' value='10'/>
          </distances>
          <cpus num='64'>
            <cpu id='64' socket_id='8' die_id='0' core_id='2048' siblings='64-67'/>
...
            <cpu id='127' socket_id='8' die_id='0' core_id='2140' siblings='124-127'/>
          </cpus>
        </cell>
      </cells>
    </topology>


# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2166.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127

The outputs are consistent, so I am marking this bug as verified.

Comment 45 errata-xmlrpc 2020-11-17 17:51:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

