Created attachment 1714146 [details]
logs

Description of problem:
hosted-engine setup fails on some environments

Version-Release number of selected component (if applicable):
4.4.3.1-0.7.el8ev

How reproducible:
50%

Steps to Reproduce:
1. Deploy hosted-engine using ansible (ovirt-ansible-hosted-engine-setup)

Actual results:
TASK [ovirt.hosted_engine_setup : Fail with error description]
fatal: [lynx01.lab.eng.tlv2.redhat.com]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, deployment errors: code 505: Host host_mixed_1 installation failed. Failed to configure management network on the host., code 9000: Failed to verify Power Management configuration for Host host_mixed_1., fix accordingly and re-deploy."}

Expected results:
hosted-engine should be deployed successfully

Additional info:
It does not reproduce on the environments named GE-6/8/9, but reproduces on all others.
Raising to urgent priority as this is blocking NUMA host installations and overloading the QE automation regression suites: we use multiple environments with NUMA hosts, which are now broken due to this issue.
Arik, could someone from the virt team take a look? NUMA was originally an SLA team feature and, unfortunately, we know pretty much nothing about it in infra.
Due to the reported distances:

  'numaNodeDistance': {'0': [10, 40]}

the engine assumes the host has 2 NUMA nodes, but only one node is reported by vdsm:

  'numaNodes': {'0': {'totalMemory': '63184', 'hugepages': {'64': {'totalPages': '440509'}, '2048': {'totalPages': '512'}, '1048576': {'totalPages': '2'}}, 'cpus': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}}

That corresponds to what libvirt reports in the 'capabilities':

<topology>
  <cells num='1'>
    <cell id='0'>
      <memory unit='KiB'>31338304</memory>
      <pages unit='KiB' size='64'>440509</pages>
      <pages unit='KiB' size='2048'>512</pages>
      <pages unit='KiB' size='1048576'>2</pages>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='8' value='40'/>
      </distances>
      <cpus num='64'>
        <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
        <cpu id='1' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
        <cpu id='2' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
        <cpu id='3' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
        <cpu id='4' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
        <cpu id='5' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
        <cpu id='6' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
        <cpu id='7' socket_id='0' die_id='0' core_id='4' siblings='4-7'/>
        <cpu id='8' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
        <cpu id='9' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
        <cpu id='10' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
        <cpu id='11' socket_id='0' die_id='0' core_id='8' siblings='8-11'/>
        <cpu id='12' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
        <cpu id='13' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
        <cpu id='14' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
        <cpu id='15' socket_id='0' die_id='0' core_id='12' siblings='12-15'/>
        <cpu id='16' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
        <cpu id='17' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
        <cpu id='18' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
        <cpu id='19' socket_id='0' die_id='0' core_id='24' siblings='16-19'/>
        <cpu id='20' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
        <cpu id='21' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
        <cpu id='22' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
        <cpu id='23' socket_id='0' die_id='0' core_id='28' siblings='20-23'/>
        <cpu id='24' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
        <cpu id='25' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
        <cpu id='26' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
        <cpu id='27' socket_id='0' die_id='0' core_id='40' siblings='24-27'/>
        <cpu id='28' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
        <cpu id='29' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
        <cpu id='30' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
        <cpu id='31' socket_id='0' die_id='0' core_id='44' siblings='28-31'/>
        <cpu id='32' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
        <cpu id='33' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
        <cpu id='34' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
        <cpu id='35' socket_id='0' die_id='0' core_id='48' siblings='32-35'/>
        <cpu id='36' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
        <cpu id='37' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
        <cpu id='38' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
        <cpu id='39' socket_id='0' die_id='0' core_id='52' siblings='36-39'/>
        <cpu id='40' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
        <cpu id='41' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
        <cpu id='42' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
        <cpu id='43' socket_id='0' die_id='0' core_id='56' siblings='40-43'/>
        <cpu id='44' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
        <cpu id='45' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
        <cpu id='46' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
        <cpu id='47' socket_id='0' die_id='0' core_id='60' siblings='44-47'/>
        <cpu id='48' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
        <cpu id='49' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
        <cpu id='50' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
        <cpu id='51' socket_id='0' die_id='0' core_id='72' siblings='48-51'/>
        <cpu id='52' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
        <cpu id='53' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
        <cpu id='54' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
        <cpu id='55' socket_id='0' die_id='0' core_id='76' siblings='52-55'/>
        <cpu id='56' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
        <cpu id='57' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
        <cpu id='58' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
        <cpu id='59' socket_id='0' die_id='0' core_id='80' siblings='56-59'/>
        <cpu id='60' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
        <cpu id='61' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
        <cpu id='62' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
        <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
      </cpus>
    </cell>
  </cells>
</topology>

But that topology doesn't reflect the actual topology of the host:

Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2166.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
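The inconsistency the engine trips over can be shown mechanically by parsing the capabilities XML above: the <distances> block references a node that never appears as a <cell>. This is a hypothetical standalone check, not oVirt or vdsm code; the XML is a trimmed copy of the output quoted in this comment.

```python
import xml.etree.ElementTree as ET

# Trimmed <topology> as libvirt reported it on the affected host:
# a single <cell>, but <distances> references a second node (id 8).
capabilities = """
<topology>
  <cells num='1'>
    <cell id='0'>
      <memory unit='KiB'>31338304</memory>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='8' value='40'/>
      </distances>
    </cell>
  </cells>
</topology>
"""

root = ET.fromstring(capabilities)
cell_ids = {c.get('id') for c in root.iter('cell')}
sibling_ids = {s.get('id') for s in root.iter('sibling')}

print(sorted(cell_ids))        # only node 0 is enumerated as a cell
print(sorted(sibling_ids))     # but the distance table mentions 0 and 8
print(sibling_ids - cell_ids)  # node 8 is referenced yet never described
```

The engine derives the node count from the distance table while vdsm reports the cells, hence the "2 nodes vs. 1 node" disagreement.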
the original report is not on POWER, so this seems like a general problem
I've posted patches here: https://www.redhat.com/archives/libvir-list/2020-September/msg00655.html
Pushed upstream:

551fb778f5 virnuma: Use numa_nodes_ptr when checking available NUMA nodes
a2df82b621 virnuma: Assume numa_bitmask_isbitset() exists

v6.7.0-137-g551fb778f5
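To picture the class of bug these commits address: on this POWER9 host the NUMA node IDs are sparse (0 and 8 exist, 1-7 do not), so any enumeration that assumes contiguous IDs from 0 upward under-counts the nodes. A rough Python model follows; the real fix is C code in virnuma using libnuma's numa_nodes_ptr bitmask, and the functions below are illustrative, not libvirt's actual logic.

```python
# Node IDs on the affected host are sparse: only 0 and 8 are present.
present_nodes = {0, 8}          # models libnuma's numa_nodes_ptr bitmask
max_node = max(present_nodes)   # models numa_max_node()

def nodes_contiguous_assumption():
    """Enumeration that treats the first absent ID as the end of the
    node list -- sparse IDs truncate the result after node 0."""
    found = []
    for node in range(max_node + 1):
        if node not in present_nodes:
            break               # node 1 is absent -> enumeration stops early
        found.append(node)
    return found

def nodes_bitmask_check():
    """Enumeration that consults the availability bitmask per node,
    i.e. numa_bitmask_isbitset(numa_nodes_ptr, node), skipping gaps."""
    return [n for n in range(max_node + 1) if n in present_nodes]

print(nodes_contiguous_assumption())  # [0]    -> a one-cell topology
print(nodes_bitmask_check())          # [0, 8] -> both cells reported
```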
Reproduced with libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        8
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node250 CPU(s):
NUMA node251 CPU(s):
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):

But `virsh capabilities` shows:

<topology>
  <cells num='1'>
    <cell id='0'>
      <memory unit='KiB'>263893440</memory>
      <pages unit='KiB' size='64'>4123335</pages>
      ...
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='8' value='40'/>
        ...
        <sibling id='255' value='80'/>
      </distances>
      <cpus num='64'>
        <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
        ...
        <cpu id='6' socket_id='0' die_id='0' core_id='12' siblings='4-7'/>
        ...
        <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
      </cpus>
    </cell>
  </cells>
</topology>
Verified with a scratch build on the same host as above: libvirt-6.6.0-5.el8_rc.a43d3b3261.ppc64le

<topology>
  <cells num='8'>
    <cell id='0'>
      <memory unit='KiB'>263893440</memory>
      <pages unit='KiB' size='64'>4123335</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='10'/>
        ...
        <sibling id='255' value='80'/>
      </distances>
      <cpus num='64'>
        <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
        ...
        <cpu id='63' socket_id='0' die_id='0' core_id='84' siblings='60-63'/>
      </cpus>
    </cell>
    <cell id='8'>
      <memory unit='KiB'>268009856</memory>
      <pages unit='KiB' size='64'>4187654</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='40'/>
        ...
        <sibling id='255' value='80'/>
      </distances>
      <cpus num='64'>
        <cpu id='64' socket_id='8' die_id='0' core_id='2056' siblings='64-67'/>
        <cpu id='65' socket_id='8' die_id='0' core_id='2056' siblings='64-67'/>
        ...
        <cpu id='127' socket_id='8' die_id='0' core_id='2140' siblings='124-127'/>
      </cpus>
    </cell>
    <cell id='250'>
      <pages unit='KiB' size='64'>0</pages>
      <distances>
        <sibling id='0' value='80'/>
        ...
      </distances>
      <cpus num='0'>
      </cpus>
    </cell>
    <cell id='251'>
      <pages unit='KiB' size='64'>0</pages>
      <distances>
        <sibling id='0' value='80'/>
        <sibling id='8' value='80'/>
        ...
        <sibling id='255' value='80'/>
      </distances>
      <cpus num='0'>
      </cpus>
    </cell>
    <cell id='252'>
      <pages unit='KiB' size='64'>0</pages>
      <distances>
        <sibling id='0' value='80'/>
        ...
        <sibling id='255' value='80'/>
      </distances>
      <cpus num='0'>
      </cpus>
    </cell>
    <cell id='253'>
      <pages unit='KiB' size='64'>0</pages>
      <distances>
        <sibling id='0' value='80'/>
        ...
        <sibling id='255' value='80'/>
      </distances>
      <cpus num='0'>
      </cpus>
    </cell>
    <cell id='254'>
      <pages unit='KiB' size='64'>0</pages>
      <distances>
        <sibling id='0' value='80'/>
        ...
        <sibling id='255' value='80'/>
      </distances>
      <cpus num='0'>
      </cpus>
    </cell>
    <cell id='255'>
      <pages unit='KiB' size='64'>0</pages>
      <distances>
        <sibling id='0' value='80'/>
        ...
        <sibling id='255' value='10'/>
      </distances>
      <cpus num='0'>
      </cpus>
    </cell>
  </cells>
</topology>

Now the capabilities output aligns with the host topology.
Package: libvirt-6.6.0-5.module+el8.3.0+8092+f9e72d7e.ppc64le

# virsh capabilities
...
<topology>
  <cells num='2'>
    <cell id='0'>
      <memory unit='KiB'>31338304</memory>
      <pages unit='KiB' size='64'>489661</pages>
      ...
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='8' value='40'/>
      </distances>
      <cpus num='64'>
        <cpu id='0' socket_id='0' die_id='0' core_id='8' siblings='0-3'/>
        ...
        <cpu id='63' socket_id='0' die_id='0' core_id='92' siblings='60-63'/>
      </cpus>
    </cell>
    <cell id='8'>
      <memory unit='KiB'>33363008</memory>
      <pages unit='KiB' size='64'>521297</pages>
      ...
      <distances>
        <sibling id='0' value='40'/>
        <sibling id='8' value='10'/>
      </distances>
      <cpus num='64'>
        <cpu id='64' socket_id='8' die_id='0' core_id='2048' siblings='64-67'/>
        ...
        <cpu id='127' socket_id='8' die_id='0' core_id='2140' siblings='124-127'/>
      </cpus>
    </cell>
  </cells>
</topology>

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2166.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127

The outputs are consistent, so I am marking this bug as verified.
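The by-eye comparison above (virsh cells vs. lscpu NUMA nodes) can also be expressed as a small mechanical check. This is a hypothetical helper operating on trimmed copies of the two outputs quoted in this comment, not part of any verification tooling.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed capabilities from the fixed build: both cells are present now.
capabilities = """
<topology>
  <cells num='2'>
    <cell id='0'><cpus num='64'/></cell>
    <cell id='8'><cpus num='64'/></cell>
  </cells>
</topology>
"""

# Trimmed lscpu lines from the same host.
lscpu = """\
NUMA node(s):        2
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
"""

virsh_nodes = {int(c.get('id')) for c in ET.fromstring(capabilities).iter('cell')}
lscpu_nodes = {int(m) for m in re.findall(r'NUMA node(\d+) CPU', lscpu)}

print(virsh_nodes == lscpu_nodes)  # libvirt and the kernel now agree on {0, 8}
```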
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137