Description of problem:
virsh capabilities show incorrect topology info and cell id

Version-Release number of selected component (if applicable):
libvirt-6.0.0-14.module+el8.2.0+6069+78a1cb09.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. 'lscpu' show 2 sockets on the host

# lscpu
Architecture:         ppc64le
Byte Order:           Little Endian
CPU(s):               160
On-line CPU(s) list:  0-159
Thread(s) per core:   4
Core(s) per socket:   20
Socket(s):            2
NUMA node(s):         2
Model:                2.2 (pvr 004e 1202)
Model name:           POWER9, altivec supported
CPU max MHz:          3800.0000
CPU min MHz:          2166.0000
L1d cache:            32K
L1i cache:            32K
L2 cache:             512K
L3 cache:             10240K
NUMA node0 CPU(s):    0-79
NUMA node8 CPU(s):    80-159

2. 'virsh capabilities' show '1' socket in cpu topology.

# virsh capabilities
<capabilities>
  <host>
    <uuid>7714c83d-2719-4a89-8119-156778df62e7</uuid>
    <cpu>
      <arch>ppc64le</arch>
      <model>POWER9</model>
      <vendor>IBM</vendor>
      <topology sockets='1' dies='1' cores='20' threads='4'/>   <=== incorrect sockets number
      ...
    </cpu>

3. numa cell show incorrect number.

    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>263716608</memory>
          <cpus num='80'>
            <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='1' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            ...
        </cell>
        <cell id='0'>   <=== incorrect id, should be '1'
          <memory unit='KiB'>263716608</memory>
          <cpus num='80'>
            <cpu id='80' socket_id='0' die_id='0' core_id='0' siblings='80-83'/>
            <cpu id='81' socket_id='0' die_id='0' core_id='0' siblings='80-83'/>
            <cpu id='82' socket_id='0' die_id='0' core_id='0' siblings='80-83'/>
            <cpu id='83' socket_id='0' die_id='0' core_id='0' siblings='80-83'/>
            ...

Actual results:
See above

Expected results:
1. topology socket number should be same with host info shown in lscpu
2. cell id should be correct.

Additional info:
There are 2 expected results, which led to 2 unrelated investigations. I'll post them in separate comments for easier reference later.

"1. topology socket number should be same with host info shown in lscpu"

I consider this to be a documentation issue. Libvirt is not reporting the total number of sockets of the host in the topology XML. It is reporting the number of sockets per NUMA node. This is misleading because the Libvirt documentation states:

(https://libvirt.org/formatdomain.html)

    topology
    The topology element specifies requested topology of virtual CPU provided to the guest. Four attributes, sockets, dies, cores, and threads, accept non-zero positive integer values. They refer *to the total number of CPU sockets,* (...)

Given that wording, the user will expect the value to match the output of lscpu. I've sent a patch to amend the documentation to state that we're reporting the sockets per node:

https://www.redhat.com/archives/libvir-list/2020-April/msg00005.html

Aside from that, I don't think there's much to be done. I experimented with changing 'sockets' to report the total sockets of the host, but in the end this change would break many guests that have more than one NUMA node, which would end up with inconsistent topologies.

A better fix would be an extra element or attribute to report the total socket number, but I'm unsure if it's worth the trouble. As long as the user is informed that the existing 'sockets' value represents sockets per NUMA node, the user can infer the total number of sockets, given that we're also reporting the number of NUMA nodes.
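The inference described at the end of the comment above can be sketched as follows. This is only an illustration, not part of Libvirt: the helper name total_host_sockets and the trimmed sample XML are hypothetical, and the sketch assumes that the 'sockets' attribute under <cpu> is the number of sockets per NUMA node and that every NUMA node holds the same number of sockets.

```python
import xml.etree.ElementTree as ET


def total_host_sockets(capabilities_xml):
    """Estimate the total host sockets from 'virsh capabilities' XML.

    Assumption: host/cpu/topology@sockets is sockets per NUMA node, and
    every NUMA node has the same number of sockets.
    """
    root = ET.fromstring(capabilities_xml)
    sockets_per_node = int(root.find("./host/cpu/topology").get("sockets"))
    numa_nodes = int(root.find("./host/topology/cells").get("num"))
    return sockets_per_node * numa_nodes


# Trimmed-down XML modeled on the report (most elements omitted).
SAMPLE = """
<capabilities>
  <host>
    <cpu>
      <topology sockets='1' dies='1' cores='20' threads='4'/>
    </cpu>
    <topology>
      <cells num='2'/>
    </topology>
  </host>
</capabilities>
"""

print(total_host_sockets(SAMPLE))  # prints 2, matching "Socket(s): 2" in lscpu
```

On hosts where NUMA nodes do not map evenly onto sockets, this multiplication would not hold, which is one reason a dedicated attribute might be the cleaner long-term answer.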
About this expected result:

"2. cell id should be correct."

I've reproduced this behavior on a Power 8 server. Looking at the Libvirt logs, I noticed these messages:

2020-04-02 12:14:39.540+0000: 410848: error : virFileReadValueUint:4118 : internal error: Invalid unsigned integer value '-1' in file '/sys/devices/system/cpu/cpu0/topology/die_id'
2020-04-02 12:14:39.540+0000: 410848: warning : virCapabilitiesHostNUMANewHost:1725 : Failed to query host NUMA topology, faking single NUMA node

What is happening here is that Libvirt is taking the "fake NUMA node" path, in which Libvirt creates a fake NUMA node, with <cell id="0">, when numactl isn't present. In this case, however, the fallback is taken because the call to virCapabilitiesHostNUMAInitReal() is failing. Long story short, the reason is here:

2020-04-02 12:14:39.540+0000: 410848: error : virFileReadValueUint:4118 : internal error: Invalid unsigned integer value '-1' in file '/sys/devices/system/cpu/cpu0/topology/die_id'

I've fixed this bug upstream already in a different context. Here's the commit:

commit 0137bf0dab2738d5443e2f407239856e2aa25bb3
Author: Daniel Henrique Barboza <danielhb413>
Date:   Mon Mar 16 21:01:34 2020 -0300

    virhostcpu.c: fix 'die_id' parsing for Power hosts

v6.1.0-164-g0137bf0dab

I've verified that backporting this commit into the libvirt-6.0.0-14 codebase solves this problem. This fix is present in the upcoming community libvirt-6.2.0 as well, so I believe we can get the fix downstream via rebase.
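The failure mode in the log above can be illustrated with a short sketch. It is written in Python for illustration and is not Libvirt's C code: both function names are hypothetical, and mapping a negative die_id to die 0 is an assumption made for this sketch rather than a transcription of the upstream commit.

```python
def parse_die_id_strict(text):
    """Mimic an unsigned-only parser: a value such as '-1' is rejected."""
    value = int(text.strip())
    if value < 0:
        raise ValueError("Invalid unsigned integer value '%s'" % text.strip())
    return value


def parse_die_id_tolerant(text):
    """Parse as a signed integer; treat a negative value as die 0.

    Assumption for this sketch: a negative die_id means "not reported".
    """
    value = int(text.strip())
    return value if value >= 0 else 0


# Contents as they might be read from .../topology/die_id
for raw in ("0\n", "-1\n"):
    try:
        strict = parse_die_id_strict(raw)
    except ValueError as exc:
        strict = "error: %s" % exc
    print(repr(raw), "| strict:", strict, "| tolerant:", parse_die_id_tolerant(raw))
```

With the strict variant, a single unreadable die_id aborts the whole topology query, which is how one bad sysfs value ends up producing the single fake NUMA cell.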
David Gibson suggested that this bug should be split into two, since there are 2 problems with 2 trackable solutions. The 'wrong topology info' problem is now being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1820376. The bug title was changed to reflect that this bug will focus on the NUMA cell id problem.
Fixed upstream with this commit:

commit 0137bf0dab2738d5443e2f407239856e2aa25bb3
Author: Daniel Henrique Barboza <danielhb413>
Date:   Mon Mar 16 21:01:34 2020 -0300

    virhostcpu.c: fix 'die_id' parsing for Power hosts

v6.1.0-164-g0137bf0dab
Package: libvirt-6.3.0-1.module+el8.3.0+6478+69f490bb.ppc64le

# lscpu
Architecture:         ppc64le
Byte Order:           Little Endian
CPU(s):               160
On-line CPU(s) list:  0-159
Thread(s) per core:   4
Core(s) per socket:   20
Socket(s):            2
NUMA node(s):         2
Model:                2.2 (pvr 004e 1202)
Model name:           POWER9, altivec supported
...
NUMA node0 CPU(s):    0-79
NUMA node8 CPU(s):    80-159

# virsh capabilities
<capabilities>
  <host>
    <uuid>0c227238-ef50-4736-8ced-470904e8c7d2</uuid>
    <cpu>
      <arch>ppc64le</arch>
      <model>POWER9</model>
      <vendor>IBM</vendor>
      <topology sockets='1' dies='1' cores='20' threads='4'/>
      <pages unit='KiB' size='64'/>
      ...
    </cpu>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>129794944</memory>
          <pages unit='KiB' size='64'>2028046</pages>
          ...
            <cpu id='79' socket_id='0' die_id='0' core_id='84' siblings='76-79'/>
          </cpus>
        </cell>
        <cell id='8'>
          <memory unit='KiB'>133912704</memory>
          <pages unit='KiB' size='64'>2092386</pages>
          ...
      </cells>
    </topology>

The cell ids are now displayed correctly and match the NUMA node numbers reported by lscpu (0 and 8), so I am marking this bug as verified.
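The manual check above could also be scripted. The following is a minimal sketch, not an existing Libvirt tool: the function name duplicate_cell_ids and both sample documents are hypothetical and heavily trimmed, and the sketch only flags repeated cell ids in the capabilities XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_cell_ids(capabilities_xml):
    """Return the sorted list of NUMA cell ids that appear more than once."""
    root = ET.fromstring(capabilities_xml)
    ids = [cell.get("id") for cell in root.iterfind("./host/topology/cells/cell")]
    return sorted(cell_id for cell_id, count in Counter(ids).items() if count > 1)


TEMPLATE = """
<capabilities>
  <host>
    <topology>
      <cells num='2'>
        <cell id='{first}'/>
        <cell id='{second}'/>
      </cells>
    </topology>
  </host>
</capabilities>
"""

BROKEN = TEMPLATE.format(first="0", second="0")  # shape of the original report
FIXED = TEMPLATE.format(first="0", second="8")   # shape of the verified output

print(duplicate_cell_ids(BROKEN))  # prints ['0']
print(duplicate_cell_ids(FIXED))   # prints []
```

A stricter check would compare the cell ids against the node directories under /sys/devices/system/node, but that requires access to the host and is left out of this sketch.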
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5137