Bug 1816037 - virsh capabilities show incorrect topology cell id
Summary: virsh capabilities show incorrect topology cell id
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.2
Hardware: ppc64le
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Daniel Henrique Barboza (IBM)
QA Contact: Dan Zheng
Reported: 2020-03-23 08:23 UTC by Dan Zheng
Modified: 2020-11-17 17:48 UTC

Fixed In Version: libvirt-6.2.0-1.el8
Last Closed: 2020-11-17 17:47:42 UTC
Type: Bug

Description Dan Zheng 2020-03-23 08:23:03 UTC
Description of problem:
'virsh capabilities' shows incorrect CPU topology information and an incorrect NUMA cell id.

Version-Release number of selected component (if applicable):
libvirt-6.0.0-14.module+el8.2.0+6069+78a1cb09.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. 'lscpu' shows 2 sockets on the host.
# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              160
On-line CPU(s) list: 0-159
Thread(s) per core:  4
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2166.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-79
NUMA node8 CPU(s):   80-159

2. 'virsh capabilities' shows 1 socket in the CPU topology.
# virsh capabilities
<capabilities>

  <host>
    <uuid>7714c83d-2719-4a89-8119-156778df62e7</uuid>
    <cpu>
      <arch>ppc64le</arch>
      <model>POWER9</model>
      <vendor>IBM</vendor>
      <topology sockets='1' dies='1' cores='20' threads='4'/>   <=== incorrect sockets number
...
    </cpu>


3. The second NUMA cell shows an incorrect id.
<topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>263716608</memory>
          <cpus num='80'>
            <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>
            <cpu id='1' socket_id='0' die_id='0' core_id='0' siblings='0-3'/>

...
        </cell>
        <cell id='0'>         <=== incorrect id, should be '1'
          <memory unit='KiB'>263716608</memory>
          <cpus num='80'>
            <cpu id='80' socket_id='0' die_id='0' core_id='0' siblings='80-83'/>
            <cpu id='81' socket_id='0' die_id='0' core_id='0' siblings='80-83'/>
            <cpu id='82' socket_id='0' die_id='0' core_id='0' siblings='80-83'/>
            <cpu id='83' socket_id='0' die_id='0' core_id='0' siblings='80-83'/>
...

Actual results:
See above

Expected results:
1. topology socket number should be same with host info shown in lscpu
2. cell id should be correct.

Additional info:

Comment 1 Daniel Henrique Barboza (IBM) 2020-04-02 12:52:28 UTC
The report lists 2 expected results, which led to 2 unrelated investigations. I'll
post them in separate comments for easier reference later.


"1. topology socket number should be same with host info shown in lscpu"

I consider this to be a documentation issue. Libvirt is not reporting the total number
of sockets on the host in the topology XML. It is reporting the number of sockets per
NUMA node. This is misleading because the Libvirt documentation states:


(https://libvirt.org/formatdomain.html)
topology

The topology element specifies requested topology of virtual CPU provided to the guest.
Four attributes, sockets, dies, cores, and threads, accept non-zero positive integer values.
They refer *to the total number of CPU sockets,* (...)


As a result, the user expects the value to match the output of lscpu. I've sent a patch
to amend the documentation to state that we're reporting the sockets per NUMA node:


https://www.redhat.com/archives/libvir-list/2020-April/msg00005.html

Aside from that I don't think there's much to be done. I experimented with changing
'sockets' to report the total sockets of the host, but that change would break
many guests that have more than one NUMA node, leaving them with inconsistent
topologies. A better fix would be an extra element or attribute that reports the
total socket count, but I'm unsure whether it's worth the trouble.

As long as the user is informed that the existing 'sockets' value represents sockets
per NUMA node, the user can infer the total number of sockets, given that we're also
reporting the number of NUMA nodes.
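
The inference described above can be sketched in a few lines of Python. This is only an illustration, not part of libvirt: the function name total_host_sockets is hypothetical, the sample XML is trimmed from the output in this report, and the calculation assumes every NUMA node holds the same number of sockets.

```python
import xml.etree.ElementTree as ET


def total_host_sockets(capabilities_xml):
    """Estimate host sockets as (sockets per NUMA node) * (number of NUMA nodes).

    Assumption: every NUMA node holds the same number of sockets.
    """
    root = ET.fromstring(capabilities_xml.strip())
    cpu_topology = root.find("./host/cpu/topology")
    cells = root.find("./host/topology/cells")
    sockets_per_node = int(cpu_topology.get("sockets"))
    numa_nodes = int(cells.get("num"))
    return sockets_per_node * numa_nodes


# Trimmed sample based on the 'virsh capabilities' output in this report.
SAMPLE = """
<capabilities>
  <host>
    <cpu>
      <topology sockets='1' dies='1' cores='20' threads='4'/>
    </cpu>
    <topology>
      <cells num='2'/>
    </topology>
  </host>
</capabilities>
"""

if __name__ == "__main__":
    print(total_host_sockets(SAMPLE))  # 1 socket per node * 2 nodes = 2
```

For the host in this report the result is 2, which matches the "Socket(s)" line of lscpu.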

Comment 2 Daniel Henrique Barboza (IBM) 2020-04-02 13:10:21 UTC
About this expected result:

"2. cell id should be correct."

I've reproduced this behavior on a POWER8 server. In the Libvirt logs I noticed these
messages:


2020-04-02 12:14:39.540+0000: 410848: error : virFileReadValueUint:4118 : internal error:
Invalid unsigned integer value '-1' in file '/sys/devices/system/cpu/cpu0/topology/die_id'
2020-04-02 12:14:39.540+0000: 410848: warning : virCapabilitiesHostNUMANewHost:1725 : Failed to query
host NUMA topology, faking single NUMA node 


Libvirt is taking its "fake NUMA node" fallback path, in which it creates a fake NUMA
node with <cell id="0">. That path is normally used when numactl isn't present. In this
case, however, it is taken because the call to virCapabilitiesHostNUMAInitReal() fails.
Long story short, the reason is this error:


2020-04-02 12:14:39.540+0000: 410848: error : virFileReadValueUint:4118 : internal error:
Invalid unsigned integer value '-1' in file '/sys/devices/system/cpu/cpu0/topology/die_id'



I've fixed this bug upstream already in a different context. Here's the commit:



commit 0137bf0dab2738d5443e2f407239856e2aa25bb3
Author: Daniel Henrique Barboza <danielhb413>
Date:   Mon Mar 16 21:01:34 2020 -0300

    virhostcpu.c: fix 'die_id' parsing for Power hosts

v6.1.0-164-g0137bf0dab



I've verified that backporting this commit into the libvirt-6.0.0-14 codebase solves
this problem. This fix is present in the upcoming community libvirt-6.2.0 as well, so I
believe we can get the fix downstream via rebase.
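
To illustrate the failure mode, here is a minimal Python sketch of a reader that tolerates a die_id of '-1'. It is not the upstream patch, which is C code in virhostcpu.c. The helper name read_die_id is hypothetical, and mapping a missing file or a negative value to die 0 is an assumption chosen to be consistent with the die_id='0' values in the capabilities output in this report.

```python
import os
import tempfile


def read_die_id(path):
    """Read a sysfs-style die_id file as a signed integer.

    Assumption: a missing file or a negative value is reported as die 0.
    """
    try:
        with open(path) as f:
            value = int(f.read().strip())
    except FileNotFoundError:
        # The file may be absent; fall back to a single die.
        return 0
    # Parsing '-1' as unsigned is what triggered the error in the log above.
    return max(value, 0)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "die_id")
        with open(path, "w") as f:
            f.write("-1\n")
        print(read_die_id(path))  # prints 0
```

The point of the sketch is that the value is parsed as signed first and normalized afterwards, so a '-1' no longer aborts the NUMA topology query.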

Comment 3 Daniel Henrique Barboza (IBM) 2020-04-02 22:37:27 UTC
David Gibson suggested that this bug should be split in two, since there are 2 problems
with 2 separately trackable solutions.

The 'wrong topology info' is now being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1820376.

The bug title was changed to reflect that this bug now focuses on the NUMA cell id
problem.

Comment 4 Daniel Henrique Barboza (IBM) 2020-04-02 22:38:53 UTC
Fixed upstream with this commit:


commit 0137bf0dab2738d5443e2f407239856e2aa25bb3
Author: Daniel Henrique Barboza <danielhb413>
Date:   Mon Mar 16 21:01:34 2020 -0300

    virhostcpu.c: fix 'die_id' parsing for Power hosts

v6.1.0-164-g0137bf0dab

Comment 7 Dan Zheng 2020-05-11 13:11:04 UTC
Package:
libvirt-6.3.0-1.module+el8.3.0+6478+69f490bb.ppc64le

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              160
On-line CPU(s) list: 0-159
Thread(s) per core:  4
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
...
NUMA node0 CPU(s):   0-79
NUMA node8 CPU(s):   80-159

# virsh capabilities
<capabilities>

  <host>
    <uuid>0c227238-ef50-4736-8ced-470904e8c7d2</uuid>
    <cpu>
      <arch>ppc64le</arch>
      <model>POWER9</model>
      <vendor>IBM</vendor>
      <topology sockets='1' dies='1' cores='20' threads='4'/>
      <pages unit='KiB' size='64'/>
...
    </cpu>
   <topology>
      <cells num='2'>
        <cell id='0'>                     
          <memory unit='KiB'>129794944</memory>
          <pages unit='KiB' size='64'>2028046</pages>
...
            <cpu id='79' socket_id='0' die_id='0' core_id='84' siblings='76-79'/>
          </cpus>
        </cell>
        <cell id='8'>    
          <memory unit='KiB'>133912704</memory>
          <pages unit='KiB' size='64'>2092386</pages>
...
      </cells>
    </topology>

The cell ids are now displayed correctly, so I am marking this bug as verified.
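
A check like this could also be scripted. The Python sketch below flags duplicate NUMA cell ids in capabilities XML. It is only an illustration: the function name duplicate_cell_ids is hypothetical and the two sample documents are trimmed from the outputs quoted in this bug.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_cell_ids(capabilities_xml):
    """Return the sorted list of NUMA cell ids that appear more than once."""
    root = ET.fromstring(capabilities_xml.strip())
    ids = [cell.get("id") for cell in root.iterfind("./host/topology/cells/cell")]
    return sorted(cid for cid, count in Counter(ids).items() if count > 1)


# Trimmed from the original report: both cells carry id '0'.
BROKEN = """
<capabilities><host><topology><cells num='2'>
  <cell id='0'/><cell id='0'/>
</cells></topology></host></capabilities>
"""

# Trimmed from the verification output: ids '0' and '8', matching lscpu.
FIXED = """
<capabilities><host><topology><cells num='2'>
  <cell id='0'/><cell id='8'/>
</cells></topology></host></capabilities>
"""

if __name__ == "__main__":
    print(duplicate_cell_ids(BROKEN))  # ['0']
    print(duplicate_cell_ids(FIXED))   # []
```

On a live host the XML could be obtained by capturing the output of 'virsh capabilities' and passing it to the function.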

Comment 10 errata-xmlrpc 2020-11-17 17:47:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

