Bug 1351160 - qemu-kvm does not expose expected cpu topology to guest with numa defined
Summary: qemu-kvm does not expose expected cpu topology to guest with numa defined
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.3
Assignee: Eduardo Habkost
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 1018952
 
Reported: 2016-06-29 11:27 UTC by Timothy Rees
Modified: 2020-05-14 15:13 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-02 00:13:01 UTC
Target Upstream Version:
Embargoed:



Description Timothy Rees 2016-06-29 11:27:19 UTC
Description of problem:

When creating a single large VM with qemu-kvm to replicate the host CPU and NUMA topology, the guest does not display the expected CPU topology.

We are currently testing SAP HANA in a dedicated large VM on a single RHEV host, and as part of the performance testing it is vital that qemu exposes the correct CPU topology, NUMA topology, and processor-sibling information to the guest.

Test with 2 scenarios:
A: Guest cpu configured with mode=host-passthrough
B: Guest cpu configured with mode=custom match=exact


Version-Release number of selected component (if applicable):

libvirt-1.2.17-13.el7_2.4.x86_64
libvirt-client-1.2.17-13.el7_2.4.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.13.x86_64


How reproducible:

100%

SCENARIO A:

Steps to Reproduce:
1. Create a VM on the host using maximum number of processors available
2. Edit the XML and add in the following elements:
............
  <cpu mode='host-passthrough'>
    <topology sockets='4' cores='15' threads='2'/>
    <model fallback='forbid'/>
    <numa>
      <cell id='0' cpus='0-14,60-74' memory='131072000' unit='KiB'/>
      <cell id='1' cpus='15-29,75-89' memory='131072000' unit='KiB'/>
      <cell id='2' cpus='30-44,90-104' memory='131072000' unit='KiB'/>
      <cell id='3' cpus='45-59,105-119' memory='121072640' unit='KiB'/>
    </numa>
  </cpu>
............

3. Boot and login to the guest:
[root@guestvm ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                120
On-line CPU(s) list:   0-119
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E7-4880 v2 @ 2.50GHz
Stepping:              7
CPU MHz:               2493.988
BogoMIPS:              4987.97
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              38400K
NUMA node0 CPU(s):     0-14,60-74
NUMA node1 CPU(s):     15-29,75-89
NUMA node2 CPU(s):     30-44,90-104
NUMA node3 CPU(s):     45-59,105-119

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 127999 MB
node 0 free: 125072 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 128000 MB
node 1 free: 125162 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128000 MB
node 2 free: 125220 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 118235 MB
node 3 free: 115634 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10



Actual results:
Guest shows Sockets/Cores/Threads as 4/16/1


Expected results:
Per the host output and xml configuration, guest should show Sockets/Cores/Threads as 4/15/2


SCENARIO B:

Steps to Reproduce:
1. Create a VM on the host using maximum number of processors available
2. Edit the XML and add in the following elements, using the same number of sockets/cores/threads as defined on the host:
............
  <cpu mode='custom' match='exact'>
    <topology sockets='4' cores='15' threads='2'/>
    <model fallback='forbid'/>
    <numa>
      <cell id='0' cpus='0-14,60-74' memory='131072000' unit='KiB'/>
      <cell id='1' cpus='15-29,75-89' memory='131072000' unit='KiB'/>
      <cell id='2' cpus='30-44,90-104' memory='131072000' unit='KiB'/>
      <cell id='3' cpus='45-59,105-119' memory='121072640' unit='KiB'/>
    </numa>
  </cpu>
............

3. Boot and login to the guest:
[root@guestvm ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                120
On-line CPU(s) list:   0-119
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 13
Model name:            QEMU Virtual CPU version 2.3.0
Stepping:              3
CPU MHz:               2493.988
BogoMIPS:              4987.97
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-14,60-74
NUMA node1 CPU(s):     15-29,75-89
NUMA node2 CPU(s):     30-44,90-104
NUMA node3 CPU(s):     45-59,105-119

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 127999 MB
node 0 free: 125072 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 128000 MB
node 1 free: 125162 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128000 MB
node 2 free: 125220 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 118235 MB
node 3 free: 115634 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10




Actual results:
Guest shows Sockets/Cores/Threads as 4/16/1
Guest CPU Model does not match host CPU Model
Guest CPU Caches do not match host CPU caches


Expected results:
Per the host output and xml configuration, the guest should show Sockets/Cores/Threads as 4/15/2, and the cpu model and caches should match those of the host.



Additional info:
Host lscpu output
[root@hypervisor ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                120
On-line CPU(s) list:   0-119
Thread(s) per core:    2
Core(s) per socket:    15
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E7-4880 v2 @ 2.50GHz
Stepping:              7
CPU MHz:               1823.925
BogoMIPS:              4996.38
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              38400K
NUMA node0 CPU(s):     0-14,60-74
NUMA node1 CPU(s):     15-29,75-89
NUMA node2 CPU(s):     30-44,90-104
NUMA node3 CPU(s):     45-59,105-119

[root@hypervisor ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 130943 MB
node 0 free: 116339 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 131072 MB
node 1 free: 126032 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 131072 MB
node 2 free: 127861 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 131072 MB
node 3 free: 126456 MB
node distances:
node   0   1   2   3
  0:  10  11  11  11
  1:  11  10  11  11
  2:  11  11  10  11
  3:  11  11  11  10


* We have also seen the same behavior on another server; it is also an Ivy Bridge box, but with 8 CPUs.

* Removing the NUMA topology from the XML in each scenario results in the correct Sockets/Cores/Threads showing in the virtual machine; however, the guest then has no NUMA topology defined either.

* This work is taking place with our partner SAP to certify HANA workloads on RHEV.

* I will attach full xml files for each scenario.

Comment 2 Andrew Jones 2016-06-29 12:40:46 UTC
Please provide the qemu command lines ('pgrep -a qemu' while the guest is running).

Also, we should split scenarios A and B into two separate bugs; Scenario A deals with the topology change, and B with the unexpected guest cpu model (I think B is likely not-a-bug, but I'll allow others to respond)

Comment 3 Timothy Rees 2016-06-29 12:47:05 UTC
(In reply to Andrew Jones from comment #2)
> Please provide the qemu command lines ('pgrep -a qemu' while the guest is
> running).
> 
> Also, we should split scenarios A and B into two separate bugs; Scenario A
> deals with the topology change, and B with the unexpected guest cpu model (I
> think B is likely not-a-bug, but I'll allow others to respond)

[root@hypervisor ~]# pgrep -a qemu
103617 /usr/libexec/qemu-kvm -name rhel7.0-2 -S -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off,vmport=off -cpu host -m 502235 -realtime mlock=off -smp 120,sockets=4,cores=15,threads=2 -numa node,nodeid=0,cpus=0-14,cpus=60-74,mem=128000 -numa node,nodeid=1,cpus=15-29,cpus=75-89,mem=128000 -numa node,nodeid=2,cpus=30-44,cpus=90-104,mem=128000 -numa node,nodeid=3,cpus=45-59,cpus=105-119,mem=118235 -uuid 0174311e-bf55-41c9-be80-d097bc3f73e4 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-rhel7.0-2/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/dev/vg_sys_r1/lv_vm_rhel7,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1 -drive file=/dev/vg_sys_r1/lv_vm_rhel7_usr_sap,if=none,id=drive-virtio-disk2,format=raw,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk2,id=virtio-disk2 -drive file=/dev/vg_sys_r1/lv_hana,if=none,id=drive-virtio-disk3,format=raw,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk3,id=virtio-disk3 -drive if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:0e:c7:25,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-rhel7.0-2/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
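The relevant options here are "-smp 120,sockets=4,cores=15,threads=2" and the four "-numa node,..." entries, which mirror the XML topology above.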

Comment 4 Eduardo Habkost 2016-06-30 15:03:34 UTC
(In reply to Timothy Rees from comment #0)
[...]
> Test with 2 scenarios:
> A: Guest cpu configured with mode=host-passthrough
> B: Guest cpu configured with mode=custom match=exact
[...]
> SCENARIO A:
[...]
> 3. Boot and login to the guest:
> [root@guestvm ~]# lscpu
[...]
> CPU(s):                120
> On-line CPU(s) list:   0-119
> Thread(s) per core:    1
> Core(s) per socket:    16
> Socket(s):             4
> NUMA node(s):          4
[...]
> Actual results:
> Guest shows Sockets/Cores/Threads as 4/16/1
> 
> Expected results:
> Per the host output and xml configuration, guest should show
> Sockets/Cores/Threads as 4/15/2

This is unexpected, I will investigate.

But note that each socket still has 15 cores. The only
difference is that the guest assumes the _maximum_ number of
cores per socket is 16; the _actual_ number of cores per socket
is exactly the one requested.

Core ID 15 won't be present in any of the sockets, only core IDs
0-14. This way, the CPU core topology and NUMA topology still
match the host as expected.
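
For illustration, here is a small Python sketch of why the maximum
comes out as 16. It mirrors the standard x86 topology encoding (each
topology field in the APIC ID is widened to the next power of two, so
15 cores still occupy a 4-bit core field with room for 16); it is an
illustration only, not code taken from QEMU.

    # Sketch: power-of-two sizing of APIC ID topology fields
    # (illustration only; not QEMU source code).
    def field_width(n):
        # Bits needed to encode IDs 0..n-1.
        return (n - 1).bit_length()

    sockets, cores, threads = 4, 15, 2
    thread_bits = field_width(threads)  # 1 bit for 2 threads
    core_bits = field_width(cores)      # 4 bits for 15 cores
    print(2 ** core_bits)   # 16 -> maximum cores/socket the guest infers
    print(cores)            # 15 -> actual cores per socket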

I would like to understand what real-world problems are caused
by this behavior. That way, we can look for workarounds if
necessary, and will be able to justify changing QEMU behavior
upstream.

> 
> SCENARIO B:
> 3. Boot and login to the guest:
> [root@guestvm ~]# lscpu
[...]
> CPU(s):                120
> On-line CPU(s) list:   0-119
> Thread(s) per core:    1
> Core(s) per socket:    16
> Socket(s):             4
> NUMA node(s):          4
[...]
> Actual results:
> Guest shows Sockets/Cores/Threads as 4/16/1

See comment about scenario A above.


> Guest CPU Model does not match host CPU Model
> Guest CPU Caches do not match host CPU caches
>

This is expected. The CPU model will match the host exactly only
when using host-passthrough. CPU caches will match the host only
when using "-cpu host" (host-passthrough in libvirt) together with
"host-cache-info=on" (not supported by libvirt yet).

Note: Old versions of QEMU (including the current 7.2
qemu-kvm-rhev version) have a bug that makes them forward host CPU
cache information unconditionally, but this will be fixed in 7.3.
See bug 1184125.

As with scenario A, I would like to understand what real-world
problems are caused by this, so that we can look for workarounds
if necessary.

Comment 5 Eduardo Habkost 2016-07-02 00:13:01 UTC
(In reply to Eduardo Habkost from comment #4)
> (In reply to Timothy Rees from comment #0)
[...]
> > Actual results:
> > Guest shows Sockets/Cores/Threads as 4/16/1
> > 
> > Expected results:
> > Per the host output and xml configuration, guest should show
> > Sockets/Cores/Threads as 4/15/2
> 
> This is unexpected, I will investigate.
> 

I have reproduced the issue, and noticed that the configuration shown in the bug description is confusing lscpu because it places CPU threads from the same core in different NUMA nodes.

In other words, this:

    <topology sockets='4' cores='15' threads='2'/>

means that CPUs 0-29 will be in socket 0, 30-59 in socket 1, 60-89 in socket 2, 90-119 in socket 3.

Inside socket 0:
CPUs 0-1 will be in core 0
CPUs 2-3 will be in core 1
[...]
CPUs 12-13 will be in core 6
CPUs 14-15 will be in core 7
[...]
CPUs 26-27 will be in core 13
CPUs 28-29 will be in core 14

The same pattern will repeat in the other sockets.

(Note that the CPU numbers above won't match the host because the CPU numbers in the host are arbitrary. We order CPUs in the ACPI tables by socket/core/thread IDs (more exactly, by APIC ID, which encodes the socket/core/thread IDs), and the host seems to order them in a different way. You can check the exact ordering using 'lscpu -e' on the host.)
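
As a quick illustration, this Python sketch reproduces the guest
numbering described above (the linear socket/core/thread ordering;
an illustration, not QEMU code):

    # Sketch of the guest CPU numbering: ordered by socket, then
    # core, then thread (illustration only).
    sockets, cores, threads = 4, 15, 2
    for cpu in (14, 15, 29, 30):     # a few interesting CPU numbers
        socket = cpu // (cores * threads)
        core = (cpu // threads) % cores
        thread = cpu % threads
        print(f"CPU {cpu}: socket {socket}, core {core}, thread {thread}")
    # CPU 14: socket 0, core 7, thread 0
    # CPU 15: socket 0, core 7, thread 1  (same core as CPU 14)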

However:

    <numa>
      <cell id='0' cpus='0-14,60-74' memory='131072000' unit='KiB'/>
      <cell id='1' cpus='15-29,75-89' memory='131072000' unit='KiB'/>
      <cell id='2' cpus='30-44,90-104' memory='131072000' unit='KiB'/>
      <cell id='3' cpus='45-59,105-119' memory='121072640' unit='KiB'/>
    </numa>

will place one thread of core 7 (CPU 14) in NUMA node 0, and another thread (CPU 15) in NUMA node 1. This configuration doesn't make sense.
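
To make the problem visible, here is a Python sketch that applies the
numbering above to the <numa> cells from the description and lists
every core whose threads land in different cells (cell ranges copied
from the XML; illustration only, not a check qemu performs):

    # Sketch: find cores whose two threads are split across NUMA cells.
    # The cell CPU ranges are copied from the <numa> element above.
    cells = {
        0: set(range(0, 15)) | set(range(60, 75)),
        1: set(range(15, 30)) | set(range(75, 90)),
        2: set(range(30, 45)) | set(range(90, 105)),
        3: set(range(45, 60)) | set(range(105, 120)),
    }

    def node_of(cpu):
        return next(n for n, cpus in cells.items() if cpu in cpus)

    sockets, cores, threads = 4, 15, 2
    for s in range(sockets):
        for c in range(cores):
            first = (s * cores + c) * threads
            nodes = {node_of(first + t) for t in range(threads)}
            if len(nodes) > 1:
                print(f"socket {s} core {c}: CPUs {first}-{first + 1} "
                      f"span nodes {sorted(nodes)}")
    # e.g. "socket 0 core 7: CPUs 14-15 span nodes [0, 1]"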

I don't know the exact NUMA topology of the host, but I assume you want to place all cores from the same socket inside the same NUMA node. Probably something like:

    <numa>
      <cell id='0' cpus='0-29' memory='131072000' unit='KiB'/>
      <cell id='1' cpus='30-59' memory='131072000' unit='KiB'/>
      <cell id='2' cpus='60-89' memory='131072000' unit='KiB'/>
      <cell id='3' cpus='90-119' memory='121072640' unit='KiB'/>
    </numa>

To confirm you are reproducing exactly the same topology as the host, you can compare the output of 'lscpu -e' on the host and in the guest (ignoring the first column, which is an arbitrary CPU number).

Comment 6 Eduardo Habkost 2016-07-02 00:55:35 UTC
Discussion was started in qemu-devel to see if we can provide mechanisms to make this less confusing to configure:
http://article.gmane.org/gmane.comp.emulators.qemu/423961

