Bug 1987329 - Guest reports incorrect CPU topology when pinned and specifying CPU_THREADS
Summary: Guest reports incorrect CPU topology when pinned and specifying CPU_THREADS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.6.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.1
Assignee: Roman Mohr
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-29 14:02 UTC by djdumas
Modified: 2021-12-13 19:59 UTC
CC List: 9 users

Fixed In Version: virt-operator-container-v4.9.1-2 hco-bundle-registry-container-v4.9.1-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-13 19:59:01 UTC
Target Upstream Version:
Embargoed:


Attachments
yaml used including pinning (3.73 KB, text/plain)
2021-07-29 15:32 UTC, djdumas


Links
Github kubevirt/kubevirt pull 6392: [release-0.44] Ensure optimal CPU pinning with dedicated CPUs (open; last updated 2021-09-13 11:16:07 UTC)
Red Hat Product Errata RHBA-2021:5091 (last updated 2021-12-13 19:59:17 UTC)

Internal Links: 2109255

Description djdumas 2021-07-29 14:02:25 UTC
Description of problem:
We are seeing differences in the way a guest reports its own CPU topology when pinned; in this case we are using "sockets: 4, cores: 26, threads: 2".
 
When pinning (i.e. isolateEmulatorThread, dedicatedCpuPlacement, numa guestMappingPassthrough) we see that the VM yaml, xml, and qemu cmdline all show cores=26 threads=2; however, the guest reports:
Thread(s) per core:  1
Core(s) per socket:  52
Socket(s):           4

If we use the same definition but remove all pinning features, the topology is as expected:
Thread(s) per core:  2
Core(s) per socket:  26
Socket(s):           4

Version-Release number of selected component (if applicable):
Client Version: 4.7.2
Server Version: 4.7.2
Kubernetes Version: v1.20.0+5fbfd19

worker kernel: 4.18.0-240.15.1.el8_3.x86_64

How reproducible: always

Steps to Reproduce:
1. Create a guest with pinning (isolateEmulatorThread, dedicatedCpuPlacement, numa guestMappingPassthrough) and specify the CPU topology as CPU_SOCKETS="4" CPU_CORES="26" CPU_THREADS="2"
2. Verify the CPU topology in the guest with lscpu (a one-liner is sketched after this list) - incorrect topology (core and thread count)
3. Remove the pinning and check again with lscpu - correct topology
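
The one-liner referenced in step 2, run inside the guest (a quick check, not part of the original report); it captures just the three relevant lscpu lines:

  lscpu | grep -E '^(Thread|Core|Socket)'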

Actual results: incorrect cpu topology when pinning


Expected results: correct cpu topology


Additional info:

Comment 1 Roman Mohr 2021-07-29 14:29:31 UTC
David, could you share the domain xml and the yaml you were using?

Comment 2 Jenifer Abrams 2021-07-29 14:40:53 UTC
Note we are reserving 1 core per numa node on this worker: 
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    reservedSystemCPUs: "0,112,1,113,2,114,3,115"
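
For anyone reproducing this, the policy and reservation actually in effect can be double-checked from the cluster; a minimal sketch, assuming the KubeletConfig is named "cpumanager-enabled" (the name mentioned later in comment 43) and that the rendered config sits at the usual OCP path on the worker:

  # from the cluster API
  oc get kubeletconfig cpumanager-enabled -o yaml | grep -E 'cpuManagerPolicy|reservedSystemCPUs'
  # or directly on the worker node (path may differ by release)
  oc debug node/<worker> -- chroot /host grep -E 'cpuManagerPolicy|reservedSystemCPUs' /etc/kubernetes/kubelet.conf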

Comment 3 Jenifer Abrams 2021-07-29 14:43:45 UTC
Here are the yaml and xml snippets I saved the last time Dave booted the pinned guest that reports 1 thread per core:

VM yaml snippet:
          cores: 26
          dedicatedCpuPlacement: true
          features:
          - name: invtsc
            policy: require
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
          sockets: 4
          threads: 2

virt-launcher pod xml itself:
    <topology sockets='4' dies='1' cores='26' threads='2'/>

Comment 4 Jenifer Abrams 2021-07-29 15:16:43 UTC
From virt-launcher, full XML when the pinned guest was running:
http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/CNV/debug/BZ1987329/pinned-guest.xml

virsh capabilities for the host node:
http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/CNV/debug/BZ1987329/virsh_cap.xml

Host topo:
CPU(s):              224
On-line CPU(s) list: 0-223
Thread(s) per core:  2
Core(s) per socket:  28
Socket(s):           4
NUMA node(s):        4

Comment 5 djdumas 2021-07-29 15:32:50 UTC
Created attachment 1807948 [details]
yaml used including pinning

Comment 6 djdumas 2021-07-29 15:34:43 UTC
(In reply to Roman Mohr from comment #1)
> David, could you share the domain xml and the yaml you were using?

Hi Roman, Jenifer already provided the xml, yaml is provided as attachment

Comment 7 Vladik Romanovsky 2021-08-03 13:39:00 UTC
Daniel, maybe you can take a look: https://bugzilla.redhat.com/show_bug.cgi?id=1987329#c4 has the full libvirt XML and virsh capabilities from the host.

The topology in the XML is set to: 

<cpu mode="host-passthrough" check="none" migratable="on">
<topology sockets="4" dies="1" cores="26" threads="2"/>


When the following physical CPUs are masked from the virt-launcher container:
"0,112,1,113,2,114,3,115"

The topology in the guest (rhel8) shows up as:

Thread(s) per core:  1
Core(s) per socket:  52
Socket(s):           4

Comment 8 Daniel Berrangé 2021-08-03 13:50:11 UTC
(In reply to djdumas from comment #6)
> (In reply to Roman Mohr from comment #1)
> > David, could you share the domain xml and the yaml you were using?
> 
> Hi Roman, Jenifer already provided the xml, yaml is provided as attachment

The XML config provided only shows the VM where pinning is used.  The original description indicates the behaviour of the guest is different from a non-pinned scenario. We thus need to also see the XML config for the non-pinned case that is being compared with.

It is also desirable to see the /var/log/libvirt/qemu/$GUEST.log file for the pinned and non-pinned guests.

Finally, can you confirm that the *exact* same physical host is used for the 2 VMs being compared, and what is the guest OS in question?
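
A sketch of how the requested XML and log could be gathered from the virt-launcher pod (the pod and domain names below are placeholders based on the log names quoted later; the "compute" container name and exact paths may differ by version):

  POD=virt-launcher-sapvm1-xxxxx                                     # hypothetical pod name
  oc exec "$POD" -c compute -- virsh list --all                      # find the domain name
  oc exec "$POD" -c compute -- virsh dumpxml default_sapvm1 > pinned-guest.xml
  oc exec "$POD" -c compute -- cat /var/log/libvirt/qemu/default_sapvm1.log > pinned-qemu.log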

Comment 12 djdumas 2021-08-03 19:42:30 UTC
(In reply to Daniel Berrangé from comment #8)
> (In reply to djdumas from comment #6)
> > (In reply to Roman Mohr from comment #1)
> > > David, could you share the domain xml and the yaml you were using?
> > 
> > Hi Roman, Jenifer already provided the xml, yaml is provided as attachment
> 
> The XML config provided only shows the VM where pinning is used.  The
> original description indicates the behaviour of the guest is different from
> a non-pinned scenario. We thus need to also see the XML config for the
> non-pinned case that is being compared with.
> 
> It is also desirable to see the /var/log/libvirt/qemu/$GUEST.log file for
> the pinned and non-pinned guests.

See attachments above for no-pinning xml, pinning and no-pinning logs
> 
> Finally can you confirm that the *exact* same physical host is used for the
> 2 VMs being compared, and what is the guest OS in question ?

Yes, these are from the same physical host.
Guest OS is RHEL 8.2 (kernel 4.18.0-193.56.1.el8_2)

Comment 13 Daniel Berrangé 2021-08-05 10:52:19 UTC
There's no configuration in QEMU that would explain this difference.

Can you report "hwloc-ls" output and /proc/cpuinfo contents from the guest for both the pinned and unpinned cases?
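
For reference, a minimal way to collect those from inside the guest (repeat for the pinned and unpinned runs; hwloc-ls comes from the hwloc package and may need to be installed):

  hwloc-ls > hwloc.txt
  cat /proc/cpuinfo > cpuinfo.txt
  # compact per-CPU topology view: processor, physical id, core id on one line each
  grep -E '^(processor|physical id|core id)' /proc/cpuinfo | paste - - -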

Comment 19 Daniel Berrangé 2021-08-05 14:48:48 UTC
Ok, so based on this guest info and especially the error message from hwloc, I'm thinking there must be a bug in the way QEMU is exposing CPU topology information. I struggle to understand how CPU pinning can trigger such a bug, but it nonetheless seems to exist.

The libvirt log file shows a mixture of fairly old software versions.

I think the next step is to reproduce with the latest released RHEL 8.4 kernel instead of the outdated 8.3 kernel, along with the RHEL 8.4 qemu-kvm package from the RHEL-AV repos instead of the Fedora qemu build.

If the pure RHEL-8.4 virt host shows the error, then assign the bug to RHEL-AV product + qemu-kvm component for investigation by QEMU maintainers.

Comment 21 Jenifer Abrams 2021-08-06 19:50:45 UTC
Does this still happen w/out the numa passthrough?
            numa:
              guestMappingPassthrough : {}

I think I see a potential issue in the threads=2 case:

Due to the kubelet reservedSystemCPUs (0,112,1,113,2,114,3,115) and other pods running on other cpus, this is the cpuset the NUMA-pinned VM's pod gets:
4-102,104-106,108-110,116-210,212-214,216-218,220-222

which gives us from the host:
N0: 54 cpus
N1: 54 cpus
N2: 54 cpus
N3: 49 cpus

guest definition:

      <cell id='0' cpus='0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,99,102,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189,193,197,200,203,206' memory='728760320' unit='KiB'/>
      <cell id='1' cpus='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,100,103,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190,194,198,201,204,207' memory='728760320' unit='KiB'/>
      <cell id='2' cpus='2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,101,104,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191,195,199,202,205' memory='728760320' unit='KiB'/>
      <cell id='3' cpus='3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196' memory='728760320' unit='KiB'/>

cell0 is 54 cpus
cell1 is 54 cpus
cell2 is 53 cpus -- odd number
cell3 is 47 cpus -- odd number

Also, the guest cpu/cell alignment looks strange after cpu98:

guest cells:
0:  92   96   99  102
1:  93   97  100  103
2:  94   98  101  104
3:  95  108  112  116

which maps to these host cpus:
0:  96  100  104  108
1:  97  101  105  109
2:  98  102  106  110
3:  99  119  123  127

It seems like it's getting out of line once it reaches the first single-cpu gap in the pod cpuset, at host cpu 103?
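
A rough sketch of how the per-node counts above can be derived (pod name is a placeholder, add -c compute if needed; the cgroup v1 path is assumed, and the lscpu step has to run on the worker itself, e.g. via oc debug node):

  POD=virt-launcher-sapvm1-xxxxx                    # hypothetical pod name
  CPUS=$(oc exec "$POD" -- cat /sys/fs/cgroup/cpuset/cpuset.cpus)
  # expand ranges like 4-102,104-106 into one CPU number per line
  echo "$CPUS" | tr ',' '\n' | awk -F- '{ if (NF == 2) for (i = $1; i <= $2; i++) print i; else print $1 }' > /tmp/pod_cpus
  # on the worker: map each of those CPUs to its NUMA node and count per node
  lscpu -p=CPU,NODE | grep -v '^#' > /tmp/cpu_node
  awk -F, 'NR == FNR { want[$1]; next } ($1 in want) { n[$2]++ } END { for (node in n) print "N" node ":", n[node], "cpus" }' /tmp/pod_cpus /tmp/cpu_node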

Comment 22 Jenifer Abrams 2021-08-06 21:34:13 UTC
> which maps to these host cpus:
> 0:  96  100  104  108
> 1:  97  101  105  109
> 2:  98  102  106  110
> 3:  99  119  123  127

Actually, now I see 119 is just the next cpu in N3 available in the cpuset, but I'm still wondering whether the odd number of cpus is a problem?

Comment 24 djdumas 2021-08-09 21:30:14 UTC
(In reply to Jenifer Abrams from comment #21)
> Does this still happen w/out the numa passthrough?
>             numa:
>               guestMappingPassthrough : {}
> 

When numa passthrough is removed, the sockets/cores/threads information is correct.

[root@hanavirt52 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              208
On-line CPU(s) list: 0-207
Thread(s) per core:  2
Core(s) per socket:  26
Socket(s):           4
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8280M CPU @ 2.70GHz
Stepping:            7
CPU MHz:             2693.670
BogoMIPS:            5387.34
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-207
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities

Comment 25 Eduardo Habkost 2021-08-10 14:54:40 UTC
(In reply to djdumas from comment #24)
> (In reply to Jenifer Abrams from comment #21)
> > Does this still happen w/out the numa passthrough?
> >             numa:
> >               guestMappingPassthrough : {}
> > 
> 
> When numa passthrough is removed, the sockets/cores/threads information is
> correct.

I will try to take a closer look at the logs, but if changing the VM NUMA topology changes behavior, it probably means you are creating a VM configuration where a CPU core is split between two different NUMA nodes (and this makes some guest software treat them as two distinct cores).  I remember seeing 'lscpu', specifically, getting very confused when the NUMA topology didn't match the CPU core topology.
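
One quick way to see this from inside the guest (not from the original report; a simple check assuming the kernel keeps the CPUID-derived core IDs while the NUMA description splits the siblings across nodes):

  # list unique (core, node) pairs and flag any core that shows up under more than one node
  lscpu -p=CORE,NODE | grep -v '^#' | sort -u | awk -F, '{ n[$1]++ } END { for (c in n) if (n[c] > 1) print "core " c " spans " n[c] " NUMA nodes" }'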

Comment 26 Eduardo Habkost 2021-08-10 15:08:32 UTC
(In reply to Jenifer Abrams from comment #21)
> guest definition:
> 
>       <cell id='0'
> cpus='0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,
> 96,99,102,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,
> 169,173,177,181,185,189,193,197,200,203,206' memory='728760320' unit='KiB'/>
>       <cell id='1'
> cpus='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,
> 97,100,103,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,
> 170,174,178,182,186,190,194,198,201,204,207' memory='728760320' unit='KiB'/>
>       <cell id='2'
> cpus='2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,
> 98,101,104,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,
> 171,175,179,183,187,191,195,199,202,205' memory='728760320' unit='KiB'/>
>       <cell id='3'
> cpus='3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,
> 108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,
> 184,188,192,196' memory='728760320' unit='KiB'/>

If this is a cores=26,threads=2 guest, you are splitting every single CPU core into two separate NUMA nodes.

CPUs 0-1 are in the same core but different NUMA nodes, which makes the guest assume they are actually two separate cores with 1 thread each.  The same for CPUs 2-3, 4-5, 6-7, etc.  This may cause all kinds of confusion and inconsistencies in guest software because it never happens in real hardware.
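
A toy illustration of the mismatch (not the actual KubeVirt code): with threads=2 the guest's sibling pairs are (0,1), (2,3), ..., while the cells in the XML above place vCPU N in cell N % 4, so the two threads of every core land in different nodes:

  # first eight vCPUs of socket 0: core/thread from the topology vs. NUMA cell from the XML
  for vcpu in $(seq 0 7); do
    printf 'vCPU %d -> guest socket 0, core %d, thread %d, but NUMA cell %d\n' \
      "$vcpu" $((vcpu / 2)) $((vcpu % 2)) $((vcpu % 4))
  done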

Comment 27 Daniel Berrangé 2021-08-10 16:16:42 UTC
(In reply to Eduardo Habkost from comment #26)
> (In reply to Jenifer Abrams from comment #21)
> If this is a cores=26,thread=2 guest, you are splitting every single CPU
> core into two separate NUMA nodes.
> 
> CPUs 0-1 are in the same core but different NUMA nodes, which makes the
> guest assume they are actually two separate cores with 1 thread each.  The
> same for CPUs 2-3, 4-5, 6-7, etc.  This may cause all kinds of confusion and
> inconsistencies in guest software because it never happens in real hardware.

Regardless of whether this config is sensible or not, there still looks to be a QEMU bug here. Merely pinning a guest CPU to a specific host CPU should not affect what topology is exposed to the guest, as that's merely a host-side tuning knob, not guest ABI. It feels like there is something here not getting virtualized correctly, accidentally exposing some aspect of the host CPU that varies depending on which host CPU we happen to be placed on.

Comment 28 Eduardo Habkost 2021-08-10 17:38:42 UTC
(In reply to Daniel Berrangé from comment #27)
> (In reply to Eduardo Habkost from comment #26)
> > (In reply to Jenifer Abrams from comment #21)
> > If this is a cores=26,thread=2 guest, you are splitting every single CPU
> > core into two separate NUMA nodes.
> > 
> > CPUs 0-1 are in the same core but different NUMA nodes, which makes the
> > guest assume they are actually two separate cores with 1 thread each.  The
> > same for CPUs 2-3, 4-5, 6-7, etc.  This may cause all kinds of confusion and
> > inconsistencies in guest software because it never happens in real hardware.
> 
> Regardless of whether this config is sensible or not though, there still
> looks like a QEMU bug here.  Merely pinning a guest CPU to a specific host
> CPU  should not be affecting what topology is exposed to the guest, as
> that's merely a host side tuning knob, not guest ABI. It feels like there is
> something here not getting virtualized correctly and accidentely exposing
> some aspect of the host CPU that varies depending on which host CPU we
> happen to be placed on

I would agree if merely changing the CPU pinning configuration were affecting the guest topology, but that's not what I saw in the log files at comment #10 and comment #11.

default_sapvm1_pinning.log has:

-numa node,nodeid=0,cpus=0,cpus=4,cpus=8,cpus=12,cpus=16,cpus=20,cpus=24,cpus=28,cpus=32,cpus=36,cpus=40,cpus=44,cpus=48,cpus=52,cpus=56,cpus=60,cpus=64,cpus=68,cpus=72,cpus=76,cpus=80,cpus=84,cpus=88,cpus=92,cpus=96,cpus=99,cpus=102,cpus=105,cpus=109,cpus=113,cpus=117,cpus=121,cpus=125,cpus=129,cpus=133,cpus=137,cpus=141,cpus=145,cpus=149,cpus=153,cpus=157,cpus=161,cpus=165,cpus=169,cpus=173,cpus=177,cpus=181,cpus=185,cpus=189,cpus=193,cpus=197,cpus=200,cpus=203,cpus=206,memdev=ram-node0 \
-object memory-backend-memfd,id=ram-node1,hugetlb=yes,hugetlbsize=1073741824,prealloc=yes,size=746250567680,host-nodes=1,policy=bind \
-numa node,nodeid=1,cpus=1,cpus=5,cpus=9,cpus=13,cpus=17,cpus=21,cpus=25,cpus=29,cpus=33,cpus=37,cpus=41,cpus=45,cpus=49,cpus=53,cpus=57,cpus=61,cpus=65,cpus=69,cpus=73,cpus=77,cpus=81,cpus=85,cpus=89,cpus=93,cpus=97,cpus=100,cpus=103,cpus=106,cpus=110,cpus=114,cpus=118,cpus=122,cpus=126,cpus=130,cpus=134,cpus=138,cpus=142,cpus=146,cpus=150,cpus=154,cpus=158,cpus=162,cpus=166,cpus=170,cpus=174,cpus=178,cpus=182,cpus=186,cpus=190,cpus=194,cpus=198,cpus=201,cpus=204,cpus=207,memdev=ram-node1 \
-object memory-backend-memfd,id=ram-node2,hugetlb=yes,hugetlbsize=1073741824,prealloc=yes,size=746250567680,host-nodes=2,policy=bind \
-numa node,nodeid=2,cpus=2,cpus=6,cpus=10,cpus=14,cpus=18,cpus=22,cpus=26,cpus=30,cpus=34,cpus=38,cpus=42,cpus=46,cpus=50,cpus=54,cpus=58,cpus=62,cpus=66,cpus=70,cpus=74,cpus=78,cpus=82,cpus=86,cpus=90,cpus=94,cpus=98,cpus=101,cpus=104,cpus=107,cpus=111,cpus=115,cpus=119,cpus=123,cpus=127,cpus=131,cpus=135,cpus=139,cpus=143,cpus=147,cpus=151,cpus=155,cpus=159,cpus=163,cpus=167,cpus=171,cpus=175,cpus=179,cpus=183,cpus=187,cpus=191,cpus=195,cpus=199,cpus=202,cpus=205,memdev=ram-node2 \
-object memory-backend-memfd,id=ram-node3,hugetlb=yes,hugetlbsize=1073741824,prealloc=yes,size=746250567680,host-nodes=3,policy=bind \
-numa node,nodeid=3,cpus=3,cpus=7,cpus=11,cpus=15,cpus=19,cpus=23,cpus=27,cpus=31,cpus=35,cpus=39,cpus=43,cpus=47,cpus=51,cpus=55,cpus=59,cpus=63,cpus=67,cpus=71,cpus=75,cpus=79,cpus=83,cpus=87,cpus=91,cpus=95,cpus=108,cpus=112,cpus=116,cpus=120,cpus=124,cpus=128,cpus=132,cpus=136,cpus=140,cpus=144,cpus=148,cpus=152,cpus=156,cpus=160,cpus=164,cpus=168,cpus=172,cpus=176,cpus=180,cpus=184,cpus=188,cpus=192,cpus=196,memdev=ram-node3 \
[...]


default_sapvm1_nopinning.log has:

-numa node,nodeid=0,cpus=0-207,memdev=ram-node0 \
[...]
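
For reference, the same -numa wiring can be pulled straight out of the two qemu logs for a side-by-side comparison (file names as above; assumes the command line is logged with one backslash-continued argument per line, as quoted here):

  grep -o -- '-numa node[^\\]*' default_sapvm1_pinning.log
  grep -o -- '-numa node[^\\]*' default_sapvm1_nopinning.log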

Comment 29 Daniel Berrangé 2021-08-11 08:52:18 UTC
(In reply to Eduardo Habkost from comment #28)
> (In reply to Daniel Berrangé from comment #27)
> > (In reply to Eduardo Habkost from comment #26)
> > Regardless of whether this config is sensible or not though, there still
> > looks like a QEMU bug here.  Merely pinning a guest CPU to a specific host
> > CPU  should not be affecting what topology is exposed to the guest, as
> > that's merely a host side tuning knob, not guest ABI. It feels like there is
> > something here not getting virtualized correctly and accidentely exposing
> > some aspect of the host CPU that varies depending on which host CPU we
> > happen to be placed on
> 
> I would agree if merely changing CPU pinning configuration is affecting the
> guest topology, but that's not what I saw on the log files at comment #10
> and comment #11.

Oh, my bad, I misinterpreted the differences.

Comment 30 djdumas 2021-08-11 20:19:57 UTC
I'm attaching a spreadsheet that may help show a CPU configuration problem that is more than just the use of threads.

The spreadsheet covers the 3 types of configurations we've tried - 214s/1c/1t, 4s/53c/1t, and 4s/26c/2t.
The NUMA definition from the xml is only correct in the 214s/1c/1t case.

If you scroll down the vcpupin column for the 214s case you'll see where the cpuset numbers are no longer sequential (highlighted in yellow). That's not a problem - look over to the next column, the numa definition from the xml, and you'll see that there is no corresponding gap in the guest lscpu information. All good: the guest lscpu matches the host lscpu up to 214 vCPUs.

Do the same exercise on the 4s-53c-1t tab (scroll down the vcpupin column until cpuset 110/116, highlighted in yellow). It looks like the gap is being carried over to the numa definition, which results in incorrect numa definitions from cpuset 110 onward.

Comment 32 Eduardo Habkost 2021-08-11 20:43:45 UTC
(In reply to djdumas from comment #30)
> I'm attaching a spreadsheet that may help show a CPU configuration problem
> that is more than just the use of threads.
> 
> The spreadsheet covers the 3 types of configurations we've tried -
> 214s/1c/1t, 4s/53c/1t, and 4s/26c/2t.
> The NUMA definition from the xml is only correct in the 214s/1c/1t case.
> 
> If you scroll down the vcpupin for the 214s case you'll see where the cpuset
> numbers are no longer sequential (highlighted in yellow). That's not a
> problem - look over to the next column - the numa definition from the xml -
> and you'll see that there is no corresponding gap in the guest lscpu
> information.  All good, the guest lscpu matches the host lscpu up to 214
> vCPUs.
> 
> Do the same exercise on the 4s-53c-1t tab (scroll down vcpupin column until
> cpuset 110/116 highlighted in yellow).  It looks like this is being carried
> over to the numa definition which results in incorrect numa definitions from
> cpuset 110 and on.

If the "numa definition from xml (cpuset)" section on 4s-53c-1t and 4s-26c-2t reflect the actual <cell cpus='...'> values, the domain XML really seems wrong.  Are VCPUs 0-3 completely missing from the <cell> elements in the XML?

Comment 33 djdumas 2021-08-11 20:57:39 UTC
(In reply to Eduardo Habkost from comment #32)
> (In reply to djdumas from comment #30)
> > I'm attaching a spreadsheet that may help show a CPU configuration problem
> > that is more than just the use of threads.
> > 
> > The spreadsheet covers the 3 types of configurations we've tried -
> > 214s/1c/1t, 4s/53c/1t, and 4s/26c/2t.
> > The NUMA definition from the xml is only correct in the 214s/1c/1t case.
> > 
> > If you scroll down the vcpupin for the 214s case you'll see where the cpuset
> > numbers are no longer sequential (highlighted in yellow). That's not a
> > problem - look over to the next column - the numa definition from the xml -
> > and you'll see that there is no corresponding gap in the guest lscpu
> > information.  All good, the guest lscpu matches the host lscpu up to 214
> > vCPUs.
> > 
> > Do the same exercise on the 4s-53c-1t tab (scroll down vcpupin column until
> > cpuset 110/116 highlighted in yellow).  It looks like this is being carried
> > over to the numa definition which results in incorrect numa definitions from
> > cpuset 110 and on.
> 
> If the "numa definition from xml (cpuset)" section on 4s-53c-1t and
> 4s-26c-2t reflect the actual <cell cpus='...'> values, the domain XML really
> seems wrong.  Are VCPUs 0-3 completely missing from the <cell> elements in
> the XML?

No, vcpus 0-3 are there - sorry I omitted them somehow, but the rest is correct.  I'll replace the attachment.
The following is from the 4s/53c/1t xml

    <numa>
      <cell id='0' cpus='0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191,195,199,203,207,210' memory='728760320' unit='KiB'/>
      <cell id='1' cpus='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,208,211' memory='728760320' unit='KiB'/>
      <cell id='2' cpus='2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189,193,197,201,205,209' memory='728760320' unit='KiB'/>
      <cell id='3' cpus='3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190,194,198,202,206' memory='728760320' unit='KiB'/>

Comment 35 Jenifer Abrams 2021-08-11 21:48:59 UTC
On the 2 different issues:

1) As Dave pointed out, changing the guest topology from 'sockets:214 cores:1 threads:1' to 'sockets:4 cores:53 threads:1' results in that strange numbering in the guest numa cells 

              214s (214 total):     4s 53c (212 total):
guest cell N0:   100   104   108       100   104   107
guest cell N1:   101   105   109       101   105   108
guest cell N2:   102   106   110       102   106   109
guest cell N3:   103   107   111       103   110   114
========================================================
      host N0:   104   108   116       104   108   116
      host N1:   105   109   117       105   109   117
      host N2:   106   110   118       106   110   118
      host N3:   107   111   119       107   119   123

I am guessing the host numa pinning is correct for 4s53c if one of the 2 missing cpus in the pod cpuset means host cpu 111 was not included, but why is vcpu 107 not mapped to host cpu 119? Maybe it doesn't actually matter as long as things are numa-aligned, but the vcpu ordering is a bit confusing.

2) As Eduardo said, the current numa pinning logic won't work for threads:2 since it pins across numa nodes:
vcpu='0'	cpuset='4'/>
vcpu='1'	cpuset='5'/>
vcpu='2'	cpuset='6'/>
vcpu='3'	cpuset='7'/>

In past KVM testing with numa pinning, the guest [0,1] [2,3] sibling pairs were pinned to physical siblings like so; it did not start pinning across nodes but filled all of N0 first, then N1, and so on, e.g.:
<vcpupin    vcpu='0'    cpuset='4'/>
<vcpupin    vcpu='1'    cpuset='116'/>
<vcpupin    vcpu='2'    cpuset='8'/>
<vcpupin    vcpu='3'    cpuset='120'/>
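
Not the actual fix, just a toy sketch of the pattern described above: pin guest sibling pairs onto host hyperthread sibling pairs within one node, filling N0 first (the host pairs below are hypothetical, picked to match the example vcpupin lines):

  HOST_PAIRS="4,116 8,120 12,124 16,128"   # hypothetical N0 sibling pairs
  vcpu=0
  for pair in $HOST_PAIRS; do
    t0=${pair%,*}; t1=${pair#*,}
    echo "<vcpupin vcpu='$vcpu' cpuset='$t0'/>"
    echo "<vcpupin vcpu='$((vcpu + 1))' cpuset='$t1'/>"
    vcpu=$((vcpu + 2))
  done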

Comment 36 Roman Mohr 2021-08-12 06:40:55 UTC
(In reply to Eduardo Habkost from comment #26)
> (In reply to Jenifer Abrams from comment #21)
> > guest definition:
> > 
> >       <cell id='0'
> > cpus='0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,
> > 96,99,102,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,
> > 169,173,177,181,185,189,193,197,200,203,206' memory='728760320' unit='KiB'/>
> >       <cell id='1'
> > cpus='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,
> > 97,100,103,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,
> > 170,174,178,182,186,190,194,198,201,204,207' memory='728760320' unit='KiB'/>
> >       <cell id='2'
> > cpus='2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,
> > 98,101,104,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,
> > 171,175,179,183,187,191,195,199,202,205' memory='728760320' unit='KiB'/>
> >       <cell id='3'
> > cpus='3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,
> > 108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,
> > 184,188,192,196' memory='728760320' unit='KiB'/>
> 
> If this is a cores=26,thread=2 guest, you are splitting every single CPU
> core into two separate NUMA nodes.
> 
> CPUs 0-1 are in the same core but different NUMA nodes, which makes the
> guest assume they are actually two separate cores with 1 thread each.  The
> same for CPUs 2-3, 4-5, 6-7, etc.  This may cause all kinds of confusion and
> inconsistencies in guest software because it never happens in real hardware.

If this is the issue, then this is a clear bug in my pinning logic. That is not my intent. Working on it. Thanks.

Comment 38 Roman Mohr 2021-09-13 09:44:10 UTC
Merged on main: https://github.com/kubevirt/kubevirt/pull/6251

Comment 39 Roman Mohr 2021-09-13 11:15:44 UTC
Backport is filed: https://github.com/kubevirt/kubevirt/pull/6392

Comment 41 djdumas 2021-09-14 13:41:32 UTC
After the fix, lscpu now shows the correct sockets/cores/threads information in the guest:

[root@hanavirt52 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              208
On-line CPU(s) list: 0-207
Thread(s) per core:  2
Core(s) per socket:  26
Socket(s):           4
NUMA node(s):        4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8280M CPU @ 2.70GHz
Stepping:            7
CPU MHz:             2693.670
BogoMIPS:            5387.34
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-53
NUMA node1 CPU(s):   54-107
NUMA node2 CPU(s):   108-161
NUMA node3 CPU(s):   162-207
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities

Comment 43 Kedar Bidarkar 2021-11-09 11:30:09 UTC
NOTE: My bare-metal setup had 64 CPUs per node.

With 2 Sockets
---------------

[kbidarka@localhost ocs]$ oc get vmi vm-rhel84-ocs-numa -o yaml | grep -A 10 "cpu:"
--
    cpu:
      cores: 15
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true
      model: host-passthrough
      numa:
        guestMappingPassthrough: {}
      sockets: 2
      threads: 2

Guest Login:
------------------
Last login: Tue Nov  9 06:01:56 on ttyS0
[cloud-user@vm-rhel84-ocs-numa ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              60
On-line CPU(s) list: 0-59
Thread(s) per core:  2
Core(s) per socket:  15
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2095.076
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-27
NUMA node1 CPU(s):   28-59
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
[cloud-user@vm-rhel84-ocs-numa ~]$ 

---------------------------------------------
With 4 sockets
-----------------

[kbidarka@localhost ocs]$ oc get vmi vm-rhel84-ocs-numa -o yaml | grep -A 10 "cpu:"
--
    cpu:
      cores: 7
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true
      model: host-passthrough
      numa:
        guestMappingPassthrough: {}
      sockets: 4
      threads: 2

Guest Login
----------------

[cloud-user@vm-rhel84-ocs-numa ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  2
Core(s) per socket:  7
Socket(s):           4
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2095.076
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-55
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
[cloud-user@vm-rhel84-ocs-numa ~]$ [kbidarka@localhost ocs]$ 


Summary:
1) The bug indeed appears to be fixed, as we see the correct socket, core, and thread counts when using the CPU pinning features (an extra sanity check is sketched below).
2) NOTE: 'reservedSystemCPUs: "0,112,1,113,2,114,3,115"' was not set in the KubeletConfig "cpumanager-enabled". Hope this is ok here?
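
The extra sanity check referenced in point 1, run inside the verified guest (not part of the original verification): with the fix, each guest NUMA node should hold an even number of CPUs, i.e. whole cores, when threads=2:

  lscpu -p=CPU,NODE | grep -v '^#' | awk -F, '{ n[$2]++ } END { for (node in n) print "node", node ":", n[node], "CPUs" }'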

Comment 44 Kedar Bidarkar 2021-11-09 11:43:19 UTC
VERIFIED with virt-operator-container-v4.9.1-4

Comment 45 Roman Mohr 2021-11-09 11:59:59 UTC
> 2) NOTE, 'reservedSystemCPUs: "0,112,1,113,2,114,3,115"' was not set in KubeletConfig "cpumanager-enabled". Hope this is ok, here?


Yes, that should be fine. We would have seen the wrong topology either way.

Comment 51 errata-xmlrpc 2021-12-13 19:59:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.9.1 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5091

