Description of problem:

We are seeing differences in the way a guest reports its own CPU topology when pinned. In this case we are using "sockets: 4, cores: 26, threads: 2". When pinning (i.e. isolateEmulatorThread, dedicatedCpuPlacement, numa guestMappingPassthrough) we see that the VM yaml, xml, and qemu cmdline all show cores=26 threads=2, however the guest reports:

Thread(s) per core:  1
Core(s) per socket:  52
Socket(s):           4

If we use the same definition but remove all pinning features, the topology is as expected:

Thread(s) per core:  2
Core(s) per socket:  26
Socket(s):           4

Version-Release number of selected component (if applicable):
Client Version: 4.7.2
Server Version: 4.7.2
Kubernetes Version: v1.20.0+5fbfd19
worker kernel: 4.18.0-240.15.1.el8_3.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a guest with pinning (isolateEmulatorThread, dedicatedCpuPlacement, numa guestMappingPassthrough) and specify the CPU topology as CPU_SOCKETS="4" CPU_CORES="26" CPU_THREADS="2"
2. Verify the CPU topology in the guest with lscpu - incorrect topology (core and thread count)
3. Remove the pinning, check with lscpu - correct topology

Actual results:
Incorrect CPU topology when pinning

Expected results:
Correct CPU topology

Additional info:
David, could you share the domain xml and the yaml you were using?
Note we are reserving 1 core per NUMA node on this worker:

  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    reservedSystemCPUs: "0,112,1,113,2,114,3,115"
Here is an xml snippet I saved the last time Dave booted the pinned guest that reports 1 thread:

VM yaml snippet:

  cores: 26
  dedicatedCpuPlacement: true
  features:
  - name: invtsc
    policy: require
  isolateEmulatorThread: true
  model: host-passthrough
  numa:
    guestMappingPassthrough: {}
  sockets: 4
  threads: 2

virt-launcher pod xml itself:

  <topology sockets='4' dies='1' cores='26' threads='2'/>
From virt-launcher, full XML when the pinned guest was running:
http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/CNV/debug/BZ1987329/pinned-guest.xml

virsh capabilities for the host node:
http://perf1.perf.lab.eng.bos.redhat.com/pub/jhopper/CNV/debug/BZ1987329/virsh_cap.xml

Host topo:
CPU(s):              224
On-line CPU(s) list: 0-223
Thread(s) per core:  2
Core(s) per socket:  28
Socket(s):           4
NUMA node(s):        4
Created attachment 1807948 [details] yaml used including pinning
(In reply to Roman Mohr from comment #1)
> David, could you share the domain xml and the yaml you were using?

Hi Roman, Jenifer already provided the xml; the yaml is provided as an attachment.
Daniel, maybe you can take a look. https://bugzilla.redhat.com/show_bug.cgi?id=1987329#c4 has the full libvirt XML and virsh capabilities from the host.

The topology in the XML is set to:

  <cpu mode="host-passthrough" check="none" migratable="on">
    <topology sockets="4" dies="1" cores="26" threads="2"/>

When the following physical CPUs are masked from the virt-launcher container:

  "0,112,1,113,2,114,3,115"

the topology in the guest (rhel8) shows up as:

Thread(s) per core:  1
Core(s) per socket:  52
Socket(s):           4
(In reply to djdumas from comment #6)
> (In reply to Roman Mohr from comment #1)
> > David, could you share the domain xml and the yaml you were using?
>
> Hi Roman, Jenifer already provided the xml; the yaml is provided as an attachment.

The XML config provided only shows the VM where pinning is used. The original description indicates the behaviour of the guest is different from a non-pinned scenario. We thus need to also see the XML config for the non-pinned case that is being compared with.

It is also desirable to see the /var/log/libvirt/qemu/$GUEST.log file for the pinned and non-pinned guests.

Finally, can you confirm that the *exact* same physical host is used for the 2 VMs being compared, and what is the guest OS in question?
(In reply to Daniel Berrangé from comment #8)
> The XML config provided only shows the VM where pinning is used. The
> original description indicates the behaviour of the guest is different from
> a non-pinned scenario. We thus need to also see the XML config for the
> non-pinned case that is being compared with.
>
> It is also desirable to see the /var/log/libvirt/qemu/$GUEST.log file for
> the pinned and non-pinned guests.

See the attachments above for the no-pinning xml, and the pinning and no-pinning logs.

> Finally, can you confirm that the *exact* same physical host is used for the
> 2 VMs being compared, and what is the guest OS in question?

Yes, these are from the same physical host. The guest OS is RHEL, kernel 4.18.0-193.56.1.el8_2.
There's no configuration in QEMU that would explain this difference. Can you report the "hwloc-ls" and /proc/cpuinfo contents from the guest for both the pinned and unpinned cases?
Ok, so based on this guest info, and especially the error message from hwloc, I'm thinking there must be a bug in the way QEMU is exposing CPU topology information. I struggle to understand how CPU pinning can trigger such a bug, but it nonetheless seems to exist.

The libvirt log file shows a mixture of fairly old software versions. I think the next step is to reproduce with the latest released RHEL 8.4 kernel instead of the outdated 8.3 kernel, along with the RHEL-8.4 qemu-kvm package from the RHEL-AV repos instead of the Fedora qemu build.

If the pure RHEL-8.4 virt host shows the error, then assign the bug to the RHEL-AV product + qemu-kvm component for investigation by the QEMU maintainers.
Does this still happen w/out the numa passthrough?

  numa:
    guestMappingPassthrough: {}

I think I see a potential issue in the threads=2 case:

Due to kubelet reservedCpus (0,112,1,113,2,114,3,115) and other pods running on other cpus, this is the cpuset the numa-pinned VM's pod gets:

  4-102,104-106,108-110,116-210,212-214,216-218,220-222

which gives us from the host:
N0: 54 cpus
N1: 54 cpus
N2: 54 cpus
N3: 49 cpus

guest definition:

  <cell id='0' cpus='0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,99,102,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189,193,197,200,203,206' memory='728760320' unit='KiB'/>
  <cell id='1' cpus='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,100,103,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190,194,198,201,204,207' memory='728760320' unit='KiB'/>
  <cell id='2' cpus='2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,101,104,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191,195,199,202,205' memory='728760320' unit='KiB'/>
  <cell id='3' cpus='3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196' memory='728760320' unit='KiB'/>

cell0 is 54 cpus
cell1 is 54 cpus
cell2 is 53 cpus -- odd number
cell3 is 47 cpus -- odd number

Also, the guest cpu/cell alignment looks strange after cpu 98:

guest cells:
0: 92 96 99 102
1: 93 97 100 103
2: 94 98 101 104
3: 95 108 112 116

which maps to these host cpus:
0: 96 100 104 108
1: 97 101 105 109
2: 98 102 106 110
3: 99 119 123 127

It seems like it's getting out of line once it reaches the first single-cpu gap in the pod cpuset at host cpu 103?
> which maps to these host cpus:
> 0: 96 100 104 108
> 1: 97 101 105 109
> 2: 98 102 106 110
> 3: 99 119 123 127

Actually now I see 119 is just the next cpu in N3 available in the cpuset, but I'm still wondering if the odd number of cpus is a problem?
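For anyone who wants to repeat the per-node accounting above, here is a minimal Go sketch (illustrative only, not KubeVirt code) that expands a Linux cpuset string and counts CPUs per host NUMA node. The node mapping it uses (node = cpu % 4, hyperthread sibling at +112) is an assumption read off the reservedSystemCPUs list; the authoritative mapping is the virsh capabilities output linked above.

// cpuset_count.go - expand a Linux cpuset string and count CPUs per host NUMA node.
// Assumption (not taken from virsh capabilities): host NUMA node = cpu % 4,
// which is what the reservedSystemCPUs list "0,112,1,113,2,114,3,115" suggests.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandCPUSet turns "4-102,104-106,..." into a slice of CPU numbers.
func expandCPUSet(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(part, "-", 2)
		lo, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, err
		}
		hi := lo
		if len(bounds) == 2 {
			if hi, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, err
			}
		}
		for c := lo; c <= hi; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	const podCPUSet = "4-102,104-106,108-110,116-210,212-214,216-218,220-222"
	cpus, err := expandCPUSet(podCPUSet)
	if err != nil {
		panic(err)
	}
	perNode := map[int]int{}
	for _, c := range cpus {
		perNode[c%4]++ // assumed node numbering, see comment above
	}
	fmt.Printf("total cpus in cpuset: %d\n", len(cpus))
	for node := 0; node < 4; node++ {
		fmt.Printf("N%d: %d cpus\n", node, perNode[node])
	}
}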
(In reply to Jenifer Abrams from comment #21)
> Does this still happen w/out the numa passthrough?
>
>   numa:
>     guestMappingPassthrough: {}

When numa passthrough is removed, the sockets/cores/threads information is correct.

[root@hanavirt52 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              208
On-line CPU(s) list: 0-207
Thread(s) per core:  2
Core(s) per socket:  26
Socket(s):           4
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8280M CPU @ 2.70GHz
Stepping:            7
CPU MHz:             2693.670
BogoMIPS:            5387.34
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-207
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
(In reply to djdumas from comment #24)
> (In reply to Jenifer Abrams from comment #21)
> > Does this still happen w/out the numa passthrough?
> >
> >   numa:
> >     guestMappingPassthrough: {}
>
> When numa passthrough is removed, the sockets/cores/threads information is
> correct.

I will try to take a closer look at the logs, but if changing the VM NUMA topology changes behavior, it probably means you are creating a VM configuration where a CPU core is split between two different NUMA nodes (and this makes some guest software treat them as two distinct cores). I remember seeing 'lscpu', specifically, getting very confused when the NUMA topology didn't match the CPU core topology.
(In reply to Jenifer Abrams from comment #21)
> guest definition:
>
>   <cell id='0' cpus='0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,99,102,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189,193,197,200,203,206' memory='728760320' unit='KiB'/>
>   <cell id='1' cpus='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,100,103,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190,194,198,201,204,207' memory='728760320' unit='KiB'/>
>   <cell id='2' cpus='2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,101,104,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191,195,199,202,205' memory='728760320' unit='KiB'/>
>   <cell id='3' cpus='3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196' memory='728760320' unit='KiB'/>

If this is a cores=26,threads=2 guest, you are splitting every single CPU core into two separate NUMA nodes.

CPUs 0-1 are in the same core but different NUMA nodes, which makes the guest assume they are actually two separate cores with 1 thread each. The same goes for CPUs 2-3, 4-5, 6-7, etc. This may cause all kinds of confusion and inconsistencies in guest software because it never happens in real hardware.
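To make the failure mode concrete, here is a small Go sketch (illustrative only, not QEMU or KubeVirt code) that flags vCPU sibling pairs of a threads=2 guest that land in different NUMA cells, which is exactly the inconsistency described above. The two cell layouts in main() are hypothetical toy examples.

// sibling_check.go - detect whether hyperthread siblings of a threads=2 guest
// are split across NUMA cells.
package main

import "fmt"

// cellOf maps each vCPU id to the NUMA cell it was placed in.
func cellOf(cells [][]int) map[int]int {
	m := map[int]int{}
	for cell, vcpus := range cells {
		for _, v := range vcpus {
			m[v] = cell
		}
	}
	return m
}

// splitSiblings returns the vCPU pairs (2i, 2i+1) that ended up in different cells.
// With threads=2, vCPUs 2i and 2i+1 are siblings of the same guest core.
func splitSiblings(cells [][]int) [][2]int {
	m := cellOf(cells)
	total := 0
	for _, c := range cells {
		total += len(c)
	}
	var bad [][2]int
	for v := 0; v+1 < total; v += 2 {
		if m[v] != m[v+1] {
			bad = append(bad, [2]int{v, v + 1})
		}
	}
	return bad
}

func main() {
	// Hypothetical layout mirroring the buggy pattern: cells filled round-robin,
	// so siblings (0,1), (2,3), ... land in different cells.
	buggy := [][]int{{0, 4}, {1, 5}, {2, 6}, {3, 7}}
	// Hypothetical layout keeping each sibling pair inside one cell.
	sane := [][]int{{0, 1, 2, 3}, {4, 5, 6, 7}}

	fmt.Println("buggy layout, split pairs:", splitSiblings(buggy))
	fmt.Println("sane layout, split pairs: ", splitSiblings(sane))
}

Running it prints the four split pairs for the round-robin layout and none for the layout that keeps siblings together.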
(In reply to Eduardo Habkost from comment #26)
> (In reply to Jenifer Abrams from comment #21)
> If this is a cores=26,threads=2 guest, you are splitting every single CPU
> core into two separate NUMA nodes.
>
> CPUs 0-1 are in the same core but different NUMA nodes, which makes the
> guest assume they are actually two separate cores with 1 thread each. The
> same goes for CPUs 2-3, 4-5, 6-7, etc. This may cause all kinds of confusion and
> inconsistencies in guest software because it never happens in real hardware.

Regardless of whether this config is sensible or not though, there still looks to be a QEMU bug here. Merely pinning a guest CPU to a specific host CPU should not affect what topology is exposed to the guest, as that's merely a host-side tuning knob, not guest ABI. It feels like there is something here not getting virtualized correctly, accidentally exposing some aspect of the host CPU that varies depending on which host CPU we happen to be placed on.
(In reply to Daniel Berrangé from comment #27)
> (In reply to Eduardo Habkost from comment #26)
> > (In reply to Jenifer Abrams from comment #21)
> > If this is a cores=26,threads=2 guest, you are splitting every single CPU
> > core into two separate NUMA nodes.
> >
> > CPUs 0-1 are in the same core but different NUMA nodes, which makes the
> > guest assume they are actually two separate cores with 1 thread each. The
> > same goes for CPUs 2-3, 4-5, 6-7, etc. This may cause all kinds of confusion and
> > inconsistencies in guest software because it never happens in real hardware.
>
> Regardless of whether this config is sensible or not though, there still
> looks to be a QEMU bug here. Merely pinning a guest CPU to a specific host
> CPU should not affect what topology is exposed to the guest, as
> that's merely a host-side tuning knob, not guest ABI. It feels like there is
> something here not getting virtualized correctly, accidentally exposing
> some aspect of the host CPU that varies depending on which host CPU we
> happen to be placed on.

I would agree if merely changing the CPU pinning configuration were affecting the guest topology, but that's not what I saw in the log files at comment #10 and comment #11.

default_sapvm1_pinning.log has:

-numa node,nodeid=0,cpus=0,cpus=4,cpus=8,cpus=12,cpus=16,cpus=20,cpus=24,cpus=28,cpus=32,cpus=36,cpus=40,cpus=44,cpus=48,cpus=52,cpus=56,cpus=60,cpus=64,cpus=68,cpus=72,cpus=76,cpus=80,cpus=84,cpus=88,cpus=92,cpus=96,cpus=99,cpus=102,cpus=105,cpus=109,cpus=113,cpus=117,cpus=121,cpus=125,cpus=129,cpus=133,cpus=137,cpus=141,cpus=145,cpus=149,cpus=153,cpus=157,cpus=161,cpus=165,cpus=169,cpus=173,cpus=177,cpus=181,cpus=185,cpus=189,cpus=193,cpus=197,cpus=200,cpus=203,cpus=206,memdev=ram-node0 \
-object memory-backend-memfd,id=ram-node1,hugetlb=yes,hugetlbsize=1073741824,prealloc=yes,size=746250567680,host-nodes=1,policy=bind \
-numa node,nodeid=1,cpus=1,cpus=5,cpus=9,cpus=13,cpus=17,cpus=21,cpus=25,cpus=29,cpus=33,cpus=37,cpus=41,cpus=45,cpus=49,cpus=53,cpus=57,cpus=61,cpus=65,cpus=69,cpus=73,cpus=77,cpus=81,cpus=85,cpus=89,cpus=93,cpus=97,cpus=100,cpus=103,cpus=106,cpus=110,cpus=114,cpus=118,cpus=122,cpus=126,cpus=130,cpus=134,cpus=138,cpus=142,cpus=146,cpus=150,cpus=154,cpus=158,cpus=162,cpus=166,cpus=170,cpus=174,cpus=178,cpus=182,cpus=186,cpus=190,cpus=194,cpus=198,cpus=201,cpus=204,cpus=207,memdev=ram-node1 \
-object memory-backend-memfd,id=ram-node2,hugetlb=yes,hugetlbsize=1073741824,prealloc=yes,size=746250567680,host-nodes=2,policy=bind \
-numa node,nodeid=2,cpus=2,cpus=6,cpus=10,cpus=14,cpus=18,cpus=22,cpus=26,cpus=30,cpus=34,cpus=38,cpus=42,cpus=46,cpus=50,cpus=54,cpus=58,cpus=62,cpus=66,cpus=70,cpus=74,cpus=78,cpus=82,cpus=86,cpus=90,cpus=94,cpus=98,cpus=101,cpus=104,cpus=107,cpus=111,cpus=115,cpus=119,cpus=123,cpus=127,cpus=131,cpus=135,cpus=139,cpus=143,cpus=147,cpus=151,cpus=155,cpus=159,cpus=163,cpus=167,cpus=171,cpus=175,cpus=179,cpus=183,cpus=187,cpus=191,cpus=195,cpus=199,cpus=202,cpus=205,memdev=ram-node2 \
-object memory-backend-memfd,id=ram-node3,hugetlb=yes,hugetlbsize=1073741824,prealloc=yes,size=746250567680,host-nodes=3,policy=bind \
-numa node,nodeid=3,cpus=3,cpus=7,cpus=11,cpus=15,cpus=19,cpus=23,cpus=27,cpus=31,cpus=35,cpus=39,cpus=43,cpus=47,cpus=51,cpus=55,cpus=59,cpus=63,cpus=67,cpus=71,cpus=75,cpus=79,cpus=83,cpus=87,cpus=91,cpus=95,cpus=108,cpus=112,cpus=116,cpus=120,cpus=124,cpus=128,cpus=132,cpus=136,cpus=140,cpus=144,cpus=148,cpus=152,cpus=156,cpus=160,cpus=164,cpus=168,cpus=172,cpus=176,cpus=180,cpus=184,cpus=188,cpus=192,cpus=196,memdev=ram-node3 \
[...]

default_sapvm1_nopinning.log has:

-numa node,nodeid=0,cpus=0-207,memdev=ram-node0 \
[...]
(In reply to Eduardo Habkost from comment #28)
> (In reply to Daniel Berrangé from comment #27)
> > (In reply to Eduardo Habkost from comment #26)
> > Regardless of whether this config is sensible or not though, there still
> > looks to be a QEMU bug here. Merely pinning a guest CPU to a specific host
> > CPU should not affect what topology is exposed to the guest, as
> > that's merely a host-side tuning knob, not guest ABI. It feels like there is
> > something here not getting virtualized correctly, accidentally exposing
> > some aspect of the host CPU that varies depending on which host CPU we
> > happen to be placed on.
>
> I would agree if merely changing the CPU pinning configuration were affecting the
> guest topology, but that's not what I saw in the log files at comment #10
> and comment #11.

Oh, my bad, I misinterpreted the differences.
I'm attaching a spreadsheet that may help show a CPU configuration problem that is more than just the use of threads.

The spreadsheet covers the 3 types of configurations we've tried: 214s/1c/1t, 4s/53c/1t, and 4s/26c/2t. The NUMA definition from the xml is only correct in the 214s/1c/1t case.

If you scroll down the vcpupin column for the 214s case you'll see where the cpuset numbers are no longer sequential (highlighted in yellow). That's not a problem - look over to the next column, the numa definition from the xml, and you'll see that there is no corresponding gap in the guest lscpu information. All good, the guest lscpu matches the host lscpu up to 214 vCPUs.

Do the same exercise on the 4s-53c-1t tab (scroll down the vcpupin column until cpuset 110/116, highlighted in yellow). It looks like this gap is being carried over to the numa definition, which results in incorrect numa definitions from cpuset 110 and on.
(In reply to djdumas from comment #30)
> I'm attaching a spreadsheet that may help show a CPU configuration problem
> that is more than just the use of threads.
>
> The spreadsheet covers the 3 types of configurations we've tried:
> 214s/1c/1t, 4s/53c/1t, and 4s/26c/2t. The NUMA definition from the xml is
> only correct in the 214s/1c/1t case.
>
> If you scroll down the vcpupin column for the 214s case you'll see where the cpuset
> numbers are no longer sequential (highlighted in yellow). That's not a
> problem - look over to the next column, the numa definition from the xml,
> and you'll see that there is no corresponding gap in the guest lscpu
> information. All good, the guest lscpu matches the host lscpu up to 214
> vCPUs.
>
> Do the same exercise on the 4s-53c-1t tab (scroll down the vcpupin column until
> cpuset 110/116, highlighted in yellow). It looks like this gap is being carried
> over to the numa definition, which results in incorrect numa definitions from
> cpuset 110 and on.

If the "numa definition from xml (cpuset)" section on 4s-53c-1t and 4s-26c-2t reflects the actual <cell cpus='...'> values, the domain XML really seems wrong. Are VCPUs 0-3 completely missing from the <cell> elements in the XML?
(In reply to Eduardo Habkost from comment #32)
> If the "numa definition from xml (cpuset)" section on 4s-53c-1t and
> 4s-26c-2t reflects the actual <cell cpus='...'> values, the domain XML really
> seems wrong. Are VCPUs 0-3 completely missing from the <cell> elements in
> the XML?

No, vcpus 0-3 are there - sorry, I omitted them somehow, but the rest is correct. I'll replace the attachment.

The following is from the 4s/53c/1t xml:

  <numa>
    <cell id='0' cpus='0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191,195,199,203,207,210' memory='728760320' unit='KiB'/>
    <cell id='1' cpus='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,208,211' memory='728760320' unit='KiB'/>
    <cell id='2' cpus='2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189,193,197,201,205,209' memory='728760320' unit='KiB'/>
    <cell id='3' cpus='3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190,194,198,202,206' memory='728760320' unit='KiB'/>
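As a mechanical way to answer that kind of question, here is a tiny Go sketch (illustrative only) that reports which vCPU ids in 0..N-1 are missing from, or duplicated across, a set of <cell> cpu lists. The sample cells in main() are hypothetical.

// cell_coverage.go - check that every vCPU 0..n-1 appears in exactly one NUMA cell.
package main

import "fmt"

func coverage(cells [][]int, n int) (missing, duplicated []int) {
	seen := map[int]int{}
	for _, cell := range cells {
		for _, v := range cell {
			seen[v]++
		}
	}
	for v := 0; v < n; v++ {
		switch {
		case seen[v] == 0:
			missing = append(missing, v)
		case seen[v] > 1:
			duplicated = append(duplicated, v)
		}
	}
	return missing, duplicated
}

func main() {
	// Hypothetical 8-vCPU layout with vCPUs 0-3 left out of every cell.
	cells := [][]int{{4, 5}, {6, 7}}
	missing, duplicated := coverage(cells, 8)
	fmt.Println("missing:   ", missing)
	fmt.Println("duplicated:", duplicated)
}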
On the 2 different issues:

1) As Dave pointed out, changing the guest topology from 'sockets:214 cores:1 threads:1' to 'sockets:4 cores:53 threads:1' results in that strange numbering in the guest numa cells:

                 214s (214 total):   4s 53c (212 total):
  guest cell N0: 100 104 108         100 104 107
  guest cell N1: 101 105 109         101 105 108
  guest cell N2: 102 106 110         102 106 109
  guest cell N3: 103 107 111         103 110 114
  ========================================================
  host N0:       104 108 116         104 108 116
  host N1:       105 109 117         105 109 117
  host N2:       106 110 118         106 110 118
  host N3:       107 111 119         107 119 123

I am guessing the host numa pinning is correct for 4s53c if one of the 2 fewer cpus in the pod cpuset means host cpu 111 was not included, but why is vcpu 107 not mapped to host cpu 119? Maybe it doesn't actually matter if things are numa-aligned, but the vcpu ordering is a bit confusing.

2) As Eduardo said, currently the numa pinning logic won't work for threads:2 since it pins across numa nodes:

  <vcpupin vcpu='0' cpuset='4'/>
  <vcpupin vcpu='1' cpuset='5'/>
  <vcpupin vcpu='2' cpuset='6'/>
  <vcpupin vcpu='3' cpuset='7'/>

In past KVM testing w/ numa pinning, the guest [0,1][2,3] sibling pairs were pinned to physical siblings like so, and it did not start pinning across nodes -- it filled all of N0 first, then N1 and so on, ex:

  <vcpupin vcpu='0' cpuset='4'/>
  <vcpupin vcpu='1' cpuset='116'/>
  <vcpupin vcpu='2' cpuset='8'/>
  <vcpupin vcpu='3' cpuset='120'/>
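For reference, a rough Go sketch of the sibling-aware placement described in 2): fill one host NUMA node at a time and pin each guest pair (2i, 2i+1) to the two hyperthreads of a single host core. This is not the actual KubeVirt implementation; the host layout (node = cpu % 4, hyperthread sibling at cpu + 112, cpus 0-3/112-115 reserved) is an assumption taken from the reservedSystemCPUs list.

// sibling_pinning.go - sketch of sibling-aware vcpupin generation: fill one host
// NUMA node at a time and keep each guest hyperthread pair on one host core.
package main

import "fmt"

// hostCore is one physical core: its two hyperthread CPU ids.
type hostCore struct{ t0, t1 int }

// pinPairs assigns guest vCPU pairs (0,1), (2,3), ... to host cores, walking
// the nodes in order so a node is filled before the next one is touched.
func pinPairs(freeCoresPerNode [][]hostCore, vcpus int) []string {
	var pins []string
	v := 0
	for _, cores := range freeCoresPerNode {
		for _, c := range cores {
			if v >= vcpus {
				return pins
			}
			pins = append(pins,
				fmt.Sprintf("<vcpupin vcpu='%d' cpuset='%d'/>", v, c.t0),
				fmt.Sprintf("<vcpupin vcpu='%d' cpuset='%d'/>", v+1, c.t1))
			v += 2
		}
	}
	return pins
}

func main() {
	// Assumed host layout: node = cpu % 4, hyperthread sibling = cpu + 112
	// (inferred from reservedSystemCPUs "0,112,1,113,2,114,3,115"; the core
	// made of cpu <node> and cpu <node>+112 is reserved on each node).
	var nodes [][]hostCore
	for node := 0; node < 4; node++ {
		var cores []hostCore
		for c := node + 4; c < 112; c += 4 { // first unreserved core on each node is cpu 4+node
			cores = append(cores, hostCore{t0: c, t1: c + 112})
		}
		nodes = append(nodes, cores)
	}
	for _, p := range pinPairs(nodes, 8) { // first 8 vCPUs only, for illustration
		fmt.Println(p)
	}
}

The first four lines of output match the example <vcpupin> entries above; a real implementation would also have to restrict itself to the pod's actual cpuset rather than the idealized free-core list used here.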
(In reply to Eduardo Habkost from comment #26)
> (In reply to Jenifer Abrams from comment #21)
> > guest definition:
> >
> >   <cell id='0' cpus='0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,99,102,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189,193,197,200,203,206' memory='728760320' unit='KiB'/>
> >   <cell id='1' cpus='1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,100,103,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190,194,198,201,204,207' memory='728760320' unit='KiB'/>
> >   <cell id='2' cpus='2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,101,104,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191,195,199,202,205' memory='728760320' unit='KiB'/>
> >   <cell id='3' cpus='3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196' memory='728760320' unit='KiB'/>
>
> If this is a cores=26,threads=2 guest, you are splitting every single CPU
> core into two separate NUMA nodes.
>
> CPUs 0-1 are in the same core but different NUMA nodes, which makes the
> guest assume they are actually two separate cores with 1 thread each. The
> same goes for CPUs 2-3, 4-5, 6-7, etc. This may cause all kinds of confusion and
> inconsistencies in guest software because it never happens in real hardware.

If this is the issue, then this is a clear bug in my pinning logic. That is not my intent. Working on it. Thanks.
Merged on main: https://github.com/kubevirt/kubevirt/pull/6251
Backport is filed: https://github.com/kubevirt/kubevirt/pull/6392
After the fix, lscpu now shows the correct sockets/cores/threads information in the guest:

[root@hanavirt52 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              208
On-line CPU(s) list: 0-207
Thread(s) per core:  2
Core(s) per socket:  26
Socket(s):           4
NUMA node(s):        4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8280M CPU @ 2.70GHz
Stepping:            7
CPU MHz:             2693.670
BogoMIPS:            5387.34
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-53
NUMA node1 CPU(s):   54-107
NUMA node2 CPU(s):   108-161
NUMA node3 CPU(s):   162-207
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
NOTE: My BM setup had 64 CPUs per node.

With 2 sockets
---------------
[kbidarka@localhost ocs]$ oc get vmi vm-rhel84-ocs-numa -o yaml | grep -A 10 "cpu:"
--
  cpu:
    cores: 15
    dedicatedCpuPlacement: true
    isolateEmulatorThread: true
    model: host-passthrough
    numa:
      guestMappingPassthrough: {}
    sockets: 2
    threads: 2

Guest login:
------------------
Last login: Tue Nov  9 06:01:56 on ttyS0
[cloud-user@vm-rhel84-ocs-numa ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              60
On-line CPU(s) list: 0-59
Thread(s) per core:  2
Core(s) per socket:  15
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2095.076
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-27
NUMA node1 CPU(s):   28-59
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
[cloud-user@vm-rhel84-ocs-numa ~]$

---------------------------------------------

With 4 sockets
-----------------
[kbidarka@localhost ocs]$ oc get vmi vm-rhel84-ocs-numa -o yaml | grep -A 10 "cpu:"
--
  cpu:
    cores: 7
    dedicatedCpuPlacement: true
    isolateEmulatorThread: true
    model: host-passthrough
    numa:
      guestMappingPassthrough: {}
    sockets: 4
    threads: 2

Guest login:
----------------
[cloud-user@vm-rhel84-ocs-numa ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  2
Core(s) per socket:  7
Socket(s):           4
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2095.076
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-55
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
[cloud-user@vm-rhel84-ocs-numa ~]$
[kbidarka@localhost ocs]$

Summary:
1) The bug indeed appears to be fixed, as we see the correct socket, thread, and core count when using CPU-pinning features.
2) NOTE: 'reservedSystemCPUs: "0,112,1,113,2,114,3,115"' was not set in the KubeletConfig "cpumanager-enabled". Hope this is ok here?
VERIFIED with virt-operator-container-v4.9.1-4
> 2) NOTE: 'reservedSystemCPUs: "0,112,1,113,2,114,3,115"' was not set in the KubeletConfig "cpumanager-enabled". Hope this is ok here?

Yes, that should be fine. We would have seen the wrong topology either way.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 4.9.1 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:5091