I see wrong mappings using cpumanager and the following configuration. E.g. the VM config

  cores: 10
  sockets: 2
  threads: 2
  dedicatedCpuPlacement: true
  features:
  - name: invtsc
    policy: require
  isolateEmulatorThread: true
  model: host-passthrough
  numa:
    guestMappingPassthrough: {}

gives

[cloud-user@hana-test-108 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             2194.710
BogoMIPS:            4389.42
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-5
NUMA node1 CPU(s):   6-39

and

  cores: 10
  sockets: 4
  threads: 2
  dedicatedCpuPlacement: true
  features:
  - name: invtsc
    policy: require
  isolateEmulatorThread: true
  model: host-passthrough
  numa:
    guestMappingPassthrough: {}

gives

[cloud-user@hana-test4 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           4
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             2194.710
BogoMIPS:            4389.42
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-35
NUMA node1 CPU(s):   36-79

I have enabled cpumanager as described here: https://docs.openshift.com/container-platform/4.10/scalability_and_performance/using-cpu-manager.html

Note that NUMA node0 CPU(s) is wrong in both cases; in the latter the NUMA node(s) count is also wrong. This seems to be a bug. Please advise.
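As a first sanity check it can be worth confirming that the static CPU manager policy is really active on the node running these VMs. A minimal sketch (the node name is a placeholder, and the kubelet config path is my assumption based on the linked OpenShift doc, so verify against it):

  oc debug node/<worker-node> -- chroot /host grep cpuManager /etc/kubernetes/kubelet.conf
  # expected to show something like: cpuManagerPolicy: static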
The question here is how numa: guestMappingPassthrough: {} and cores: 10 / sockets: 2 / threads: 2 play together. To me they are at least partially conflicting.
Nils, please explain what you are expecting to happen.
This is the CPU layout of the bare metal host:

sh-4.4# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              176
On-line CPU(s) list: 0-175
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           4
NUMA node(s):        4
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
BIOS Model name:     Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             1198.744
BogoMIPS:            4389.71
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            56320K
NUMA node0 CPU(s):   0-21,88-109
NUMA node1 CPU(s):   22-43,110-131
NUMA node2 CPU(s):   44-65,132-153
NUMA node3 CPU(s):   66-87,154-175
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
To clarify here, when using this configuration

  cores: 10
  sockets: 4
  threads: 2
  dedicatedCpuPlacement: true
  features:
  - name: invtsc
    policy: require
  isolateEmulatorThread: true
  model: host-passthrough
  numa:
    guestMappingPassthrough: {}

I would expect to see 4 NUMA nodes in the guest, not 2 as below:

[cloud-user@hana-test4 ~]$ lscpu
...
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           4
NUMA node(s):        2     <------------------------ This is wrong, should be 4 not 2
...
NUMA node0 CPU(s):   0-35  <------------------------ This CPU numbering is asymmetrical and makes no sense, 35 cores here
NUMA node1 CPU(s):   36-79 <------------------------ 44 cores here
This is the docu PR for the feature used: https://github.com/kubevirt/user-guide/pull/457/files
Reading up on what Roman wrote in the documentation:

* Guests may see different NUMA topologies when being rescheduled.
* The resulting NUMA topology may be asymmetrical.

I wonder if it's not a bug (but a feature :p).
… or rather a side effect of the current constraints: CPUManager will give an arbitrary set of cores to libvirt, and libvirt will then reflect their physical topology to the VM. That's why the two things mentioned in the previous comment can happen. However, I think it is important to clarify how the passthrough relates to cores/sockets/threads - whether both APIs can be used at the same time or not. This should then be clarified in the documentation.
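For anyone who wants to verify this behavior on their own cluster, a minimal sketch of how to inspect it (pod and domain names are placeholders; the "compute" container name and the cgroup v1 cpuset path are assumptions based on a typical virt-launcher pod):

  # shell into the virt-launcher pod of the VMI
  oc exec -it virt-launcher-<vmi-name>-<hash> -c compute -- bash

  # host CPUs handed to this pod by CPUManager (cgroup v1 layout assumed)
  cat /sys/fs/cgroup/cpuset/cpuset.cpus

  # vCPU pinning and guest NUMA cells that libvirt derived from them
  virsh dumpxml <namespace>_<vmi-name>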
Looking at the libvirt XML for the following configuration

spec:
  domain:
    cpu:
      cores: 10
      dedicatedCpuPlacement: true
      features:
      - name: invtsc
        policy: require
      isolateEmulatorThread: true
      model: host-passthrough
      numa:
        guestMappingPassthrough: {}
      sockets: 2
      threads: 2

shows that there is no cell id='1' defined:

<cpu mode='host-passthrough' check='none' migratable='on'>
  <topology sockets='2' dies='1' cores='10' threads='2'/>
  <feature policy='require' name='invtsc'/>
  <numa>
    <cell id='0' cpus='0-39' memory='67108864' unit='KiB'/>
  </numa>
</cpu>

hence lscpu also sees only one NUMA node:

[cloud-user@hana-test-107 ~]$ lscpu
...
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        1
...
NUMA node0 CPU(s):   0-39

Both settings (guestMappingPassthrough, dedicatedCpuPlacement) are needed, but I am not sure whether there are use cases where you would only need one of them. In my use case every socket should be on a dedicated NUMA node, and IMHO a single option to set this behavior would be sufficient, but please correct me if I am overlooking something here.
FYI, a similar bug was reported (https://bugzilla.redhat.com/show_bug.cgi?id=1987329) and should be fixed by now (https://github.com/kubevirt/kubevirt/pull/6251).
@kbidarka @gkapoor Could you please provide lscpu output of the guest where QE tests the SAP HANA template?
Here is the output I've received from QE. Notice the core imbalance in the NUMA case.

regular vm:

[cloud-user@sap-hana-vm-1658911529-292766 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238L CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2095.078
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities

==========================

with NUMA:

[cloud-user@sap-hana-vm-1658915965-8135753 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              46
On-line CPU(s) list: 0-45
Thread(s) per core:  1
Core(s) per socket:  46
Socket(s):           1
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238L CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2095.078
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-42
NUMA node1 CPU(s):   43-45
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
(In reply to Nils Koenig from comment #5)
> To clarify here, when using this configuration
>
>   cores: 10
>   sockets: 4
>   threads: 2
>   dedicatedCpuPlacement: true
>   features:
>   - name: invtsc
>     policy: require
>   isolateEmulatorThread: true
>   model: host-passthrough
>   numa:
>     guestMappingPassthrough: {}
>
> I would expect to see 4 NUMA nodes in the guest, not 2 as below:

Why do you expect to see 4 NUMA nodes in the guest? It looks like the 80 threads you requested all came from only 2 different nodes.

> [cloud-user@hana-test4 ~]$ lscpu
> ...
> CPU(s):              80
> On-line CPU(s) list: 0-79
> Thread(s) per core:  2
> Core(s) per socket:  10
> Socket(s):           4
> NUMA node(s):        2     <------------------------ This is wrong, should be 4 not 2
> ...
> NUMA node0 CPU(s):   0-35  <------------------------ This CPU numbering is asymmetrical and makes no sense, 35 cores here

36 cores actually, counting #0

> NUMA node1 CPU(s):   36-79 <------------------------ 44 cores here

@nkoenig it's unclear to me exactly what you expected lscpu to look like. From what I understand, all guestMappingPassthrough does is assign all threads(/cores) that belong to the same physical node to the same emulated node. In the example above, 80 threads were requested, 36 came from one physical node and 44 came from a different one, so one emulated node was created for each group. To illustrate your point, could you please rewrite the lscpu output above to reflect what you expected it to look like? Thank you!
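For readers following along, a worked sketch of that grouping using only the numbers already in this report (the report does not show which host CPUs belong to which physical node, so the grouping itself is illustrative):

  host NUMA node A -> 36 of the 80 requested threads -> guest NUMA node0, vCPUs 0-35
  host NUMA node B -> 44 of the 80 requested threads -> guest NUMA node1, vCPUs 36-79
  total: 36 + 44 = 80 = sockets (4) x cores (10) x threads (2)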
Sorry it took so long to get back to you on this, but I wanted to run my own tests to make sure everything worked as expected.

The issue here, in my opinion, is a lack of documentation as well as not-so-obvious API fields.

First, it's important to note that all KubeVirt is able to do on the host side is request X CPUs from CPUManager in Kubernetes (resources/requests/cpu in the container spec). We have no way of requesting that the CPUs come from specific sockets or NUMA nodes or anything. The number of CPUs we request is simply the result of the multiplication sockets*cores*threads.

However, we can then take these CPUs and present them to the virtual machine however we want. Two sets of VMI options allow users to configure this:

- sockets/cores/threads specify the CPU topology that the guest will see. The assignment is random (as far as I know), and nothing guarantees that 2 cores that are on the same physical socket will be in the same virtual socket in the guest.
- guestMappingPassthrough instructs virt-launcher to create as many NUMA nodes as needed in the guest so that all the CPUs we got from CPUManager that were in the same NUMA node on the host will be in the same NUMA node in the guest.

I agree that this is quite weak, and that being able to request more specific things from the hosts would be desirable, but it's just not the case today.

I don't think there is an actual bug here. Please let me know if you disagree, otherwise I will close this issue in a few days.

The rest of this comment highlights the results of one of my tests:

- This is the NUMA topology of the host:

NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
NUMA node2 CPU(s):   32-47
NUMA node3 CPU(s):   48-63

- I created a VMI with the following CPU spec:

  cores: 9
  sockets: 4
  threads: 1
  dedicatedCpuPlacement: true
  numa:
    guestMappingPassthrough: {}

- In virt-launcher, I can see which CPUs I got:

bash-4.4# cat /sys/fs/cgroup/cpuset/cpuset.cpus
3-6,16-47

- So that's 4 CPUs from node0, all 16 CPUs from physical node1 and all 16 CPUs from physical node2

- In the guest XML, I can see the following mappings:

bash-4.4# virsh dumpxml default_vmi-fedora
[...]
  <cputune>
    <vcpupin vcpu='0' cpuset='3'/>
    <vcpupin vcpu='1' cpuset='4'/>
    <vcpupin vcpu='2' cpuset='5'/>
    <vcpupin vcpu='3' cpuset='6'/>
    <vcpupin vcpu='4' cpuset='16'/>
    <vcpupin vcpu='5' cpuset='17'/>
    <vcpupin vcpu='6' cpuset='18'/>
    <vcpupin vcpu='7' cpuset='19'/>
    <vcpupin vcpu='8' cpuset='20'/>
    <vcpupin vcpu='9' cpuset='21'/>
    <vcpupin vcpu='10' cpuset='22'/>
    <vcpupin vcpu='11' cpuset='23'/>
    <vcpupin vcpu='12' cpuset='24'/>
    <vcpupin vcpu='13' cpuset='25'/>
    <vcpupin vcpu='14' cpuset='26'/>
    <vcpupin vcpu='15' cpuset='27'/>
    <vcpupin vcpu='16' cpuset='28'/>
    <vcpupin vcpu='17' cpuset='29'/>
    <vcpupin vcpu='18' cpuset='30'/>
    <vcpupin vcpu='19' cpuset='31'/>
    <vcpupin vcpu='20' cpuset='32'/>
    <vcpupin vcpu='21' cpuset='33'/>
    <vcpupin vcpu='22' cpuset='34'/>
    <vcpupin vcpu='23' cpuset='35'/>
    <vcpupin vcpu='24' cpuset='36'/>
    <vcpupin vcpu='25' cpuset='37'/>
    <vcpupin vcpu='26' cpuset='38'/>
    <vcpupin vcpu='27' cpuset='39'/>
    <vcpupin vcpu='28' cpuset='40'/>
    <vcpupin vcpu='29' cpuset='41'/>
    <vcpupin vcpu='30' cpuset='42'/>
    <vcpupin vcpu='31' cpuset='43'/>
    <vcpupin vcpu='32' cpuset='44'/>
    <vcpupin vcpu='33' cpuset='45'/>
    <vcpupin vcpu='34' cpuset='46'/>
    <vcpupin vcpu='35' cpuset='47'/>
  </cputune>
[...]
  <numa>
    <cell id='0' cpus='0-3' memory='350208' unit='KiB'/>
    <cell id='1' cpus='4-19' memory='350208' unit='KiB'/>
    <cell id='2' cpus='20-35' memory='348160' unit='KiB'/>
  </numa>
[...]
- We can indeed see 3 NUMA nodes with 4, 16 and 16 CPUs respectively, with a virtual-to-physical mapping that preserves NUMA node belonging
- Finally, inside the guest, lscpu shows that we indeed got both the CPU topology we requested and those 3 NUMA nodes:

[root@vmi-fedora ~]# lscpu
[...]
Thread(s) per core:  1
Core(s) per socket:  9
Socket(s):           4
[...]
NUMA:
  NUMA node(s):      3
  NUMA node0 CPU(s): 0-3
  NUMA node1 CPU(s): 4-19
  NUMA node2 CPU(s): 20-35
[...]
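To tie this back to the sockets*cores*threads point above: for this 4 sockets x 9 cores x 1 thread test, 4*9*1 = 36 dedicated CPUs end up being requested from CPUManager. A minimal sketch of how that might look in the virt-launcher compute container (the memory values are placeholders, not copied from this cluster):

  resources:
    requests:
      cpu: "36"        # sockets * cores * threads
      memory: 1Gi      # placeholder
    limits:
      cpu: "36"        # requests == limits -> Guaranteed QoS, required for CPUManager pinning
      memory: 1Gi      # placeholder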
Closing this, as it is not technically a bug. This issue highlights the fact that NUMA support in CNV is minimal, and it will hopefully improve in the future by leveraging Kubernetes platform enhancements. In addition to this discussion, the KubeVirt documentation on NUMA is a great source of information: https://kubevirt.io/user-guide/virtual_machines/numa/
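For future readers, a minimal sketch of where the fields discussed in this thread sit in a VMI manifest (the name is illustrative; memory, disks and other required fields are omitted, so this is not a complete, deployable spec):

  apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    name: numa-example               # hypothetical name
  spec:
    domain:
      cpu:
        sockets: 4
        cores: 10
        threads: 2
        dedicatedCpuPlacement: true  # pins sockets*cores*threads host CPUs via CPUManager
        isolateEmulatorThread: true
        model: host-passthrough
        numa:
          guestMappingPassthrough: {}  # guest NUMA cells mirror the physical nodes of the pinned CPUs
        features:
        - name: invtsc
          policy: require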
Fine with me.