Bug 2109255 - CPU Topology is not correct when using dedicatedCpuPlacement
Summary: CPU Topology is not correct when using dedicatedCpuPlacement
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.10.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Jed Lejosne
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-20 18:57 UTC by Nils Koenig
Modified: 2023-09-18 04:42 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-05 19:08:26 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1987329 0 high CLOSED Guest reports incorrect CPU topology when pinned and specifying CPU_THREADS 2022-07-26 10:31:55 UTC
Red Hat Issue Tracker CNV-19957 0 None None None 2022-09-06 07:44:20 UTC

Internal Links: 2113895

Description Nils Koenig 2022-07-20 18:57:30 UTC
I see wrong mappings when using cpumanager with the following configurations.

E.g. in VM config

          cores: 10
          sockets: 2
          threads: 2
          dedicatedCpuPlacement: true
          features:
          - name: invtsc
            policy: require
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}

gives

[cloud-user@hana-test-108 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             2194.710
BogoMIPS:            4389.42
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-5
NUMA node1 CPU(s):   6-39


and

          cores: 10
          sockets: 4
          threads: 2
          dedicatedCpuPlacement: true
          features:
          - name: invtsc
            policy: require
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}

gives

[cloud-user@hana-test4 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           4
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             2194.710
BogoMIPS:            4389.42
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-35
NUMA node1 CPU(s):   36-79

I have enabled cpumanager as described here:
https://docs.openshift.com/container-platform/4.10/scalability_and_performance/using-cpu-manager.html
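
As a quick sanity check that the static policy really took effect (a sketch only; it assumes the node was labeled cpumanager=true as in that procedure, and <node-name> is a placeholder):

$ oc get nodes -l cpumanager=true
$ oc debug node/<node-name> -- chroot /host cat /etc/kubernetes/kubelet.conf | grep cpuManager   # expect cpuManagerPolicy: static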

Note that NUMA node0 CPU(s) is wrong in both cases; in the latter, the NUMA node(s) count is also wrong. This seems to be a bug. Please advise.

Comment 2 Fabian Deutsch 2022-07-25 08:25:01 UTC
To me the question is how

          numa:
            guestMappingPassthrough: {}

and

          cores: 10
          sockets: 2
          threads: 2

play together.
To me they are at least partially conflicting.

Comment 3 Fabian Deutsch 2022-07-25 08:52:47 UTC
Nils, please explain what you are expecting to happen.

Comment 4 Nils Koenig 2022-07-25 12:49:40 UTC
This is the CPU layout of the bare metal host

sh-4.4# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              176
On-line CPU(s) list: 0-175
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           4
NUMA node(s):        4
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
BIOS Model name:     Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             1198.744
BogoMIPS:            4389.71
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            56320K
NUMA node0 CPU(s):   0-21,88-109
NUMA node1 CPU(s):   22-43,110-131
NUMA node2 CPU(s):   44-65,132-153
NUMA node3 CPU(s):   66-87,154-175
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

Comment 5 Nils Koenig 2022-07-25 12:52:45 UTC
To clarify here, when using this configuration

          cores: 10
          sockets: 4
          threads: 2
          dedicatedCpuPlacement: true
          features:
          - name: invtsc
            policy: require
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}

I would expect to see 4 NUMA nodes in the guest, not 2 as below:

[cloud-user@hana-test4 ~]$ lscpu
...
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           4
NUMA node(s):        2            <------------------------ This is wrong, should be 4 not 2
...
NUMA node0 CPU(s):   0-35         <------------------------ This CPU numbering is asymmetrical and makes no sense, 35 cores here
NUMA node1 CPU(s):   36-79        <------------------------ 44 cores here

Comment 6 Nils Koenig 2022-07-26 04:21:09 UTC
This is the documentation PR for the feature used: https://github.com/kubevirt/user-guide/pull/457/files

Comment 8 Nils Koenig 2022-07-26 04:43:47 UTC
Reading up on what Roman wrote in the documentation:

* Guests may see different NUMA topologies when being rescheduled.
* The resulting NUMA topology may be asymmetrical.

I wonder if it's not a bug (but a feature :p).

Comment 9 Fabian Deutsch 2022-07-26 08:32:15 UTC
… or rather a side effect of the current constraints: CPUManager will give an arbitrary set of cores to libvirt, and libvirt will then reflect their physical topology to the VM. That's why the two things mentioned in the previous comment can happen.
However, I think it is important to clarify how the passthrough relates to cores/sockets/threads - whether both APIs can be used at the same time or not. This should then be documented.
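
A rough way to see this in practice (a sketch only; pod and container names are examples) is to check, from the virt-launcher compute container, which CPUs CPUManager granted and which host NUMA node each of them belongs to:

$ oc exec <virt-launcher-pod> -c compute -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
$ oc exec <virt-launcher-pod> -c compute -- lscpu -p=CPU,NODE

The second command prints one cpu,node pair per line, so the granted set can be grouped by physical node; that grouping is what guestMappingPassthrough reproduces in the guest.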

Comment 10 Nils Koenig 2022-07-26 10:16:26 UTC
Looking at the libvirt xml for the following configuration

    spec:
      domain:
        cpu:
          cores: 10
          dedicatedCpuPlacement: true
          features:
            - name: invtsc
              policy: require
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
          sockets: 2
          threads: 2   


shows that there is no cell id='1' defined

  <cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='2' dies='1' cores='10' threads='2'/>
    <feature policy='require' name='invtsc'/>
    <numa>
      <cell id='0' cpus='0-39' memory='67108864' unit='KiB'/>
    </numa>
  </cpu>

hence lscpu also sees only one NUMA node:

[cloud-user@hana-test-107 ~]$ lscpu
...
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        1
...
NUMA node0 CPU(s):   0-39


Two settings (guestMappingPassthrough, dedicatedCpuPlacement) are needed, but I am not sure if there are use cases where you would only need one of them.
In my use case, every socket shall be on a dedicated NUMA node, and IMHO one option to set this behavior would be sufficient, but please correct me if I am overlooking something here.

Comment 11 Nils Koenig 2022-07-26 10:31:55 UTC
FYI, there was a similar bug 

https://bugzilla.redhat.com/show_bug.cgi?id=1987329

reported, which should be fixed by now:

https://github.com/kubevirt/kubevirt/pull/6251

Comment 12 Nils Koenig 2022-07-27 05:06:01 UTC
@kbidarka @gkapoor 
Could you please provide the lscpu output of the guest where QE tests the SAP HANA template?

Comment 14 Nils Koenig 2022-08-01 07:52:53 UTC
Here is the output I've received from QE. Notice the core imbalance in the NUMA case.

regular vm
[cloud-user@sap-hana-vm-1658911529-292766 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238L CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2095.078
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
 
==========================
 
with Numa
 
[cloud-user@sap-hana-vm-1658915965-8135753 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              46
On-line CPU(s) list: 0-45
Thread(s) per core:  1
Core(s) per socket:  46
Socket(s):           1
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238L CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2095.078
BogoMIPS:            4190.15
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-42
NUMA node1 CPU(s):   43-45
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities

Comment 15 Jed Lejosne 2022-08-05 13:37:31 UTC
(In reply to Nils Koenig from comment #5)
> To clarify here, when using this configuration
> 
>           cores: 10
>           sockets: 4
>           threads: 2
>           dedicatedCpuPlacement: true
>           features:
>           - name: invtsc
>             policy: require
>           isolateEmulatorThread: true
>           model: host-passthrough
>           numa:
>             guestMappingPassthrough: {}
> 
> I would expect to see 4 NUMA nodes in the guest, not 2 as below:

Why do you expect to see 4 NUMA nodes in the guest?
It looks like the 80 threads you requested all came from only 2 different nodes.

> 
> [cloud-user@hana-test4 ~]$ lscpu
> ...
> CPU(s):              80
> On-line CPU(s) list: 0-79
> Thread(s) per core:  2
> Core(s) per socket:  10
> Socket(s):           4
> NUMA node(s):        2            <------------------------ This is wrong,
> should be 4 not 2
> ...
> NUMA node0 CPU(s):   0-35         <------------------------ This CPU
> numbering is asymetrical and makes no sense, 35 cores here 

36 cores actually, counting #0

> NUMA node1 CPU(s):   36-79        <------------------------ 44 cores here

@nkoenig it's unclear to me exactly what you expected lscpu to look like.
From what I understand, all guestMappingPassthrough does is assign all threads(/cores) that belong to the same physical node to the same emulated node.
In the example above, 80 threads were requested, 36 came from one physical node and 44 came from a different one, so one emulated node was created for each group.
To illustrate your point, could you please rewrite the lscpu output above to reflect what you expected it to look like?
Thank you!

Comment 17 Jed Lejosne 2022-09-07 19:16:26 UTC
Sorry it took so long to get back to you on this, but I wanted to run my own tests to make sure everything worked as expected.

The issue here in my opinion is a lack of documentation as well as not-so-obvious API fields.

First, it's important to note that all KubeVirt is able to do on the host side is request X CPUs from CPUManager in Kubernetes (resources.requests.cpu in the container spec).
We have no way of requesting that the CPUs come from specific sockets or NUMA nodes or anything.
The number of CPUs we request is simply the result of the multiplication sockets*cores*threads.
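As an illustration (a sketch only; the pod name is a placeholder): the spec from comment 5 works out to 4*10*2 = 80 vCPUs, and the resulting CPU request shows up on the compute container of the virt-launcher pod:

$ oc get pod <virt-launcher-pod> -o jsonpath='{.spec.containers[?(@.name=="compute")].resources.requests.cpu}'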

However we can then take these CPUs and present them to the virtual machine however we want. Two sets of VMI options allow users to configure it:
- sockets/cores/threads specify the CPU topology that the guest will see. The assignment is random (as far as I know), and nothing guarantees that 2 cores that are on the same physical socket will be in the same virtual socket in the guest.
- guestMappingPassthrough instructs virt-launcher to create as many NUMA nodes as needed in the guest so that all the CPUs we got from CPUManager that were in the same NUMA node on the host will be in the same NUMA node in the guest.

I agree that this is quite weak, and that being able to request more specific things from the hosts would be desirable, but it's just not the case today.
I don't think there is an actual bug here. Please let me know if you disagree; otherwise I will close this issue in a few days.

The rest of this comment highlights the results of one of my tests:
- This is the NUMA topology of the host:
NUMA node0 CPU(s):               0-15
NUMA node1 CPU(s):               16-31
NUMA node2 CPU(s):               32-47
NUMA node3 CPU(s):               48-63
- I created a VMI with the following CPU spec:
      cores: 9
      sockets: 4
      threads: 1
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}
- In virt-launcher, I can see which CPUs I got:
bash-4.4# cat /sys/fs/cgroup/cpuset/cpuset.cpus
3-6,16-47
- So that's 4 CPUs from node0, all 16 CPUs from physical node1 and all 16 CPUs from physical node2
- In the guest XML, I can see the following mappings:
bash-4.4# virsh dumpxml default_vmi-fedora
[...]
  <cputune>
    <vcpupin vcpu='0' cpuset='3'/>
    <vcpupin vcpu='1' cpuset='4'/>
    <vcpupin vcpu='2' cpuset='5'/>
    <vcpupin vcpu='3' cpuset='6'/>
    <vcpupin vcpu='4' cpuset='16'/>
    <vcpupin vcpu='5' cpuset='17'/>
    <vcpupin vcpu='6' cpuset='18'/>
    <vcpupin vcpu='7' cpuset='19'/>
    <vcpupin vcpu='8' cpuset='20'/>
    <vcpupin vcpu='9' cpuset='21'/>
    <vcpupin vcpu='10' cpuset='22'/>
    <vcpupin vcpu='11' cpuset='23'/>
    <vcpupin vcpu='12' cpuset='24'/>
    <vcpupin vcpu='13' cpuset='25'/>
    <vcpupin vcpu='14' cpuset='26'/>
    <vcpupin vcpu='15' cpuset='27'/>
    <vcpupin vcpu='16' cpuset='28'/>
    <vcpupin vcpu='17' cpuset='29'/>
    <vcpupin vcpu='18' cpuset='30'/>
    <vcpupin vcpu='19' cpuset='31'/>
    <vcpupin vcpu='20' cpuset='32'/>
    <vcpupin vcpu='21' cpuset='33'/>
    <vcpupin vcpu='22' cpuset='34'/>
    <vcpupin vcpu='23' cpuset='35'/>
    <vcpupin vcpu='24' cpuset='36'/>
    <vcpupin vcpu='25' cpuset='37'/>
    <vcpupin vcpu='26' cpuset='38'/>
    <vcpupin vcpu='27' cpuset='39'/>
    <vcpupin vcpu='28' cpuset='40'/>
    <vcpupin vcpu='29' cpuset='41'/>
    <vcpupin vcpu='30' cpuset='42'/>
    <vcpupin vcpu='31' cpuset='43'/>
    <vcpupin vcpu='32' cpuset='44'/>
    <vcpupin vcpu='33' cpuset='45'/>
    <vcpupin vcpu='34' cpuset='46'/>
    <vcpupin vcpu='35' cpuset='47'/>
  </cputune>
[...]
    <numa>
      <cell id='0' cpus='0-3' memory='350208' unit='KiB'/>
      <cell id='1' cpus='4-19' memory='350208' unit='KiB'/>
      <cell id='2' cpus='20-35' memory='348160' unit='KiB'/>
    </numa>
[...]
- We can indeed see 3 NUMA nodes with 4, 16 and 16 CPUs respectively, with a virtual-to-physical mapping that preserves NUMA node membership
- Finally, inside the guest, lscpu shows that we indeed got both the CPU topology we requested and those 3 NUMA nodes:
[root@vmi-fedora ~]# lscpu
[...]
    Thread(s) per core:  1
    Core(s) per socket:  9
    Socket(s):           4
[...]
NUMA:                    
  NUMA node(s):          3
  NUMA node0 CPU(s):     0-3
  NUMA node1 CPU(s):     4-19
  NUMA node2 CPU(s):     20-35
[...]

Comment 19 Jed Lejosne 2022-10-05 19:08:26 UTC
Closing this, as it is technically not a bug.
This issue highlights the fact that NUMA support in CNV is minimal; it will hopefully improve in the future by leveraging Kubernetes platform enhancements.
In addition to this discussion, the KubeVirt documentation on NUMA is a great source of information:
https://kubevirt.io/user-guide/virtual_machines/numa/

Comment 20 Nils Koenig 2022-11-22 13:21:45 UTC
Fine with me.

Comment 21 Red Hat Bugzilla 2023-09-18 04:42:24 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

