Bug 2113895

Summary: [RFE] Provide deterministic NUMA topology for Multi VM support
Product: Container Native Virtualization (CNV)
Reporter: Nils Koenig <nkoenig>
Component: Virtualization
Assignee: sgott
Status: CLOSED DEFERRED
QA Contact: Kedar Bidarkar <kbidarka>
Severity: high
Priority: unspecified
Version: 4.13.0
CC: cnv-qe-bugs, djdumas, fdeutsch, gkapoor, nkoenig, sgott
Target Milestone: ---
Target Release: 4.13.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-09-07 13:15:44 UTC
Type: Bug

Description Nils Koenig 2022-08-02 10:15:11 UTC
As of today, support for NUMA topology is somewhat rudimentary.
Roman did some of the basic implementation 
https://github.com/kubevirt/user-guide/pull/457/files

But this has nondeterministic behavior in the sense that KubeVirt cannot provide specific cores from specific NUMA nodes. This "might work" for the single-big-VM scenario, but it is not sufficient to support multiple VMs. We need to be able to reliably acquire specific cores from specific NUMA nodes.

What we need is a way to define how many NUMA nodes (and sockets), cores and threads a VM should have, with the allocated resources mapping 1:1 onto the underlying hardware, e.g. the virtual CPU cores of a vNUMA node come from the same physical NUMA node, and the same holds for the memory attached to it.

The sizes we should plan for are 1, 2, 4, ... NUMA nodes per VM, and if we can also plan for sub-NUMA-node entities (e.g. "half" a CPU socket), that would be great.

The current implementation is specified as follows:

    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2   
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
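
Note that guestMappingPassthrough is meant to be used together with dedicated CPU placement and hugepages-backed guest memory, so the cpu stanza above normally sits inside a full VirtualMachineInstance manifest. A minimal, illustrative sketch (the VM name, hugepage size and memory request are assumptions for illustration; disks and interfaces are omitted for brevity):

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstance
    metadata:
      name: numa-vm                    # illustrative name
    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
        memory:
          hugepages:
            pageSize: 1Gi              # hugepages-backed guest memory, used with guestMappingPassthrough
        resources:
          requests:
            memory: 64Gi               # illustrative size
        devices: {}                    # disks/interfaces omitted for brevity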


Either this is enhanced with a NUMA node count attribute, e.g.:

    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2   
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
            nodes: 4

or the NUMA node count is implicitly derived from the socket count when guestMappingPassthrough is set. I guess that's up for discussion, since there might be platforms where there is no 1:1 mapping between sockets and NUMA nodes.
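
To make the expectation concrete: with the topology above, 4 sockets x 10 cores x 2 threads = 80 vCPUs, so a nodes: 4 setting would mean 20 vCPUs and a quarter of the guest memory per vNUMA node, each backed entirely by a single pNUMA node. Purely as an illustration of the deterministic 1:1 placement this RFE asks for (not an existing API, and not a layout KubeVirt guarantees today):

    # hypothetical placement implied by "nodes: 4"
    # vNUMA 0: vCPUs  0-19 -> pinned to pCPUs of pNUMA node 0, memory from pNUMA node 0
    # vNUMA 1: vCPUs 20-39 -> pinned to pCPUs of pNUMA node 1, memory from pNUMA node 1
    # vNUMA 2: vCPUs 40-59 -> pinned to pCPUs of pNUMA node 2, memory from pNUMA node 2
    # vNUMA 3: vCPUs 60-79 -> pinned to pCPUs of pNUMA node 3, memory from pNUMA node 3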

Comment 1 Fabian Deutsch 2022-08-17 08:25:54 UTC
The "basic implementation" is actually almost everything that KubeVirt has to do.
Almost all other limitations, including the non-deterministic behavior, exist due to Kubernetes limitations.

In order to stay consistent with Kubernetes, it was an intentional decision to a) reflect the pNUMA topology in the guests (vNUMA), as this is the best the virtualization layer can do. A second, but differently prioritized, goal b) was to enhance Kubernetes to gain better NUMA awareness in order to make more optimal pNUMA assignments to pods.
If a better pNUMA assignment to pods happens, then KubeVirt and its VMs will directly benefit.

Because this is Kubernetes' responsibility, this RFE must not be solved in KubeVirt.

Comment 2 Fabian Deutsch 2022-09-07 13:15:44 UTC
This got moved to https://issues.redhat.com/browse/CNV-21084

Possibly this RFE can be addressed by leveraging https://docs.openshift.com/container-platform/4.11/scalability_and_performance/cnf-numa-aware-scheduling.html
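
For completeness, the NUMA-aware scheduling approach linked above works by installing the NUMA Resources Operator, which exposes per-NUMA-zone resource availability and deploys a secondary, topology-aware scheduler; guaranteed-QoS workloads are then pointed at that scheduler. A minimal sketch of the workload side, loosely following the linked documentation (the scheduler name topo-aware-scheduler, the image and the resource sizes are assumptions taken for illustration; how KubeVirt's virt-launcher pods would be pointed at the secondary scheduler is exactly the open question here):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: numa-aware-workload                 # illustrative name
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: numa-aware-workload
      template:
        metadata:
          labels:
            app: numa-aware-workload
        spec:
          schedulerName: topo-aware-scheduler   # secondary scheduler deployed by the NUMA Resources Operator
          containers:
          - name: workload
            image: registry.example.com/app:latest   # illustrative image
            resources:                          # requests == limits -> guaranteed QoS, needed for NUMA-aligned placement
              requests:
                cpu: "20"
                memory: 64Gi
              limits:
                cpu: "20"
                memory: 64Gi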