Bug 2113895 - [RFE] Provide deterministic NUMA topology for Multi VM support
Summary: [RFE] Provide deterministic NUMA topology for Multi VM support
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.13.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.13.0
Assignee: sgott
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-02 10:15 UTC by Nils Koenig
Modified: 2022-09-07 13:15 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-07 13:15:44 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Bugzilla 2109255 (high, CLOSED): CPU Topology is not correct when using dedicatedCpuPlacement - Last Updated 2023-09-18 04:42:24 UTC

Description Nils Koenig 2022-08-02 10:15:11 UTC
As of today, support for NUMA topology is somewhat rudimentary.
Roman did some of the basic implementation:
https://github.com/kubevirt/user-guide/pull/457/files

But this has some nondeterministic behavior, in the sense that KubeVirt cannot provide specific cores from specific NUMA nodes. This "might work" for the single-big-VM scenario, but it is not sufficient to support multiple VMs. We need to be able to reliably acquire certain cores from specific NUMA nodes.

What we need is a way to define how many NUMA nodes (and sockets), cores and threads a VM should have, with the allocated resources living 1:1 on the underlying hardware, e.g. virtual CPU cores backed by physical cores from the same NUMA node, and the same for the memory attached to it.

The sizes we should plan for are 1, 2, 4, ... NUMA nodes per VM; if we can also plan for sub-NUMA-node entities (e.g. "half" a CPU socket), that would be great.

The current implementation is specified as follows:

    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2   
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}


Either this is enhanced with a NUMA node count attribute, e.g.:

    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2   
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
            nodes: 4

or the node count is implicitly derived from the socket count when guestMappingPassthrough is set. I guess that is up for discussion, since there might be platforms where there is no 1:1 mapping between sockets and NUMA nodes.
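
For reference, a fuller sketch of what such a request could look like for one of several VMs is shown below. The hugepages block reflects the precondition described in the user-guide PR above (guestMappingPassthrough needs dedicated CPUs and hugepages), while the nodes attribute is the proposed addition and does not exist in the current API:

    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
            nodes: 4                # proposed attribute, not part of today's API
        memory:
          hugepages:
            pageSize: 1Gi           # guestMappingPassthrough currently requires hugepages
        resources:
          requests:
            memory: 64Gi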

Comment 1 Fabian Deutsch 2022-08-17 08:25:54 UTC
The "basic implementation" is actually almost everything that KubeVirt has to do.
Almost all other limitations, including the non-deterministic behavior, exist due to Kubernetes limitations.

In order to stay consistent with Kubernetes, it was an intentional decision to a) reflect the pNUMA topology in guests (vNUMA), as this is the best the virtualization layer can do. A second, differently prioritized goal b) was to enhance Kubernetes to gain better NUMA awareness and make more optimal pNUMA assignments to pods.
If better pNUMA assignment to pods happens, then KubeVirt and its VMs will directly benefit.
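
For context, the NUMA-alignment knobs Kubernetes itself currently exposes for (b) live in the kubelet configuration: the static CPU Manager policy, the Static Memory Manager policy and the single-numa-node Topology Manager policy. A minimal illustrative KubeletConfiguration, assuming single-NUMA-node alignment is the desired behavior and treating the reservation values as placeholders, could look like this:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    cpuManagerPolicy: static                 # exclusive cores for guaranteed-QoS containers
    memoryManagerPolicy: Static              # align memory and hugepages with the chosen NUMA node
    topologyManagerPolicy: single-numa-node  # reject placements that would span NUMA nodes
    topologyManagerScope: pod
    reservedSystemCPUs: "0,1"                # placeholder reservation for system daemons
    reservedMemory:
      - numaNode: 0
        limits:
          memory: 1Gi                        # placeholder; must cover kube/system reservations

Even with these policies, the kubelet picks the concrete cores and NUMA node itself, which is the non-deterministic part described in this RFE.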

Because this is Kubernetes' responsibility, your RFE cannot be solved in KubeVirt.

Comment 2 Fabian Deutsch 2022-09-07 13:15:44 UTC
This got moved to https://issues.redhat.com/browse/CNV-21084

Possibly this RFE can be addressed by leveraging https://docs.openshift.com/container-platform/4.11/scalability_and_performance/cnf-numa-aware-scheduling.html
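
If that route is taken, a VM would opt into the secondary scheduler explicitly. A rough sketch, assuming the secondary scheduler is deployed under the name topo-aware-scheduler used in the linked documentation and that the VMI spec's schedulerName field (added in newer KubeVirt releases) is available:

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstance
    metadata:
      name: numa-vm-1                        # hypothetical example VM
    spec:
      schedulerName: topo-aware-scheduler    # assumed name of the NUMA-aware secondary scheduler
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
        memory:
          hugepages:
            pageSize: 1Gi
        resources:
          requests:
            memory: 64Gi
        devices: {}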

