Bug 2113895 - [RFE] Provide deterministic NUMA topology for Multi VM support
Summary: [RFE] Provide deterministic NUMA topology for Multi VM support
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.13.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.13.0
Assignee: sgott
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-02 10:15 UTC by Nils Koenig
Modified: 2022-09-07 13:15 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-07 13:15:44 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Bugzilla 2109255 (high, CLOSED): CPU Topology is not correct when using dedicatedCpuPlacement - Last Updated 2023-09-18 04:42:24 UTC

Description Nils Koenig 2022-08-02 10:15:11 UTC
As of today, support for NUMA topology is somewhat rudimentary.
Roman did some of the basic implementation:
https://github.com/kubevirt/user-guide/pull/457/files

But this has some nondeterministic behavior, in the sense that KubeVirt cannot provide specific cores from specific NUMA nodes. This "might work" for the single-big-VM scenario, but it is not sufficient to support multiple VMs. We need to be able to reliably acquire certain cores from specific NUMA nodes.

What we need is a way to define how many NUMA nodes (and sockets), cores and threads a VM should have, with the allocated resources living 1:1 on the underlying hardware, e.g. virtual CPU cores backed by physical cores from the same NUMA node, and the same for the memory attached to it.

The sizes we should plan for are 1, 2, 4, ... NUMA nodes per VM; if we can also plan for sub-NUMA-node entities (e.g. "half" a CPU socket), that would be great.

The current implementation is specified as follows:

    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2   
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}


Either this is enhanced with a NUMA node count attribute, e.g.:

    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2   
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
            nodes: 4

or the node count is implicitly derived from the socket count when guestMappingPassthrough is set. I guess that is up for discussion, since there might be platforms where there is no 1:1 mapping between sockets and NUMA nodes.
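
For reference, a fuller sketch of what such a request could look like for one of several VMs is shown below. The hugepages block reflects the precondition described in the user-guide PR above (guestMappingPassthrough needs dedicated CPUs and hugepages), while the nodes attribute is the proposed addition and does not exist in the current API:

    spec:
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
            nodes: 4                # proposed attribute, not part of today's API
        memory:
          hugepages:
            pageSize: 1Gi           # guestMappingPassthrough currently requires hugepages
        resources:
          requests:
            memory: 64Gi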

Comment 1 Fabian Deutsch 2022-08-17 08:25:54 UTC
The "basic implementation" is actually almost everything that KubeVirt has to do.
Almost all other limitations, including the non-deterministic behavior, exist due to Kubernetes limitations.

In order to stay consistent with Kubernetes, it was an intentional decision to a) reflect the pNUMA topology in guests (vNUMA), as this is the best the virtualization layer can do. A second, differently prioritized goal b) was to enhance Kubernetes to gain better NUMA awareness and make more optimal pNUMA assignments to pods.
If better pNUMA assignment to pods happens, then KubeVirt and its VMs will directly benefit.
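
For context, the NUMA-alignment knobs Kubernetes itself currently exposes for (b) live in the kubelet configuration: the static CPU Manager policy, the Static Memory Manager policy and the single-numa-node Topology Manager policy. A minimal illustrative KubeletConfiguration, assuming single-NUMA-node alignment is the desired behavior and treating the reservation values as placeholders, could look like this:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    cpuManagerPolicy: static                 # exclusive cores for guaranteed-QoS containers
    memoryManagerPolicy: Static              # align memory and hugepages with the chosen NUMA node
    topologyManagerPolicy: single-numa-node  # reject placements that would span NUMA nodes
    topologyManagerScope: pod
    reservedSystemCPUs: "0,1"                # placeholder reservation for system daemons
    reservedMemory:
      - numaNode: 0
        limits:
          memory: 1Gi                        # placeholder; must cover kube/system reservations

Even with these policies, the kubelet picks the concrete cores and NUMA node itself, which is the non-deterministic part described in this RFE.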

Because this is Kubernetes' responsibility, your RFE cannot be solved in KubeVirt.

Comment 2 Fabian Deutsch 2022-09-07 13:15:44 UTC
This got moved to https://issues.redhat.com/browse/CNV-21084

Possibly this RFE can be addressed by leveraging https://docs.openshift.com/container-platform/4.11/scalability_and_performance/cnf-numa-aware-scheduling.html
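
If that route is taken, a VM would opt into the secondary scheduler explicitly. A rough sketch, assuming the secondary scheduler is deployed under the name topo-aware-scheduler used in the linked documentation and that the VMI spec's schedulerName field (added in newer KubeVirt releases) is available:

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstance
    metadata:
      name: numa-vm-1                        # hypothetical example VM
    spec:
      schedulerName: topo-aware-scheduler    # assumed name of the NUMA-aware secondary scheduler
      domain:
        cpu:
          cores: 10
          sockets: 4
          threads: 2
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
        memory:
          hugepages:
            pageSize: 1Gi
        resources:
          requests:
            memory: 64Gi
        devices: {}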

