Description of problem:
On AMD nodes, we see these cpu-model labels:

  cpu-model-migration.node.kubevirt.io/EPYC-Rome=true
  cpu-model-migration.node.kubevirt.io/Nehalem=true
  cpu-model-migration.node.kubevirt.io/Nehalem-IBRS=true
  cpu-model-migration.node.kubevirt.io/Opteron_G1=true
  cpu-model-migration.node.kubevirt.io/Opteron_G2=true
  cpu-model-migration.node.kubevirt.io/Penryn=true
  cpu-model-migration.node.kubevirt.io/SandyBridge=true
  cpu-model-migration.node.kubevirt.io/SandyBridge-IBRS=true
  cpu-model-migration.node.kubevirt.io/Westmere=true
  cpu-model-migration.node.kubevirt.io/Westmere-IBRS=true

Here only EPYC-Rome and Opteron_G1/G2 are AMD CPU models.

Version-Release number of selected component (if applicable):
4.13.0 (where found; might be the same on older versions)

How reproducible:
100%

Steps to Reproduce:
1. oc describe node <node-name>

Actual results:
Intel CPU models are present on AMD CPU nodes.

Expected results:
Node labels include only CPU models relevant to the node.

Additional info:
This wrong labeling causes issues with nested virtualization (a Windows 10/11 VM can't run a WSL2 internal VM) when the VM is assigned a wrong CPU model.
The fact that labels for Intel CPUs appear on the cluster's nodes doesn't mean we support them for nested virtualization. The compatibility of specific CPU models for nested virtualization can vary depending on the hypervisor, virtualization software, and the CPU itself. It's possible that the "Nehalem" CPU model is not supported for nested virtualization in your particular setup.
I'm not sure what I'm asked for in comment #2, as it rather looks like a comment to the reporter. So I'm going to comment on the bug itself.

Libvirt provides a vendor for each supported CPU model in the domain capabilities XML, so it's trivial to filter just those that match the host vendor if needed:

  virsh domcapabilities --xpath "//cpu/mode[@name='custom']/model[@usable='yes' and @vendor='AMD']"

and the host vendor can easily be fetched from the capabilities XML:

  virsh capabilities --xpath "//host/cpu/vendor/text()"

This is supported since libvirt-8.9.0 (i.e., RHEL 9.2.0).
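For illustration, on an AMD EPYC host the first command prints something like the following (sample output, abridged; the exact model list depends on the host CPU and libvirt version):

  <model usable="yes" vendor="AMD">EPYC-Rome</model>
  <model usable="yes" vendor="AMD">EPYC-IBPB</model>
  <model usable="yes" vendor="AMD">Opteron_G3</model>

and the second simply prints the vendor string, e.g. "AMD".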
Barak, you are suggesting to limit the announced CPUs to CPUs of the same vendor as the host. However, the bug is about nesting running into trouble, and filtering the announced CPUs by vendor feels like a workaround. Isn't the problem rather that:

a. CNV announces host CPUs as compatible which aren't (they aren't, because live migration with WSL2 breaks)?
b. CNV does not express a scheduling constraint on everything relevant for live migration with nesting and WSL2?
c. Libvirt assumes host CPUs to be equal, even if they do differ when it comes to nesting?

Which of these is what we are seeing here?
Hi Fabian, I understand there seems to be some confusion. Let me clarify my stance on the matter. I am not suggesting limiting the announced CPUs to CPUs of the same vendor as the host. Our current approach involves publishing CPU models that are supported for virtualization. This means that users have the flexibility to set vm.spec.domain.cpu.model to any CPU model that has the label cpu-model.node.kubevirt.io/<CPUModel>: "true" on the cluster nodes.

It's worth mentioning that the compatibility of specific CPU models for nested virtualization can vary depending on factors such as the hypervisor, virtualization software, and the CPU itself. Hence, as far as I know, we do not guarantee that this set of CPU models will work flawlessly with nested virtualization.
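For example (Nehalem used purely for illustration), on a node carrying the label cpu-model.node.kubevirt.io/Nehalem: "true", a VM can request that model with a fragment like this (other required fields omitted):

  domain:
    cpu:
      model: Nehalem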
Agreed. Now, we - as in virt-handler - have knowledge of the hypervisor, the kernel, and the hardware. Are there any labels that handler could add to a node, and are there labels launcher could look for, in order to make live migration with nesting safer?
I understand your concerns, but providing specific information about supported nested CPU models with different hypervisors and guest CPU models can become very complex. Nested virtualization support depends on various factors, such as the CPU type and hypervisor combination, making it difficult to create a definitive and comprehensive list. Instead, I would consider using the "host-model" configuration when nested virtualization is required. This configuration allows the guest VM to inherit the CPU model of the host, ensuring a closer match and potentially better compatibility for nested virtualization.
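A minimal sketch of what that looks like in the VM spec (illustrative fragment; other required fields omitted):

  domain:
    cpu:
      model: host-model   # guest inherits the host CPU model instead of a named model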
I just wanted to add my 2 cents.

@vsibirsk These labels that you've mentioned indicate which CPU models [1] are supported on the node. But it doesn't mean that nested virtualization features will be enabled in the guest. To enable nested virtualization, you would also need to request the VMX flag (for Intel CPU models) or SVM (for AMD). For example, for Nehalem you'd also need to request VMX (not SVM), as the guest's BIOS will look for the VMX flag to enable the virtualization features in the guest. In this case, when you are also requesting the VMX feature, the scheduler will schedule your VM on a node that has the VMX flag available.

With host-model, the relevant VMX/SVM flag will be added and required (host-model-required-features.node.kubevirt.io/vmx=true) - given the flag is enabled on the node.

In general, KubeVirt doesn't offer an API that would guarantee that the nested virtualization features will eventually be enabled in the guest. There is also no intention to add such an API. Theoretically, KubeVirt could validate a mismatch between the requested CPU model and features. However, I think this is too complex and unnecessary.

I don't think that this issue is a bug. I also don't believe we have anything to fix here. I recommend closing this BZ.

[1] You can see the definitions of the various CPU models here: https://github.com/libvirt/libvirt/blob/master/src/cpu_map
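For illustration, a spec fragment requesting a named Intel model together with the vmx feature, so the scheduler only considers nodes exposing that flag (fragment only; other required fields omitted):

  domain:
    cpu:
      model: Nehalem
      features:
      - name: vmx
        policy: require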
If I understand Vladik correctly, then we are saying:

1. Recommended: Nesting works out of the box with model: host-model.
2. Nesting does not work out of the box with named CPU models, because they do not include vmx/svm flags.
3. Nesting works with a named CPU model if the specific SVM or VMX feature is specified (IIUIC).
FWIW, our current KCS explicitly mentions that model or passthrough needs to be set: https://access.redhat.com/solutions/6692341 It actually also covers #3. Thus I think we are good for now. Vasily, thoughts?
I agree it can be closed. The issue in the first place was caused by our automation messing with the HCO CPU config (which caused the selection of an Intel CPU model although the VM spec was suited for AMD). Then there was confusion because we were not sure if it's even supposed to be like this (Intel CPU models available on an AMD node). The only time we saw AMD CPU models on Intel nodes, it was a bug (https://bugzilla.redhat.com/show_bug.cgi?id=2122283).
Closing based on Comment #11