Bug 2127159

Summary: Need empty permittedHostDevices section in HCO CR with Nvidia GPU Operator
Product: Container Native Virtualization (CNV)
Reporter: Kedar Bidarkar <kbidarka>
Component: Virtualization
Assignee: Igor Bezukh <ibezukh>
Status: CLOSED NOTABUG
QA Contact: Kedar Bidarkar <kbidarka>
Severity: high
Priority: medium
Version: 4.11.0
CC: acardace, stirabos
Target Milestone: ---
Target Release: 4.14.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2023-07-25 15:10:04 UTC
Type: Bug

Description Kedar Bidarkar 2022-09-15 14:05:43 UTC
Description of problem:

Need empty permittedHostDevices section in HCO CR with Nvidia GPU Operator

spec:
  permittedHostDevices: {}

This is required so that only those host devices/GPUs listed in this section are permitted for use in the cluster.
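For reference, a populated section looks like the following (the device selectors and resource names below are illustrative examples, not values taken from this cluster):

```yaml
spec:
  permittedHostDevices:
    pciHostDevices:
      - pciDeviceSelector: "10DE:1EB8"   # example PCI vendor:device ID
        resourceName: "nvidia.com/TU104GL_Tesla_T4"
    mediatedDevices:
      - mdevNameSelector: "GRID V100D-2Q"
        resourceName: "nvidia.com/GRID_V100D-2Q"
```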

While configuring Nvidia Drivers via the Legacy approach on Nodes:
The node Capacity and Allocatable sections are updated with the GPU device only after the "permittedHostDevices" section in the HCO CR is updated.

While configuring Nvidia Drivers via the Nvidia GPU Operator on Nodes:
The node Capacity and Allocatable sections are updated with the GPU device even without updating the "permittedHostDevices" section in the HCO CR.

This renders the "permittedHostDevices" section in the HCO CR ineffective.
Only if we add an empty spec.permittedHostDevices: {} section to the HCO CR are the "permittedHostDevices" checks honored.

Version-Release number of selected component (if applicable):
4.11.0

How reproducible:
Always

Steps to Reproduce:
1. Configure GPU-PT or vGPU with Nvidia GPU Operator
2.
3.

Actual results:
While configuring Nvidia Drivers via the Nvidia GPU Operator on Nodes:
The node Capacity and Allocatable sections are updated with the GPU device even without updating the "permittedHostDevices" section in the HCO CR.

This renders the "permittedHostDevices" section in the HCO CR ineffective.

Expected results:
The "permittedHostDevices" checks are honored only if an empty spec.permittedHostDevices: {} section is added to the HCO CR as the default.

Additional info:

Comment 1 Kedar Bidarkar 2022-09-15 14:06:17 UTC
Capacity:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/sev:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              937156932Ki
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         131481724Ki
  nvidia.com/GRID_V100D-2Q:       16
  pods:                           250
Allocatable:
  cpu:                            79500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/sev:        0
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              863683827102
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         125612156Ki
  nvidia.com/GRID_V100D-2Q:       16
  pods:                           250

Comment 2 Antonio Cardace 2022-09-28 12:25:05 UTC
@stirabos it looks like an HCO issue, can you take a look?

Comment 3 Simone Tiraboschi 2022-10-03 10:50:32 UTC
Both on HCO and on Kubevirt, PermittedHostDevices is a pointer with an omitempty JSON tag; see:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/api/v1beta1/hyperconverged_types.go#L64
https://github.com/kubevirt/api/blob/main/core/v1/types.go#L2189

so its default is a nil value.

So, by default, nothing is set on the HCO:
oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{.spec.permittedHostDevices}'

and likewise nothing on Kubevirt:
oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged -o=jsonpath='{.spec.configuration.permittedHostDevices}'

When the user explicitly sets an empty dict on the HCO:

stirabos@t14s:~$ oc patch hco -n openshift-cnv --type=json kubevirt-hyperconverged -p '[{ "op": "replace", "path": "/spec/permittedHostDevices", "value": {} }]'
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged patched
stirabos@t14s:~$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{.spec.permittedHostDevices}'
{}


the HCO will propagate it to Kubevirt:
stirabos@t14s:~$ oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged -o=jsonpath='{.spec.configuration.permittedHostDevices}'
{}

so this at least looks consistent.

Why do you think this is a bug and what is the desired behaviour?

Comment 4 Simone Tiraboschi 2022-10-03 11:00:07 UTC
Just an additional comment: per the k8s API conventions, see https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md
"using a pointer allows distinguishing unset from the zero value for that type"

So here we correctly have an unset default, which is semantically different from a value explicitly set to empty ({} in this case); basically like NULL vs. empty strings in relational databases.
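The unset-vs-empty distinction described above can be reproduced with a minimal Go sketch (struct names are simplified from the linked KubeVirt/HCO types; only the pointer-with-omitempty shape is kept):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PermittedHostDevices stands in for the real KubeVirt struct; only the
// pointer-vs-value distinction matters for this demonstration.
type PermittedHostDevices struct{}

// Spec mirrors the HCO/KubeVirt field: a pointer with the omitempty tag.
type Spec struct {
	PermittedHostDevices *PermittedHostDevices `json:"permittedHostDevices,omitempty"`
}

// marshalSpec returns the JSON encoding of a Spec.
func marshalSpec(s Spec) string {
	b, _ := json.Marshal(s)
	return string(b)
}

func main() {
	// Unset (nil pointer): the field is omitted from the JSON entirely.
	fmt.Println(marshalSpec(Spec{}))
	// Explicitly empty: the field serializes as {}, matching the oc patch above.
	fmt.Println(marshalSpec(Spec{PermittedHostDevices: &PermittedHostDevices{}}))
}
```

A plain (non-pointer) struct field with omitempty could not make this distinction, which is why the API uses a pointer.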

Comment 6 Kedar Bidarkar 2023-02-22 13:26:43 UTC
Moving this to CNV 4.14, as we are using the KubeVirt device plugin for creating the mdevs.

Comment 7 Igor Bezukh 2023-07-09 16:35:46 UTC
Hi,

Can you please elaborate on the reason why this situation needs to be resolved on the virt core side?

Can't we modify the HCO CR deployment with spec.permittedHostDevices: {}?

Comment 8 Kedar Bidarkar 2023-07-25 15:10:04 UTC
I feel this bug can be closed.

While configuring Nvidia Drivers via the Nvidia GPU Operator, we suggest the following:
1) DisableMDEVConfiguration: false   ---> Should be the default value, as we suggest that our customers allow KubeVirt to perform the mdev slicing.
2) vgpuDeviceManager.enabled=false  ---> As we suggest this to our customers.

With the above changes AFAIK, this bug becomes obsolete.
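Sketched as configuration fragments, the two suggested settings above would look roughly like this (field placement is an assumption: the feature gate is spelled as in the HCO CR's featureGates stanza, and vgpuDeviceManager lives in the GPU Operator's ClusterPolicy CR; verify names against your CNV and GPU Operator versions):

```yaml
# HCO CR (assumed feature-gate spelling)
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
spec:
  featureGates:
    disableMDevConfiguration: false   # let KubeVirt perform the mdev slicing
---
# NVIDIA GPU Operator ClusterPolicy CR (assumed field placement)
apiVersion: nvidia.com/v1
kind: ClusterPolicy
spec:
  vgpuDeviceManager:
    enabled: false
```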

Please feel free to re-open this if anyone feels a need for it.