Description of problem:
An empty permittedHostDevices section is needed in the HCO CR when using the Nvidia GPU Operator:

spec:
  permittedHostDevices: {}

This is required so that only the host devices/GPUs listed in this section are permitted for use in the cluster.

When Nvidia drivers are configured on the nodes via the legacy approach, the node Capacity and Allocatable sections are updated with the GPU device only after the "permittedHostDevices" section in the HCO CR is updated.

When Nvidia drivers are configured on the nodes via the Nvidia GPU Operator, the node Capacity and Allocatable sections are updated with the GPU device even without updating the "permittedHostDevices" section in the HCO CR, which renders the "permittedHostDevices" section in the HCO CR useless. The "permittedHostDevices" checks are only honored if an empty spec.permittedHostDevices: {} section is added to the HCO CR.

Version-Release number of selected component (if applicable):
4.11.0

How reproducible:
Always

Steps to Reproduce:
1. Configure GPU-PT or vGPU with Nvidia GPU Operator
2.
3.

Actual results:
When Nvidia drivers are configured on the nodes via the Nvidia GPU Operator, the node Capacity and Allocatable sections are updated with the GPU device even without updating the "permittedHostDevices" section in the HCO CR, which renders the "permittedHostDevices" section in the HCO CR useless.

Expected results:
The "permittedHostDevices" checks are honored. Currently this only happens if an empty spec.permittedHostDevices: {} section is added to the HCO CR, which should be the default.

Additional info:
Capacity:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/sev:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              937156932Ki
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         131481724Ki
  nvidia.com/GRID_V100D-2Q:       16
  pods:                           250
Allocatable:
  cpu:                            79500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/sev:        0
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              863683827102
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         125612156Ki
  nvidia.com/GRID_V100D-2Q:       16
  pods:                           250
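For comparison, the legacy flow referenced above only surfaces the GPU in Capacity/Allocatable after it is listed under permittedHostDevices in the HCO CR. A minimal sketch of such an allow-list, assuming the documented HCO permittedHostDevices schema (mediatedDevices/pciHostDevices); the selector values are illustrative and not taken from this cluster:

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    mediatedDevices:
      # vGPU type matching the resource seen on the node in this report (illustrative)
      - mdevNameSelector: "GRID V100D-2Q"
        resourceName: "nvidia.com/GRID_V100D-2Q"
    pciHostDevices:
      # GPU passthrough example (vendor:device ID is illustrative)
      - pciDeviceSelector: "10DE:1DB6"
        resourceName: "nvidia.com/GV100GL_Tesla_V100"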
@stirabos it looks like an HCO issue, can you take a look?
Both on HCO and on Kubevirt, PermittedHostDevices is a pointer with an omitempty json tag, see:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/api/v1beta1/hyperconverged_types.go#L64
https://github.com/kubevirt/api/blob/main/core/v1/types.go#L2189
so its default is a nil value.

So nothing on HCO by default:

oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{.spec.permittedHostDevices}'

and the same on Kubevirt:

oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged -o=jsonpath='{.spec.configuration.permittedHostDevices}'

When the user explicitly sets an empty dict on the HCO:

stirabos@t14s:~$ oc patch hco -n openshift-cnv --type=json kubevirt-hyperconverged -p '[{ "op": "replace", "path": "/spec/permittedHostDevices", "value": {} }]'
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged patched
stirabos@t14s:~$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{.spec.permittedHostDevices}'
{}

the HCO will propagate it to Kubevirt:

stirabos@t14s:~$ oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged -o=jsonpath='{.spec.configuration.permittedHostDevices}'
{}

so this at least looks consistent. Why do you think this is a bug, and what is the desired behaviour?
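As a follow-up sketch of what I understand the reporter wants (my assumption, not current behaviour): once the empty dict is in place, a device whose device plugin is provided by the GPU Operator rather than by KubeVirt could be explicitly allowed with an entry along these lines, e.g. via oc edit hco -n openshift-cnv kubevirt-hyperconverged. Field names follow the HCO permittedHostDevices API; the use of externalResourceProvider here is my assumption for externally managed device plugins:

spec:
  permittedHostDevices:
    mediatedDevices:
      - mdevNameSelector: "GRID V100D-2Q"        # mdev type name as reported on the node
        resourceName: "nvidia.com/GRID_V100D-2Q" # resource name seen in Capacity/Allocatable
        externalResourceProvider: true           # device plugin run by the GPU Operator, not KubeVirt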
Just an additional comment: per the k8s API conventions, see https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md, "using a pointer allows distinguishing unset from the zero value for that type". So here we correctly have an unset default, which is semantically different from a value explicitly set to empty ({} in this case); basically something like NULL vs empty strings in relational databases.
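To make that distinction concrete in CR terms (a small sketch; the interpretation of the empty value is my reading of the comments above):

# unset (nil): the key is simply absent from the spec, HCO propagates nothing
spec: {}

# explicitly empty: the key is present with a zero value, HCO propagates {} to Kubevirt,
# i.e. presumably "no host devices are permitted"
spec:
  permittedHostDevices: {}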
Moving this to CNV 4.14, as we are using the KubeVirt device plugin for creating the mdevs.
Hi, can you please elaborate on why this situation needs to be resolved on the virt core side? Can't we modify the HCO CR deployment with spec.permittedHostDevices: {}?
I feel this bug can be closed. While configuring Nvidia drivers via the Nvidia GPU Operator, we suggest the following (a sketch of these settings is shown below):

1) DisableMDEVConfiguration: False ---> should be the default value, as we suggest that customers allow KubeVirt to perform the mdev slicing.
2) vgpuDeviceManager.enabled=false ---> as we suggest this to customers as well.

With the above changes, AFAIK, this bug becomes obsolete. Please feel free to re-open it if anyone feels the need.
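For anyone landing here later, a sketch of the suggested configuration. My assumption is that the two knobs map to the HCO feature gate and the GPU Operator ClusterPolicy field shown below; the exact field paths should be verified against the installed versions.

HCO CR (KubeVirt keeps doing the mdev slicing):

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  featureGates:
    disableMDevConfiguration: false   # default; KubeVirt performs the mdev slicing

GPU Operator ClusterPolicy (vGPU Device Manager disabled so it does not compete with KubeVirt; the resource name is illustrative):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    enabled: false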