Description of problem:
HostDevice allocatable and capacity counts on nodes are not updated when the device is no longer allowlisted in the HCO CR.

Version-Release number of selected component (if applicable):
CNV-4.8.0

How reproducible:
Always

Steps to Reproduce:
1. Update the HCO CR to allowlist a HostDevice:

   ]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
   spec:
     permittedHostDevices:
       pciHostDevices:
       - pciVendorSelector: "10DE:1EB8"
         resourceName: "nvidia.com/TU104GL_Tesla_T4"

2. Ensure the host device is visible under the Allocatable and Capacity sections of the node:

   Capacity:
     nvidia.com/TU104GL_Tesla_T4: 1
   Allocatable:
     nvidia.com/TU104GL_Tesla_T4: 1

3. Remove the "permittedHostDevices" section from the HCO CR.

Actual results:
The host device count is still "1" under the Allocatable and Capacity sections of the node.

   Capacity:
     nvidia.com/TU104GL_Tesla_T4: 1
   Allocatable:
     nvidia.com/TU104GL_Tesla_T4: 1

Expected results:
The host device count is "0" under the Allocatable and Capacity sections of the node.

   Capacity:
     nvidia.com/TU104GL_Tesla_T4: 0
   Allocatable:
     nvidia.com/TU104GL_Tesla_T4: 0

Additional info:
1) This used to work with the "kubevirt-config" CM.
2) As a side effect of this issue, the "permittedHostDevices" functionality breaks once a device has been allowlisted in the HCO CR.
3) Once allowed, a device always remains allowed.
4) The issue could possibly be on the HCO side too.
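For anyone re-checking steps 2 and 3, the count can be pulled out of the node description with a small helper. This is only a sketch: `resource_count` is a hypothetical name, and it assumes the counts are read from `oc describe node` output, where "Capacity:" and "Allocatable:" are unindented section headers followed by indented "name: value" lines.

```shell
# Hypothetical helper (not part of oc): print the count of a resource under
# a given section ("Capacity" or "Allocatable") of `oc describe node` output
# read from stdin. Prints "absent" if the resource is not listed there.
resource_count() {
  section="$1"
  resource="$2"
  awk -v section="$section" -v resource="$resource" '
    # An unindented line starts a new top-level section.
    /^[^[:space:]]/ { in_section = ($1 == (section ":")) }
    # Inside the wanted section, match the "<resource>:" key.
    in_section && $1 == (resource ":") { print $2; found = 1 }
    END { if (!found) print "absent" }
  '
}

# Usage against a live cluster (<node-name> is a placeholder):
#   oc describe node <node-name> | resource_count Allocatable nvidia.com/TU104GL_Tesla_T4
```

With the behaviour described above, the helper would keep printing 1 after step 3, whereas the expected result is 0.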
I tested this again on a fresh cluster setup and I can reliably reproduce this issue.
This issue also reproduces directly on kubevirt, by editing the kubevirt CR.
Master PR merged. PR backported to release-0.36 (CNV 2.6.z): https://github.com/kubevirt/kubevirt/pull/5375
NOTE: We are testing GPU/Host Device/PCI Passthrough stuff only from CNV-4.8.0+
To verify: follow reproduction steps in description
The default behaviour seems to have changed recently, when looking at this along with the HCO/hyperconverged CR.

Now with CNV-4.8.0, the pciHostDevices below are configured by default in hyperconverged:

   ]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
   permittedHostDevices:
     pciHostDevices:
     - pciDeviceSelector: 10DE:1DB6
       resourceName: nvidia.com/GV100GL_Tesla_V100
     - pciDeviceSelector: 10DE:1EB8
       resourceName: nvidia.com/TU104GL_Tesla_T4

Removal of the entire `permittedHostDevices` section is no longer allowed via HCO/hyperconverged, so users should not be hitting this bug.

Though this bug was fixed at the KubeVirt level too, there currently seems to be no straightforward way to verify this after the change in behaviour on the HCO side.

Will be moving this to VERIFIED state.
Thanks @Stu for checking on the backports.

There appears to be a slight delay in updating the count from "1" to "0" under the Allocatable and Capacity sections of the node.

Moving this back to VERIFIED state.
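Because of that delay, a single check right after the change can give a false failure. A rough polling sketch follows. `wait_for_output` is a hypothetical helper, and the attempt count and interval in the usage line are arbitrary values, not measured figures for how long the update takes.

```shell
# Hypothetical helper: re-run a command until it prints the expected value
# or the attempts run out. Returns 0 on a match, 1 otherwise.
#   $1 = expected output, $2 = max attempts, $3 = seconds between attempts,
#   remaining arguments = the command to run.
wait_for_output() {
  expected="$1"
  attempts="$2"
  interval="$3"
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Usage against a live cluster (<node-name> is a placeholder; the dot in the
# resource name is escaped for jsonpath):
#   wait_for_output 0 30 10 oc get node <node-name> \
#     -o jsonpath='{.status.allocatable.nvidia\.com/TU104GL_Tesla_T4}'
```

The usage line assumes the count drops to "0" as reported here; if the resource key were removed from the node status instead, the command would print an empty string and the expected value would need adjusting.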
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2920