Bug 1944379

Summary: HostDevice allocatable & capacity count on nodes doesn't update when device no longer allowlisted in HCO CR
Product: Container Native Virtualization (CNV)
Reporter: Kedar Bidarkar <kbidarka>
Component: Virtualization
Assignee: Jed Lejosne <jlejosne>
Status: CLOSED ERRATA
QA Contact: Kedar Bidarkar <kbidarka>
Severity: high
Priority: high
Version: 4.8.0
CC: cnv-qe-bugs, fdeutsch, jlejosne, sgott
Target Release: 4.8.0
Fixed In Version: hco-bundle-registry-container-v4.8.0-347 virt-operator-container-v4.8.0-58
Last Closed: 2021-07-27 14:29:42 UTC
Type: Bug

Description Kedar Bidarkar 2021-03-29 19:44:43 UTC
Description of problem:
HostDevice allocatable & capacity count on nodes doesn't get updated when a device is no longer allowlisted in the HCO CR.

Version-Release number of selected component (if applicable):
CNV-4.8.0

How reproducible:
always

Steps to Reproduce:
1. Update the HCO CR to allowlist a host device:

]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "10DE:1EB8"
        resourceName: "nvidia.com/TU104GL_Tesla_T4"
2. Ensure the host device is visible under the Allocatable and Capacity sections of the node:

Capacity:
  nvidia.com/TU104GL_Tesla_T4:    1
Allocatable:
  nvidia.com/TU104GL_Tesla_T4:    1
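
These counts can also be read directly with a JSONPath query; bracket notation is needed because the resource name contains dots and a slash. A sketch only: `<node-name>` is a placeholder for an actual node in the cluster.

```shell
# Print the capacity and allocatable counts for the GPU resource on one node.
oc get node <node-name> \
  -o jsonpath="{.status.capacity['nvidia\.com/TU104GL_Tesla_T4']}{'\n'}{.status.allocatable['nvidia\.com/TU104GL_Tesla_T4']}{'\n'}"
```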

3. Remove the "permittedHostDevices" section from the HCO CR.

Actual results:
The host device count is still "1" under the Allocatable and Capacity sections of the node:

Capacity:
  nvidia.com/TU104GL_Tesla_T4:    1
Allocatable:
  nvidia.com/TU104GL_Tesla_T4:    1


Expected results:
The host device count is "0" under the Allocatable and Capacity sections of the node:

Capacity:
  nvidia.com/TU104GL_Tesla_T4:    0
Allocatable:
  nvidia.com/TU104GL_Tesla_T4:    0
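
The gap between the actual and expected results above can be expressed as a small check: after the allowlist is emptied, no vendor-scoped device-plugin resource should remain at nonzero capacity. A minimal sketch, where the function name and the sample node data are illustrative, not part of KubeVirt:

```python
def stale_device_resources(permitted_resource_names, node_status):
    """Return device-plugin resource names that a node still reports
    with nonzero capacity even though they are no longer allowlisted."""
    stale = []
    for name, count in node_status.get("capacity", {}).items():
        # Only vendor-scoped resources (e.g. "nvidia.com/...") are device-plugin
        # resources; plain names like "cpu" and "memory" are skipped.
        if "/" in name and name not in permitted_resource_names:
            if int(count) > 0:
                stale.append(name)
    return sorted(stale)

# Buggy behaviour: allowlist emptied, but the node still reports the device.
node = {"capacity": {"cpu": "8", "nvidia.com/TU104GL_Tesla_T4": "1"}}
print(stale_device_resources(set(), node))
```

With the fix in place, the same check against a node whose counts have dropped to "0" would return an empty list.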

Additional info:

1) This used to work correctly with the "kubevirt-config" ConfigMap.

2) A side effect of this issue is that the "permittedHostDevices" functionality breaks once a device has been allowlisted in the HCO CR.

3) Once a device is allowed, it always remains allowed.

4) The issue could possibly be on the HCO side as well.

Comment 1 Kedar Bidarkar 2021-03-30 11:39:14 UTC
I tested this again on a fresh cluster setup and I can reliably reproduce this issue.

Comment 2 Jed Lejosne 2021-03-30 12:56:09 UTC
This issue also reproduces directly on kubevirt, by editing the kubevirt CR.
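
For reference, the equivalent setting in the KubeVirt CR lives under `spec.configuration`. A sketch only: the metadata values are placeholders, and the selector field name (`pciVendorSelector` vs. `pciDeviceSelector`) has varied across releases, as the snippets elsewhere in this bug show.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:1EB8"
          resourceName: "nvidia.com/TU104GL_Tesla_T4"
```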

Comment 3 Jed Lejosne 2021-04-01 12:26:53 UTC
Master PR merged.
PR backported to release-0.36 (CNV 2.6.z): https://github.com/kubevirt/kubevirt/pull/5375

Comment 4 Kedar Bidarkar 2021-04-07 11:28:48 UTC
NOTE: We are testing GPU/Host Device/PCI Passthrough stuff only from CNV-4.8.0+

Comment 5 sgott 2021-05-19 11:55:42 UTC
To verify: follow reproduction steps in description

Comment 6 Kedar Bidarkar 2021-05-31 13:59:01 UTC
The default behaviour seems to have changed recently with respect to the HCO/hyperconverged CR.
With CNV-4.8.0, the pciHostDevices below are now configured by default in the hyperconverged CR:

]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml  
  permittedHostDevices:
    pciHostDevices:
    - pciDeviceSelector: 10DE:1DB6
      resourceName: nvidia.com/GV100GL_Tesla_V100
    - pciDeviceSelector: 10DE:1EB8
      resourceName: nvidia.com/TU104GL_Tesla_T4

Removal of the entire `permittedHostDevices` section is no longer allowed via HCO/hyperconverged.

So users should not hit this bug.

Though this bug was also fixed at the KubeVirt level, there is currently no straightforward way to verify this after the change in behaviour on the HCO side.

Will be moving this to VERIFIED state.

Comment 9 Kedar Bidarkar 2021-06-23 17:45:17 UTC
Thanks, @Stu, for checking on the backports.

There appears to be a slight delay in updating the count from "1" to "0" under Allocatable and Capacity section of the Node.

Moving this back to VERIFIED state.

Comment 12 errata-xmlrpc 2021-07-27 14:29:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920