Bug 1944379 - HostDevice allocatable & capacity count on nodes doesn't update when device no longer allowlisted in HCO CR
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jed Lejosne
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-29 19:44 UTC by Kedar Bidarkar
Modified: 2021-07-27 14:30 UTC (History)

Fixed In Version: hco-bundle-registry-container-v4.8.0-347 virt-operator-container-v4.8.0-58
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 14:29:42 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 5350 | 0 | None | open | device controller: remove all device plugins when permittedHostDevices is deleted | 2021-03-30 18:09:56 UTC
Red Hat Product Errata | RHSA-2021:2920 | 0 | None | None | None | 2021-07-27 14:30:51 UTC

Description Kedar Bidarkar 2021-03-29 19:44:43 UTC
Description of problem:
The HostDevice allocatable and capacity counts on nodes are not updated when the device is no longer allowlisted in the HCO CR.

Version-Release number of selected component (if applicable):
CNV-4.8.0

How reproducible:
always

Steps to Reproduce:
1. Update the HCO CR to allowlist a HostDevice

$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "10DE:1EB8"
        resourceName: "nvidia.com/TU104GL_Tesla_T4"
2. Ensure the host device is visible under the Allocatable and Capacity sections of the Node.

Capacity:
  nvidia.com/TU104GL_Tesla_T4:    1
Allocatable:
  nvidia.com/TU104GL_Tesla_T4:    1

3. Remove the "permittedHostDevices" section from the HCO CR

Actual results:
The host device count is still "1" under the Allocatable and Capacity sections of the Node.

Capacity:
  nvidia.com/TU104GL_Tesla_T4:    1
Allocatable:
  nvidia.com/TU104GL_Tesla_T4:    1


Expected results:
The host device count is "0" under the Allocatable and Capacity sections of the Node.

Capacity:
  nvidia.com/TU104GL_Tesla_T4:    0
Allocatable:
  nvidia.com/TU104GL_Tesla_T4:    0
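
The Capacity/Allocatable comparison above could be scripted roughly as follows. This is only a sketch, assuming the Node object has been dumped with `oc get node <node-name> -o json`; the helper name `host_device_counts` and the sample JSON are illustrative and not part of the original report.

```python
import json


def host_device_counts(node, resource_name):
    """Return (capacity, allocatable) for resource_name from a Node object dict.

    A resource that is missing from a section is reported as 0.
    """
    status = node.get("status", {})
    capacity = int(status.get("capacity", {}).get(resource_name, "0"))
    allocatable = int(status.get("allocatable", {}).get(resource_name, "0"))
    return capacity, allocatable


# Illustrative sample, shaped like the relevant part of
# `oc get node <node-name> -o json` output (assumption, not real cluster data).
sample_node = json.loads("""
{"status": {"capacity": {"nvidia.com/TU104GL_Tesla_T4": "1"},
            "allocatable": {"nvidia.com/TU104GL_Tesla_T4": "1"}}}
""")

print(host_device_counts(sample_node, "nvidia.com/TU104GL_Tesla_T4"))
```

With the bug present, the same pair of counts would be returned before and after step 3; with the expected behaviour, both values would drop to 0 after step 3.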

Additional info:

1) This previously worked with the "kubevirt-config" ConfigMap.

2) As a side effect, the "permittedHostDevices" functionality breaks once a device has been allowlisted in the HCO CR.

3) Once a device is allowed, it remains allowed.

4) The issue may also be on the HCO side.

Comment 1 Kedar Bidarkar 2021-03-30 11:39:14 UTC
I tested this again on a fresh cluster setup and I can reliably reproduce this issue.

Comment 2 Jed Lejosne 2021-03-30 12:56:09 UTC
This issue also reproduces directly on kubevirt, by editing the kubevirt CR.

Comment 3 Jed Lejosne 2021-04-01 12:26:53 UTC
Master PR merged.
PR backported to release-0.36 (CNV 2.6.z): https://github.com/kubevirt/kubevirt/pull/5375

Comment 4 Kedar Bidarkar 2021-04-07 11:28:48 UTC
NOTE: We are testing GPU/Host Device/PCI Passthrough stuff only from CNV-4.8.0+

Comment 5 sgott 2021-05-19 11:55:42 UTC
To verify: follow reproduction steps in description

Comment 6 Kedar Bidarkar 2021-05-31 13:59:01 UTC
The default behaviour seems to have changed recently when looking at this together with the HCO/hyperconverged CR.
As of CNV-4.8.0, the following pciHostDevices are configured by default in the hyperconverged CR:

$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
  permittedHostDevices:
    pciHostDevices:
    - pciDeviceSelector: 10DE:1DB6
      resourceName: nvidia.com/GV100GL_Tesla_V100
    - pciDeviceSelector: 10DE:1EB8
      resourceName: nvidia.com/TU104GL_Tesla_T4

Removal of the entire `permittedHostDevices` section is no longer allowed via the HCO/hyperconverged CR.

So users should not hit this bug.

Although this bug was also fixed at the KubeVirt level, there currently seems to be no straightforward way to verify the fix
after the change in behaviour on the HCO side.

Will be moving this to VERIFIED state.

Comment 9 Kedar Bidarkar 2021-06-23 17:45:17 UTC
Thanks @Stu for checking on the backports.

There appears to be a slight delay before the count changes from "1" to "0" under the Allocatable and Capacity sections of the Node.

Moving this back to VERIFIED state.
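
Because of the delay noted in this comment, a verification script would probably need to poll rather than check once. The following is a rough sketch under assumptions: the function name `wait_for_zero`, the 120-second timeout, and the 5-second interval are all hypothetical, as the report does not state how long the delay is. `read_count` stands in for whatever reads the current Allocatable or Capacity count from the Node.

```python
import time


def wait_for_zero(read_count, timeout_s=120.0, interval_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll read_count() until it returns 0 or timeout_s elapses.

    Returns True if the count reached 0 in time, False otherwise.
    The timeout and interval defaults are arbitrary assumptions.
    """
    deadline = clock() + timeout_s
    while True:
        if read_count() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Simulated readings standing in for successive node queries: 1, 1, then 0.
readings = iter([1, 1, 0])
print(wait_for_zero(lambda: next(readings), sleep=lambda _: None))
```

The `sleep` and `clock` parameters are injectable only so the loop can be exercised without waiting; in real use the defaults would apply.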

Comment 12 errata-xmlrpc 2021-07-27 14:29:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

