Bug 1944379
| Summary: | HostDevice allocatable & capacity count on nodes doesn't update when device no longer allowlisted in HCO CR | ||
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Kedar Bidarkar <kbidarka> |
| Component: | Virtualization | Assignee: | Jed Lejosne <jlejosne> |
| Status: | CLOSED ERRATA | QA Contact: | Kedar Bidarkar <kbidarka> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.8.0 | CC: | cnv-qe-bugs, fdeutsch, jlejosne, sgott |
| Target Milestone: | --- | ||
| Target Release: | 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | hco-bundle-registry-container-v4.8.0-347 virt-operator-container-v4.8.0-58 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 14:29:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I tested this again on a fresh cluster setup and I can reliably reproduce this issue.
This issue also reproduces directly on kubevirt, by editing the kubevirt CR.
Master PR merged.
PR backported to release-0.36 (CNV 2.6.z): https://github.com/kubevirt/kubevirt/pull/5375
NOTE: We are testing GPU/Host Device/PCI Passthrough features only from CNV-4.8.0+
To verify: follow the reproduction steps in the description.
The default behaviour seems to have changed recently, when this is checked together with the HCO/hyperconverged CR.
With CNV-4.8.0, the following pciHostDevices are configured by default in the hyperconverged CR:
]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
permittedHostDevices:
  pciHostDevices:
  - pciDeviceSelector: 10DE:1DB6
    resourceName: nvidia.com/GV100GL_Tesla_V100
  - pciDeviceSelector: 10DE:1EB8
    resourceName: nvidia.com/TU104GL_Tesla_T4
Removal of the entire `permittedHostDevices` section is no longer allowed via HCO/hyperconverged.
So users should not be hitting this bug.
Though this bug was fixed at the KubeVirt level too, there currently seems to be no straightforward way to verify this
after the change in behaviour on the HCO side.
Will be moving this to VERIFIED state.
Thanks @Stu for checking on the backports.
There appears to be a slight delay in updating the count from "1" to "0" under the Allocatable and Capacity sections of the Node.
Moving this back to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2920
Description of problem:
HostDevice allocatable & capacity count on nodes doesn't get updated when the device is no longer allowlisted in the HCO CR.

Version-Release number of selected component (if applicable):
CNV-4.8.0

How reproducible:
always

Steps to Reproduce:
1. Update the HCO CR to allowlist a HostDevice
]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
    - pciVendorSelector: "10DE:1EB8"
      resourceName: "nvidia.com/TU104GL_Tesla_T4"
2. Ensure the hostdevice is visible under the Allocatable and Capacity sections of the Node.
Capacity:
  nvidia.com/TU104GL_Tesla_T4: 1
Allocatable:
  nvidia.com/TU104GL_Tesla_T4: 1
3. Remove the "permittedHostDevices" section from the HCO CR

Actual results:
The hostdevice count is still "1" under the Allocatable and Capacity sections of the Node.
Capacity:
  nvidia.com/TU104GL_Tesla_T4: 1
Allocatable:
  nvidia.com/TU104GL_Tesla_T4: 1

Expected results:
The hostdevice count is "0" under the Allocatable and Capacity sections of the Node.
Capacity:
  nvidia.com/TU104GL_Tesla_T4: 0
Allocatable:
  nvidia.com/TU104GL_Tesla_T4: 0

Additional info:
1) This used to work fine earlier with the "kubevirt-config CM".
2) The side effect of this issue is that the "permittedHostDevices" functionality breaks once the device is allowlisted in the HCO CR.
3) Once allowed, a device always remains allowed.
4) The issue could possibly be on the HCO side too.
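
The check in steps 2 and 3 can be scripted. The following is a minimal sketch, not part of the original report: it reads the `status.capacity` and `status.allocatable` maps of a Node object and, because the comments mention a slight delay before the count drops from "1" to "0", polls until both are zero or a timeout expires. The function names (`host_device_counts`, `wait_for_zero`, `fetch_node_with_oc`), the timeout values, and the assumption that `oc` is on the PATH and logged in to the cluster are all illustrative.

```python
import json
import subprocess
import time
from typing import Callable, Dict


def host_device_counts(node: dict, resource_name: str) -> Dict[str, int]:
    """Return the capacity and allocatable counts of one resource on a node.

    `node` is the parsed JSON of a Node object. A resource that is missing
    from a section is reported as 0.
    """
    status = node.get("status", {})
    return {
        section: int(status.get(section, {}).get(resource_name, 0))
        for section in ("capacity", "allocatable")
    }


def wait_for_zero(
    fetch_node: Callable[[], dict],
    resource_name: str,
    timeout: float = 60.0,   # illustrative value, not taken from the report
    interval: float = 5.0,   # illustrative value, not taken from the report
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until both counts are 0; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        counts = host_device_counts(fetch_node(), resource_name)
        if all(value == 0 for value in counts.values()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def fetch_node_with_oc(node_name: str) -> dict:
    """Hypothetical helper: assumes `oc` is installed and logged in."""
    output = subprocess.check_output(["oc", "get", "node", node_name, "-o", "json"])
    return json.loads(output)


if __name__ == "__main__":
    # Hand-written sample mirroring the "Actual results" in the description.
    sample = {
        "status": {
            "capacity": {"nvidia.com/TU104GL_Tesla_T4": "1"},
            "allocatable": {"nvidia.com/TU104GL_Tesla_T4": "1"},
        }
    }
    print(host_device_counts(sample, "nvidia.com/TU104GL_Tesla_T4"))
```

Against a live cluster, `wait_for_zero(lambda: fetch_node_with_oc("<node-name>"), "nvidia.com/TU104GL_Tesla_T4")` would be run after step 3; a `False` result corresponds to the "Actual results" of this bug.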