Bug 2046298
| Summary: | mdevs not configured with drivers installed, if mdev config added to HCO CR before drivers are installed | ||
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Kedar Bidarkar <kbidarka> |
| Component: | Virtualization | Assignee: | Jed Lejosne <jlejosne> |
| Status: | CLOSED ERRATA | QA Contact: | Akriti Gupta <akrgupta> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.10.0 | CC: | acardace, fdeutsch, jlejosne, sgott, ycui |
| Target Milestone: | --- | ||
| Target Release: | 4.12.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | hco-bundle-registry-containerv-4.12.0-736 | Doc Type: | Known Issue |
| Doc Text: |
If you configure the HyperConverged custom resource (CR) to enable mediated devices before drivers are installed, enablement of mediated devices does not occur. This issue can be triggered by updates. For example, if virt-handler is updated before the daemon set that installs the NVIDIA drivers, then nodes cannot provide virtual machine GPUs. (BZ#2046298)
As a workaround:
1. Remove mediatedDevicesConfiguration and permittedHostDevices from the HyperConverged CR.
2. Update both mediatedDevicesConfiguration and permittedHostDevices stanzas with the configuration you want to use.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-01-24 13:36:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Corrected Steps to Reproduce:
0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config.
   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
2. Configure the GPU nodes with the NVIDIA Drivers.
3. Remove the above HCO CR entry.
4. Update HCO CR with the below config.
   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
The current workaround is to:
1) Remove the "mediatedDevicesConfiguration" and "permittedHostDevices" entries from the HCO CR.
2) Update the HCO CR again with the desired "mediatedDevicesConfiguration" and "permittedHostDevices" configuration.

Deferring this to the next release due to bandwidth.

@kbidarka the steps to reproduce the issue include removing and re-adding the CR entries after installing the drivers. Then the workaround is to remove and re-add the CR entries after installing the drivers... I'm confused, the steps to reproduce the issue and the steps to fix it seem identical! Please clarify, thanks!

I should have listed the reproducer and the workaround separately.
Reproducer:
0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config.
   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
2. Configure the GPU nodes with the NVIDIA Drivers.
Notice that the mdev devices are not created, because the NVIDIA drivers were installed after the HCO CR was updated with mediatedDevicesConfiguration.
Workaround:
1. Remove the above HCO CR entry.
2. Update HCO CR with the below config.
   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
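
The remove-then-re-add workaround above can be sketched as two `oc patch` payloads. This is a minimal, non-authoritative sketch: the function names are hypothetical, it assumes both stanzas sit directly under `spec` of the `kubevirt-hyperconverged` CR in the `openshift-cnv` namespace (the name and namespace used elsewhere in this report), and it only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Sketch of the workaround as two patches. Function names are illustrative
# assumptions; the values are the ones quoted in this bug report.

# JSON patch that removes both stanzas from the HyperConverged CR.
# Note: a JSON patch "remove" fails if the path does not exist.
remove_patch() {
  cat <<'EOF'
[
  {"op": "remove", "path": "/spec/mediatedDevicesConfiguration"},
  {"op": "remove", "path": "/spec/permittedHostDevices"}
]
EOF
}

# Merge patch that re-adds the desired configuration.
readd_patch() {
  cat <<'EOF'
{"spec": {
  "mediatedDevicesConfiguration": {"mediatedDevicesTypes": ["nvidia-231"]},
  "permittedHostDevices": {"mediatedDevices": [
    {"mdevNameSelector": "GRID T4-2Q", "resourceName": "nvidia.com/GRID_T4_2Q"}
  ]}
}}
EOF
}

# Print the two commands instead of executing them, so the sketch is safe to try.
print_workaround() {
  printf "oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=json -p '%s'\n" "$(remove_patch)"
  printf "oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge -p '%s'\n" "$(readd_patch)"
}
```

Calling `print_workaround` shows the two commands in order. Review them against the cluster's actual CR before running anything; `oc edit`, as used in this report, achieves the same result interactively.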
Deferring to 4.12.1 as it's already in the 4.12.0 (release-0.58) merge-pool https://github.com/kubevirt/kubevirt/pull/8809.

Moving to ON_QA and back to 4.12 as this got pulled in by a recent rebase that was required to include the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2139896.

[kbidarka@localhost auth]$ oc get pods -n nvidia-gpu-operator
NAME                             READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv    1/1     Running   0          19h
nvidia-sandbox-validator-jxcbn   1/1     Running   0          2m55s
nvidia-vfio-manager-qtcnf        1/1     Running   0          3m31s
[kbidarka@localhost auth]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
...
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
    - nvidia-182
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: GRID V100D-4Q
      resourceName: nvidia.com/GRID_V100D_4Q
...
----------------------------
]$ oc label node node21.redhat.com --overwrite nvidia.com/gpu.workload.config=vm-vgpu
node/node21.redhat.com labeled
]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS        RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running       0          19h
nvidia-sandbox-validator-jxcbn                              1/1     Terminating   0          9m45s
nvidia-vfio-manager-qtcnf                                   1/1     Terminating   0          10m
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   0/2     Init:0/1      0          4s
]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running    0          19h
nvidia-sandbox-validator-t8h6v                              0/1     Init:1/3   0          28s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running    0          64s
]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running    0          19h
nvidia-sandbox-validator-t8h6v                              0/1     Init:2/3   0          102s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running    0          2m18s
]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running   0          19h
nvidia-sandbox-validator-t8h6v                              1/1     Running   0          2m12s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running   0          2m48s
]$ oc describe node node21.redhat.com
Capacity:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              937156932Ki
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         131481720Ki
  nvidia.com/GRID_V100D_2Q:       0
  nvidia.com/GRID_V100D_4Q:       8
  nvidia.com/GV100GL_Tesla_V100:  0
  pods:                           250
Allocatable:
  cpu:                            79500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              862610085278
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         125612152Ki
  nvidia.com/GRID_V100D_2Q:       0
  nvidia.com/GRID_V100D_4Q:       8
  nvidia.com/GV100GL_Tesla_V100:  0
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests     Limits
  --------                       --------     ------
  cpu                            1289m (1%)   4 (5%)
  memory                         7664Mi (6%)  2Gi (1%)
  ephemeral-storage              0 (0%)       0 (0%)
  hugepages-1Gi                  0 (0%)       0 (0%)
  hugepages-2Mi                  0 (0%)       0 (0%)
  devices.kubevirt.io/kvm        0            0
  devices.kubevirt.io/tun        0            0
  devices.kubevirt.io/vhost-net  0            0
  nvidia.com/GRID_V100D_2Q       0            0
  nvidia.com/GRID_V100D_4Q       0            0
  nvidia.com/GV100GL_Tesla_V100  0            0
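
To check the verification result above without reading the whole node description, the mdev resource count can be pulled out of the Allocatable section. A small sketch, assuming `oc describe node` output in the layout shown above; the helper name `allocatable_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `oc describe node` output on stdin and print the
# Allocatable value for the resource named in $1. Exits non-zero if the
# resource is not listed after the "Allocatable:" heading.
allocatable_count() {
  awk -v key="$1:" '
    $1 == "Allocatable:"    { in_section = 1; next }   # start of the section
    in_section && $1 == key { print $2; found = 1; exit }
    END                     { if (!found) exit 1 }
  '
}
```

Usage, with the node and resource names from this report: `oc describe node node21.redhat.com | allocatable_count nvidia.com/GRID_V100D_4Q`. For the output shown above this prints 8; a value of 0, or a non-zero exit status, would suggest the mdevs were not configured.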
Summary: As seen in comment 10, mdevs now get configured even when the mdev config is added to the HCO CR before the NVIDIA vGPU drivers are installed.
Verified with the 4.12.0-741 build.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:0408 |
Description of problem:
mdevs not configured with drivers installed, if mdev config added to HCO CR before drivers are installed

Version-Release number of selected component (if applicable):
4.10.0

How reproducible:
1) When mdev config added to HCO CR before drivers are installed.
2) mdevs do not get configured later, even with drivers installed.

Steps to Reproduce:
0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config.
   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
     - nvidia-232
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
     - mdevNameSelector: "GRID T4-4Q"
       resourceName: "nvidia.com/GRID_T4_4Q"
2. Remove the above HCO CR entry.
3. Configure the GPU nodes with the NVIDIA Drivers.
4. Update HCO CR with the below config.
   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
     - nvidia-232
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
     - mdevNameSelector: "GRID T4-4Q"
       resourceName: "nvidia.com/GRID_T4_4Q"

Actual results:
mdevs not configured with drivers installed, if mdev config added to HCO CR before drivers are installed

Expected results:
Always, configure the mdevs whenever the drivers are installed

Additional info: