Description of problem:
mdevs are not configured, even after the drivers are installed, if the mdev config is added to the HCO CR before the drivers are installed.

Version-Release number of selected component (if applicable):
4.10.0

How reproducible:
1) Add the mdev config to the HCO CR before the drivers are installed.
2) The mdevs do not get configured later, even once the drivers are installed.

Steps to Reproduce:
0. Do not configure the GPU nodes with the NVIDIA drivers.
1. Update the HCO CR with the below config.

   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
     - nvidia-232
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
     - mdevNameSelector: "GRID T4-4Q"
       resourceName: "nvidia.com/GRID_T4_4Q"

2. Remove the above entries from the HCO CR.
3. Configure the GPU nodes with the NVIDIA drivers.
4. Update the HCO CR with the below config.

   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
     - nvidia-232
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
     - mdevNameSelector: "GRID T4-4Q"
       resourceName: "nvidia.com/GRID_T4_4Q"

Actual results:
The mdevs are not configured even with the drivers installed, because the mdev config was added to the HCO CR before the drivers were installed.

Expected results:
The mdevs should be configured whenever the drivers are installed, regardless of when the mdev config was added to the HCO CR.

Additional info:
Correction to the Steps to Reproduce:

0. Do not configure the GPU nodes with the NVIDIA drivers.
1. Update the HCO CR with the below config.

   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"

2. Configure the GPU nodes with the NVIDIA drivers.
3. Remove the above entries from the HCO CR.
4. Update the HCO CR with the below config.

   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
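For reference, a minimal sketch of how the HCO CR update in steps 1 and 4 could be applied non-interactively (this assumes an inline JSON merge patch against the HCO CR in the openshift-cnv namespace; `oc edit` works just as well):

   # Sketch only: apply the mediatedDevicesConfiguration / permittedHostDevices
   # stanza from the steps above as a merge patch on the HCO CR.
   oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge -p '{
     "spec": {
       "mediatedDevicesConfiguration": {"mediatedDevicesTypes": ["nvidia-231"]},
       "permittedHostDevices": {"mediatedDevices": [
         {"mdevNameSelector": "GRID T4-2Q", "resourceName": "nvidia.com/GRID_T4_2Q"}
       ]}
     }
   }'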
The current workaround is to:
1) Remove the "mediatedDevicesConfiguration" and "permittedHostDevices" entries from the HCO CR, and
2) Update the HCO CR again with the desired "mediatedDevicesConfiguration" and "permittedHostDevices" configuration.
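A rough sketch of the removal step as a command (assuming the two stanzas live directly under .spec of the HCO CR; adjust the paths if your CR differs), after which the desired config can be re-applied with `oc edit` or the merge patch shown in the previous comment:

   # Sketch only: drop both stanzas with a JSON patch, then re-add them.
   oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type json -p '[
     {"op": "remove", "path": "/spec/mediatedDevicesConfiguration"},
     {"op": "remove", "path": "/spec/permittedHostDevices"}
   ]'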
Deferring this to the next release due to bandwidth.
@kbidarka the steps to reproduce the issue include removing and re-adding the CR entries after installing the drivers. Then the workaround is also to remove and re-add the CR entries after installing the drivers... I'm confused: the steps to reproduce the issue and the steps to fix it seem identical! Please clarify, thanks!
I should have split the reproducer and the workaround; here they are separately.

Reproducer:
0. Do not configure the GPU nodes with the NVIDIA drivers.
1. Update the HCO CR with the below config.

   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"

2. Configure the GPU nodes with the NVIDIA drivers.

Notice that the mdev devices are not created successfully, because the NVIDIA drivers were installed after updating the HCO CR with mediatedDevicesConfiguration.

Workaround:
1. Remove the above entries from the HCO CR.
2. Update the HCO CR with the below config.

   mediatedDevicesConfiguration:
     mediatedDevicesTypes:
     - nvidia-231
   permittedHostDevices:
     mediatedDevices:
     - mdevNameSelector: "GRID T4-2Q"
       resourceName: "nvidia.com/GRID_T4_2Q"
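One way to check whether the mdevs actually got created after step 2 of the reproducer (a sketch; the node name is a placeholder, not from this bug):

   # From the cluster: the permitted vGPU resource should show a non-zero
   # count on the node once the mdevs exist.
   oc describe node <gpu-node> | grep -i 'nvidia.com/'

   # From a debug shell on the GPU node: created mdev devices appear under
   # the mdev bus in sysfs.
   oc debug node/<gpu-node> -- chroot /host ls /sys/bus/mdev/devices/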
Deferring to 4.12.1 as it's already in the 4.12.0 (release-0.58) merge-pool https://github.com/kubevirt/kubevirt/pull/8809.
Moving to ON_QA and back to 4.12 as this got pulled in by a recent rebase that was required to include the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2139896.
[kbidarka@localhost auth]$ oc get pods -n nvidia-gpu-operator
NAME                             READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv    1/1     Running   0          19h
nvidia-sandbox-validator-jxcbn   1/1     Running   0          2m55s
nvidia-vfio-manager-qtcnf        1/1     Running   0          3m31s

[kbidarka@localhost auth]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
...
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
    - nvidia-182
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: GRID V100D-4Q
      resourceName: nvidia.com/GRID_V100D_4Q
...
----------------------------
]$ oc label node node21.redhat.com --overwrite nvidia.com/gpu.workload.config=vm-vgpu
node/node21.redhat.com labeled

]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS        RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running       0          19h
nvidia-sandbox-validator-jxcbn                              1/1     Terminating   0          9m45s
nvidia-vfio-manager-qtcnf                                   1/1     Terminating   0          10m
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   0/2     Init:0/1      0          4s

]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running    0          19h
nvidia-sandbox-validator-t8h6v                              0/1     Init:1/3   0          28s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running    0          64s

]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running    0          19h
nvidia-sandbox-validator-t8h6v                              0/1     Init:2/3   0          102s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running    0          2m18s

]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running   0          19h
nvidia-sandbox-validator-t8h6v                              1/1     Running   0          2m12s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running   0          2m48s

]$ oc describe node node21.redhat.com
Capacity:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              937156932Ki
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         131481720Ki
  nvidia.com/GRID_V100D_2Q:       0
  nvidia.com/GRID_V100D_4Q:       8
  nvidia.com/GV100GL_Tesla_V100:  0
  pods:                           250
Allocatable:
  cpu:                            79500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              862610085278
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         125612152Ki
  nvidia.com/GRID_V100D_2Q:       0
  nvidia.com/GRID_V100D_4Q:       8
  nvidia.com/GV100GL_Tesla_V100:  0
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests      Limits
  --------                       --------      ------
  cpu                            1289m (1%)    4 (5%)
  memory                         7664Mi (6%)   2Gi (1%)
  ephemeral-storage              0 (0%)        0 (0%)
  hugepages-1Gi                  0 (0%)        0 (0%)
  hugepages-2Mi                  0 (0%)        0 (0%)
  devices.kubevirt.io/kvm        0             0
  devices.kubevirt.io/tun        0             0
  devices.kubevirt.io/vhost-net  0             0
  nvidia.com/GRID_V100D_2Q       0             0
  nvidia.com/GRID_V100D_4Q       0             0
  nvidia.com/GV100GL_Tesla_V100  0             0
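For completeness, a minimal sketch of how a VM would consume the advertised resource once the node reports nvidia.com/GRID_V100D_4Q as allocatable (illustrative fragment only, not part of the verification above; it assumes the standard KubeVirt GPU device-assignment fields):

   # Fragment of a VirtualMachine spec requesting the permitted vGPU resource.
   spec:
     template:
       spec:
         domain:
           devices:
             gpus:
             - deviceName: nvidia.com/GRID_V100D_4Q   # matches the resourceName in the HCO CR above
               name: vgpu1                            # example device name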
Summary: As seen in comment 10, mdevs now do get configured even when the mdev config is added to the HCO CR before the NVIDIA vGPU drivers are installed. Verified with the 4.12.0-741 build.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:0408