Bug 2046298 - mdevs not configured with drivers installed, if mdev config added to HCO CR before drivers are installed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Jed Lejosne
QA Contact: Akriti Gupta
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-26 14:31 UTC by Kedar Bidarkar
Modified: 2023-01-24 13:36 UTC (History)

Fixed In Version: hco-bundle-registry-containerv-4.12.0-736
Doc Type: Known Issue
Doc Text:
If you configure the HyperConverged custom resource (CR) to enable mediated devices before the drivers are installed, the mediated devices are not enabled. This issue can be triggered by updates. For example, if virt-handler is updated before the daemon set that installs the NVIDIA drivers, then nodes cannot provide virtual machine GPUs. (BZ#2046298)
As a workaround:
1. Remove mediatedDevicesConfiguration and permittedHostDevices from the HyperConverged CR.
2. Update both the mediatedDevicesConfiguration and permittedHostDevices stanzas with the configuration you want to use.
Clone Of:
Environment:
Last Closed: 2023-01-24 13:36:05 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 7190 | 0 | None | Draft | update mdev devices when the layout on the host changes | 2022-09-12 13:05:50 UTC
Github | kubevirt kubevirt pull 8481 | 0 | None | Merged | Always refresh the mediated devices | 2022-11-18 09:00:11 UTC
Github | kubevirt kubevirt pull 8809 | 0 | None | Merged | [release-0.58] Always refresh the mediated devices | 2022-11-23 10:03:34 UTC
Red Hat Issue Tracker | CNV-16022 | 0 | None | None | None | 2022-10-27 11:53:51 UTC

Description Kedar Bidarkar 2022-01-26 14:31:30 UTC
Description of problem:
mdevs are not configured once the drivers are installed if the mdev config was added to the HCO CR before the drivers were installed.

Version-Release number of selected component (if applicable):
4.10.0

How reproducible:
1) Add the mdev config to the HCO CR before the drivers are installed.
2) The mdevs are not configured later, even after the drivers are installed.



Steps to Reproduce:
0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
      - nvidia-232
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
    - mdevNameSelector: "GRID T4-4Q"
      resourceName: "nvidia.com/GRID_T4_4Q"
2. Remove the above HCO CR entry.
3. Configure the GPU nodes with the NVIDIA Drivers.
4. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
      - nvidia-232
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
    - mdevNameSelector: "GRID T4-4Q"
      resourceName: "nvidia.com/GRID_T4_4Q"

Actual results:
mdevs are not configured once the drivers are installed if the mdev config was added to the HCO CR before the drivers were installed.

Expected results:
The mdevs are always configured once the drivers are installed.

Additional info:

Comment 1 Kedar Bidarkar 2022-01-26 14:50:01 UTC
Correction to the Steps to Reproduce:


0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
2. Configure the GPU nodes with the NVIDIA Drivers.
3. Remove the above HCO CR entry.
4. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"

Comment 3 Kedar Bidarkar 2022-02-01 22:14:17 UTC
The current workaround is to:
1) Remove the "mediatedDevicesConfiguration" and "permittedHostDevices" entries from the HCO CR, and
2) Update the HCO CR again with the desired "mediatedDevicesConfiguration" and "permittedHostDevices" configuration.

Comment 5 sgott 2022-05-27 13:27:00 UTC
Deferring this to the next release due to bandwidth.

Comment 6 Jed Lejosne 2022-09-07 19:28:39 UTC
@kbidarka the steps to reproduce the issue include removing and re-adding the CR entries after installing the drivers.
Then the workaround is to remove and re-add the CR entries after installing the drivers...
I'm confused, the steps to reproduce the issue and the steps to fix it seem identical!
Please clarify, thanks!

Comment 7 Kedar Bidarkar 2022-10-27 10:18:52 UTC
I should have listed the reproducer and the workaround separately.

0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
2. Configure the GPU nodes with the NVIDIA Drivers.

Notice that the MDEV devices are not created, because the NVIDIA drivers were installed after the HCO CR was updated with mediatedDevicesConfiguration.
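
To confirm on the node itself that the driver is loaded and advertises the expected mdev types, something along these lines might help. It is only a sketch: it reads the kernel's VFIO mediated device sysfs layout (/sys/class/mdev_bus/<parent>/mdev_supported_types/<type>/name) and would have to run on the GPU node, for example from a debug shell. The function name supported_mdev_types is made up for illustration and is not part of KubeVirt or the NVIDIA tooling.

```python
from pathlib import Path


def supported_mdev_types(sysfs_root="/sys/class/mdev_bus"):
    """Map each mdev-capable parent device to the mdev types it advertises.

    Returns an empty dict when the directory does not exist, which is what
    one would expect on a node where no mdev-capable driver is loaded.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}

    result = {}
    for parent in sorted(root.iterdir()):
        types_dir = parent / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        types = {}
        for type_dir in sorted(types_dir.iterdir()):
            # Each type directory (e.g. nvidia-231) carries a "name" file
            # holding the human-readable name (e.g. "GRID T4-2Q").
            name_file = type_dir / "name"
            types[type_dir.name] = (
                name_file.read_text().strip() if name_file.is_file() else ""
            )
        result[parent.name] = types
    return result


if __name__ == "__main__":
    found = supported_mdev_types()
    if not found:
        print("no mdev-capable devices found (driver not loaded?)")
    for parent, types in found.items():
        for type_id, name in types.items():
            print(f"{parent}: {type_id} ({name})")
```

If this prints the expected types (nvidia-231 in this reproducer) but the node still reports no matching resource, the driver side is fine and the HCO CR configuration was simply not re-applied.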

Workaround:

1. Remove the above HCO CR entry.
2. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
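
The two workaround steps could also be scripted instead of done through oc edit. The sketch below only builds the JSON Patch documents and the matching oc patch command lines; it does not run anything. It assumes both stanzas live under spec in the HyperConverged CR, and it reuses the CR name and namespace shown in comment 10. The helper names (removal_patch, restore_patch, oc_patch_command) are made up for illustration.

```python
import json

# Name and namespace as used in the `oc edit` command in comment 10.
HCO_NAME = "kubevirt-hyperconverged"
HCO_NAMESPACE = "openshift-cnv"


def removal_patch():
    """Workaround step 1: drop both stanzas (assumed to sit under /spec)."""
    return [
        {"op": "remove", "path": "/spec/mediatedDevicesConfiguration"},
        {"op": "remove", "path": "/spec/permittedHostDevices"},
    ]


def restore_patch(mdev_types, permitted):
    """Workaround step 2: re-add both stanzas.

    mdev_types: iterable of type IDs, e.g. ["nvidia-231"]
    permitted:  iterable of (mdevNameSelector, resourceName) pairs
    """
    return [
        {
            "op": "add",
            "path": "/spec/mediatedDevicesConfiguration",
            "value": {"mediatedDevicesTypes": list(mdev_types)},
        },
        {
            "op": "add",
            "path": "/spec/permittedHostDevices",
            "value": {
                "mediatedDevices": [
                    {"mdevNameSelector": selector, "resourceName": resource}
                    for selector, resource in permitted
                ]
            },
        },
    ]


def oc_patch_command(patch):
    """Return the argv list for applying a JSON Patch with `oc patch`."""
    return [
        "oc", "patch", "hyperconverged", HCO_NAME,
        "-n", HCO_NAMESPACE,
        "--type=json", "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    print(" ".join(oc_patch_command(removal_patch())))
    restore = restore_patch(
        ["nvidia-231"], [("GRID T4-2Q", "nvidia.com/GRID_T4_2Q")]
    )
    print(" ".join(oc_patch_command(restore)))
```

Note that a JSON Patch remove operation fails if the path does not exist, so the first command only applies when both stanzas are currently present.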

Comment 8 Antonio Cardace 2022-11-18 16:07:16 UTC
Deferring to 4.12.1 as it's already in the 4.12.0 (release-0.58) merge-pool https://github.com/kubevirt/kubevirt/pull/8809.

Comment 9 Antonio Cardace 2022-11-23 10:06:06 UTC
Moving to ON_QA and back to 4.12 as this got pulled in by a recent rebase that was required to include the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2139896.

Comment 10 Kedar Bidarkar 2022-11-30 08:27:34 UTC
[kbidarka@localhost auth]$ oc get pods -n nvidia-gpu-operator
NAME                             READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv    1/1     Running   0          19h
nvidia-sandbox-validator-jxcbn   1/1     Running   0          2m55s
nvidia-vfio-manager-qtcnf        1/1     Running   0          3m31s

[kbidarka@localhost auth]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
  ...
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
    - nvidia-182
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: GRID V100D-4Q
      resourceName: nvidia.com/GRID_V100D_4Q
  ...

----------------------------

]$ oc label node node21.redhat.com --overwrite nvidia.com/gpu.workload.config=vm-vgpu
node/node21.redhat.com labeled

]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS        RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running       0          19h
nvidia-sandbox-validator-jxcbn                              1/1     Terminating   0          9m45s
nvidia-vfio-manager-qtcnf                                   1/1     Terminating   0          10m
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   0/2     Init:0/1      0          4s


]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running    0          19h
nvidia-sandbox-validator-t8h6v                              0/1     Init:1/3   0          28s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running    0          64s


]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running    0          19h
nvidia-sandbox-validator-t8h6v                              0/1     Init:2/3   0          102s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running    0          2m18s

]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running   0          19h
nvidia-sandbox-validator-t8h6v                              1/1     Running   0          2m12s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running   0          2m48s

]$ oc describe node node21.redhat.com

Capacity:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              937156932Ki
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         131481720Ki
  nvidia.com/GRID_V100D_2Q:       0
  nvidia.com/GRID_V100D_4Q:       8
  nvidia.com/GV100GL_Tesla_V100:  0
  pods:                           250
Allocatable:
  cpu:                            79500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              862610085278
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         125612152Ki
  nvidia.com/GRID_V100D_2Q:       0
  nvidia.com/GRID_V100D_4Q:       8
  nvidia.com/GV100GL_Tesla_V100:  0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests     Limits
  --------                       --------     ------
  cpu                            1289m (1%)   4 (5%)
  memory                         7664Mi (6%)  2Gi (1%)
  ephemeral-storage              0 (0%)       0 (0%)
  hugepages-1Gi                  0 (0%)       0 (0%)
  hugepages-2Mi                  0 (0%)       0 (0%)
  devices.kubevirt.io/kvm        0            0
  devices.kubevirt.io/tun        0            0
  devices.kubevirt.io/vhost-net  0            0
  nvidia.com/GRID_V100D_2Q       0            0
  nvidia.com/GRID_V100D_4Q       0            0
  nvidia.com/GV100GL_Tesla_V100  0            0
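
The check done by eye above (a non-zero nvidia.com/GRID_V100D_4Q count under Capacity and Allocatable) could be automated roughly as follows. This is a sketch that parses the text layout of oc describe node as it appears in this comment; the function names are made up for illustration, and the layout is assumed to stay as shown (section header at column 0, indented "name: value" entries).

```python
def parse_section(describe_text, section="Capacity"):
    """Return {resource: value} for one top-level section of the output."""
    values = {}
    in_section = False
    for line in describe_text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            # A non-indented line starts a new top-level section.
            in_section = line.strip().rstrip(":") == section
            continue
        if in_section and ":" in line:
            key, _, value = line.strip().rpartition(":")
            values[key.strip()] = value.strip()
    return values


def device_available(describe_text, resource, section="Allocatable"):
    """True if the node reports a non-zero count for the given resource."""
    raw = parse_section(describe_text, section).get(resource, "0")
    return raw.isdigit() and int(raw) > 0


if __name__ == "__main__":
    # Shortened excerpt of the output above.
    sample = (
        "Capacity:\n"
        "  cpu:                            80\n"
        "  nvidia.com/GRID_V100D_2Q:       0\n"
        "  nvidia.com/GRID_V100D_4Q:       8\n"
        "Allocatable:\n"
        "  cpu:                            79500m\n"
        "  nvidia.com/GRID_V100D_2Q:       0\n"
        "  nvidia.com/GRID_V100D_4Q:       8\n"
    )
    print(device_available(sample, "nvidia.com/GRID_V100D_4Q"))
    print(device_available(sample, "nvidia.com/GRID_V100D_2Q"))
```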

Comment 11 Kedar Bidarkar 2022-11-30 08:33:34 UTC
Summary: As seen in comment 10,
mdevs now get configured even when the mdev config is added to the HCO CR before the NVIDIA vGPU drivers are installed.

Verified with build 4.12.0-741.

Comment 15 errata-xmlrpc 2023-01-24 13:36:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

