Bug 2046298

Summary: mdevs not configured even after drivers are installed, if mdev config is added to HCO CR before drivers are installed
Product: Container Native Virtualization (CNV)
Reporter: Kedar Bidarkar <kbidarka>
Component: Virtualization
Assignee: Jed Lejosne <jlejosne>
Status: CLOSED ERRATA
QA Contact: Akriti Gupta <akrgupta>
Severity: high
Docs Contact:
Priority: high
Version: 4.10.0
CC: acardace, fdeutsch, jlejosne, sgott, ycui
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: hco-bundle-registry-container-v4.12.0-736
Doc Type: Known Issue
Doc Text:
If you configure the HyperConverged custom resource (CR) to enable mediated devices before the drivers are installed, the mediated devices are not enabled. This issue can be triggered by updates. For example, if virt-handler is updated before the daemon set that installs the NVIDIA drivers, the nodes cannot provide virtual machine GPUs. (BZ#2046298) As a workaround: 1. Remove mediatedDevicesConfiguration and permittedHostDevices from the HyperConverged CR. 2. Update both the mediatedDevicesConfiguration and permittedHostDevices stanzas with the configuration you want to use.
Story Points: ---
Last Closed: 2023-01-24 13:36:05 UTC
Type: Bug

Description Kedar Bidarkar 2022-01-26 14:31:30 UTC
Description of problem:
mdevs are not configured even after the drivers are installed, if the mdev config was added to the HCO CR before the drivers were installed.

Version-Release number of selected component (if applicable):
4.10.0

How reproducible:
1) Add the mdev config to the HCO CR before the drivers are installed.
2) The mdevs never get configured, even after the drivers are installed.



Steps to Reproduce:
0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config (a sketch of applying it with oc follows these steps).
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
      - nvidia-232
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
    - mdevNameSelector: "GRID T4-4Q"
      resourceName: "nvidia.com/GRID_T4_4Q"
2. Remove the above HCO CR entry.
3. Configure the GPU nodes with the NVIDIA Drivers.
4. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
      - nvidia-232
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
    - mdevNameSelector: "GRID T4-4Q"
      resourceName: "nvidia.com/GRID_T4_4Q"
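
For reference, one way to apply the stanza above is with an oc merge patch. This is a minimal sketch, not taken from the report; it assumes the CR is named kubevirt-hyperconverged in the openshift-cnv namespace, which matches the oc edit command in comment 10. Editing .spec directly with "oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv" works just as well.

$ # Add the mdev configuration under .spec of the HyperConverged CR (JSON merge patch)
$ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge -p '{
    "spec": {
      "mediatedDevicesConfiguration": {
        "mediatedDevicesTypes": ["nvidia-231", "nvidia-232"]
      },
      "permittedHostDevices": {
        "mediatedDevices": [
          {"mdevNameSelector": "GRID T4-2Q", "resourceName": "nvidia.com/GRID_T4_2Q"},
          {"mdevNameSelector": "GRID T4-4Q", "resourceName": "nvidia.com/GRID_T4_4Q"}
        ]
      }
    }
  }'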

Actual results:
The mdevs are not configured even after the drivers are installed, because the mdev config was added to the HCO CR before the drivers were installed.

Expected results:
The mdevs should always be configured once the drivers are installed, regardless of when the mdev config was added to the HCO CR.

Additional info:

Comment 1 Kedar Bidarkar 2022-01-26 14:50:01 UTC
Correction to the Steps to Reproduce:


0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
2. Configure the GPU nodes with the NVIDIA Drivers.
3. Remove the above HCO CR entry.
4. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"

Comment 3 Kedar Bidarkar 2022-02-01 22:14:17 UTC
The current workaround is to:
1) Remove the "mediatedDevicesConfiguration" and "permittedHostDevices" entries from the HCO CR, and
2) Update the HCO CR again with the desired "mediatedDevicesConfiguration" and "permittedHostDevices" configuration (both steps are sketched with oc below).
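
A minimal sketch of the two workaround steps with oc patch (not from the report); it assumes the default CR name kubevirt-hyperconverged in the openshift-cnv namespace, as seen in comment 10. Deleting and re-adding the stanzas by hand with oc edit achieves the same thing.

$ # 1) Drop both stanzas from the HCO CR spec (a JSON-patch "remove" errors out if the path is already absent)
$ oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=json \
    -p '[{"op": "remove", "path": "/spec/mediatedDevicesConfiguration"},
         {"op": "remove", "path": "/spec/permittedHostDevices"}]'
$ # 2) Re-apply the desired configuration, e.g. with the merge patch sketched under the
$ #    original Steps to Reproduce, or by pasting the stanzas back in with oc edit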

Comment 5 sgott 2022-05-27 13:27:00 UTC
Deferring this to the next release due to bandwidth.

Comment 6 Jed Lejosne 2022-09-07 19:28:39 UTC
@kbidarka the steps to reproduce the issue include removing and re-adding the CR entries after installing the drivers.
Then the workaround is to remove and re-add the CR entries after installing the drivers...
I'm confused, the steps to reproduce the issue and the steps to fix it seem identical!
Please clarify, thanks!

Comment 7 Kedar Bidarkar 2022-10-27 10:18:52 UTC
I should have split the reproducer and the workaround separately:

0) Do not configure the GPU nodes with the NVIDIA Drivers.
1. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
2. Configure the GPU nodes with the NVIDIA Drivers.

Notice that the MDEV devices are not created successfully, because the NVIDIA drivers were installed after the HCO CR was updated with mediatedDevicesConfiguration. A node-level check for this (and for confirming the workaround below) is sketched after the workaround steps.

Workaround:

1. Remove the above HCO CR entry.
2. Update HCO CR with the below config.
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
      - nvidia-231
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID T4-2Q"
      resourceName: "nvidia.com/GRID_T4_2Q"
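
To see the failure and to confirm the workaround took effect, a quick node-level check (a sketch; <gpu-node> is a placeholder for the GPU node name): before the workaround the nvidia.com mdev resource is missing or reports 0 in the node's Capacity/Allocatable, and after the workaround it should report a non-zero count, as in the oc describe output in comment 10.

$ oc describe node <gpu-node> | grep -i 'nvidia.com/'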

Comment 8 Antonio Cardace 2022-11-18 16:07:16 UTC
Deferring to 4.12.1 as it's already in the 4.12.0 (release-0.58) merge-pool https://github.com/kubevirt/kubevirt/pull/8809.

Comment 9 Antonio Cardace 2022-11-23 10:06:06 UTC
Moving to ON_QA and back to 4.12 as this got pulled in by a recent rebase that was required to include the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2139896.

Comment 10 Kedar Bidarkar 2022-11-30 08:27:34 UTC
[kbidarka@localhost auth]$ oc get pods -n nvidia-gpu-operator
NAME                             READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv    1/1     Running   0          19h
nvidia-sandbox-validator-jxcbn   1/1     Running   0          2m55s
nvidia-vfio-manager-qtcnf        1/1     Running   0          3m31s

[kbidarka@localhost auth]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
  ...
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
    - nvidia-182
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: GRID V100D-4Q
      resourceName: nvidia.com/GRID_V100D_4Q
  ...

----------------------------

]$ oc label node node21.redhat.com --overwrite nvidia.com/gpu.workload.config=vm-vgpu
node/node21.redhat.com labeled

]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS        RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running       0          19h
nvidia-sandbox-validator-jxcbn                              1/1     Terminating   0          9m45s
nvidia-vfio-manager-qtcnf                                   1/1     Terminating   0          10m
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   0/2     Init:0/1      0          4s


]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running    0          19h
nvidia-sandbox-validator-t8h6v                              0/1     Init:1/3   0          28s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running    0          64s


]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running    0          19h
nvidia-sandbox-validator-t8h6v                              0/1     Init:2/3   0          102s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running    0          2m18s

]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-m9fbv                               1/1     Running   0          19h
nvidia-sandbox-validator-t8h6v                              1/1     Running   0          2m12s
nvidia-vgpu-manager-daemonset-412.86.202211142021-0-hrj9h   2/2     Running   0          2m48s

]$ oc describe node node21.redhat.com

Capacity:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              937156932Ki
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         131481720Ki
  nvidia.com/GRID_V100D_2Q:       0
  nvidia.com/GRID_V100D_4Q:       8
  nvidia.com/GV100GL_Tesla_V100:  0
  pods:                           250
Allocatable:
  cpu:                            79500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              862610085278
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         125612152Ki
  nvidia.com/GRID_V100D_2Q:       0
  nvidia.com/GRID_V100D_4Q:       8
  nvidia.com/GV100GL_Tesla_V100:  0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests     Limits
  --------                       --------     ------
  cpu                            1289m (1%)   4 (5%)
  memory                         7664Mi (6%)  2Gi (1%)
  ephemeral-storage              0 (0%)       0 (0%)
  hugepages-1Gi                  0 (0%)       0 (0%)
  hugepages-2Mi                  0 (0%)       0 (0%)
  devices.kubevirt.io/kvm        0            0
  devices.kubevirt.io/tun        0            0
  devices.kubevirt.io/vhost-net  0            0
  nvidia.com/GRID_V100D_2Q       0            0
  nvidia.com/GRID_V100D_4Q       0            0
  nvidia.com/GV100GL_Tesla_V100  0            0

Comment 11 Kedar Bidarkar 2022-11-30 08:33:34 UTC
Summary: As seen in comment 10,
mdevs now do get configured even when the mdev config is added to the HCO CR before the NVIDIA vGPU drivers are installed.

Verified with 4.12.0-741 Build.

Comment 15 errata-xmlrpc 2023-01-24 13:36:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408