Description of problem:
After configuring the NVIDIA GPU Operator on Ampere-based GPU cards, the following pods were not found:

nvidia-sandbox-device-plugin-daemonset-5rsv9
nvidia-sandbox-device-plugin-daemonset-q225z
nvidia-sandbox-validator-996wt
nvidia-sandbox-validator-shwj9

This is most likely because the "lspci" command is not available in the "openshift-driver-toolkit-ctr" container of pod "nvidia-vgpu-manager-daemonset-411.86.202208031059-0":

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -c openshift-driver-toolkit-ctr -f nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh | grep -A 5 "sriov-manage"
+ /usr/lib/nvidia/sriov-manage -e ALL
/usr/lib/nvidia/sriov-manage: line 259: lspci: command not found
+ return 0
Done, now waiting for signal
+ echo 'Done, now waiting for signal'
+ trap 'echo '\''Caught signal'\''; _shutdown; trap - EXIT; exit' HUP INT QUIT PIPE TERM
+ true

Version-Release number of selected component (if applicable):

How reproducible:
Always, when installing the NVIDIA GPU Operator on Ampere GPU architecture.

Steps to Reproduce:
1. Install the NVIDIA GPU Operator on a cluster with Ampere-based GPU cards.
2. Check the logs of the "openshift-driver-toolkit-ctr" container in the nvidia-vgpu-manager-daemonset pods.
3. Observe that the nvidia-sandbox-device-plugin-daemonset and nvidia-sandbox-validator pods are not created.

Actual results:
+ /usr/lib/nvidia/sriov-manage -e ALL
/usr/lib/nvidia/sriov-manage: line 259: lspci: command not found

Expected results:
+ /usr/lib/nvidia/sriov-manage -e ALL
The above command should complete without errors.

Additional info:
Workaround:
1) Install the pciutils package in the "openshift-driver-toolkit-ctr" container:

oc -n nvidia-gpu-operator exec pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'
oc -n nvidia-gpu-operator exec pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-vk7pw -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'

2) Then label the nodes with "nvidia.com/vgpu.config=<MDEV-TYPE>":

oc label node node32.redhat.com --overwrite nvidia.com/vgpu.config=A2-2Q
oc label node node33.redhat.com --overwrite nvidia.com/vgpu.config=A2-2Q
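The per-pod workaround above can be wrapped in a small helper that applies the pciutils install to each vGPU manager pod. This is a hypothetical convenience sketch, not part of the GPU Operator; the helper name, the pod-name arguments, and the DRY_RUN toggle are assumptions for illustration.

```shell
# Hypothetical helper: apply the pciutils workaround to each vGPU manager
# pod passed as an argument. Setting DRY_RUN=1 only prints the oc commands
# instead of executing them against a cluster.
apply_lspci_workaround() {
  ns="nvidia-gpu-operator"
  for pod in "$@"; do
    cmd="oc -n $ns exec pod/$pod -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'"
    if [ -n "${DRY_RUN:-}" ]; then
      echo "$cmd"      # dry run: show what would be executed
    else
      eval "$cmd"      # real run: install pciutils and re-run sriov-manage
    fi
  done
}
```

Example dry run: `DRY_RUN=1 apply_lspci_workaround nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh` prints the oc exec command without touching the cluster.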
Tested with the following:
OpenShift: v4.11.7
OpenShift Virtualization (CNV): v4.11.0
NVIDIA GPU Operator: v22.9.0
NVIDIA GPU hardware: Ampere A2 cards
This was fixed in the merge request below, by installing pciutils in the DTK container:
https://gitlab.com/nvidia/container-images/driver/-/merge_requests/199
Getting access to a cluster with Ampere GPU cards will take time, but we do plan to verify this during 4.12.0 itself. Also, moving this bug to ON_QA so that we can track it.
[kbidarka@localhost nvidia-gpu-operator]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
    - nvidia-745
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: NVIDIA A2-2Q
      resourceName: nvidia.com/GRID_A2_2Q

]$ oc describe node cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com
Capacity:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              584963052Ki
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         65419676Ki
  nvidia.com/GRID_A2_2Q:          8
  pods:                           250
Allocatable:
  cpu:                            79500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              538028206007
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         59550108Ki
  nvidia.com/GRID_A2_2Q:          8
  pods:                          250

[kbidarka@localhost nvidia-gpu-operator]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-zcxhg                               1/1     Running   0          34m
nvidia-sandbox-validator-nbnhz                              1/1     Running   0          25m
nvidia-sandbox-validator-ntcld                              1/1     Running   0          25m
nvidia-vgpu-manager-daemonset-412.86.202211290909-0-bcp5z   2/2     Running   0          26m
nvidia-vgpu-manager-daemonset-412.86.202211290909-0-z6dd5   2/2     Running   0          26m

[kbidarka@localhost nvidia-gpu-operator]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-rc.3   True        False         4h47m   Cluster version is 4.12.0-rc.3

[kbidarka@localhost nvidia-gpu-operator]$ oc get csv -n openshift-cnv
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
...
kubevirt-hyperconverged-operator.v4.12.0   OpenShift Virtualization   4.12.0    kubevirt-hyperconverged-operator.v4.11.0   Succeeded
...

]$ oc debug node/cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com
Starting pod/cnv-qe-infra-32cnvqe2labengrdu2redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.1.156.40
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# lspci -nnk | NVIDIA
sh: NVIDIA: command not found
sh-4.4# lspci -nnk | grep NVIDIA
d8:00.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:157e]
d8:00.4 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.5 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.6 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.7 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.1 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.2 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.4 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.5 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.6 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.7 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.1 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.2 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]

Summary: The sriov-manage command now runs successfully and enables the SR-IOV virtual functions on the Ampere GPU cards, as seen in the output above.
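As a quick sanity check, the advertised vGPU resource count shown by `oc describe node` above can be extracted with a small filter. The helper name and the hard-coded resource name nvidia.com/GRID_A2_2Q (matching the A2-2Q profile used here) are illustrative assumptions, not part of any tooling:

```shell
# Hypothetical helper: read `oc describe node` output on stdin and print the
# advertised nvidia.com/GRID_A2_2Q resource count (first match, i.e. Capacity).
vgpu_count() {
  awk '/nvidia.com\/GRID_A2_2Q:/ { print $2; exit }'
}
```

Example: `oc describe node cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com | vgpu_count` would print 8 for the node above.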
[kbidarka@localhost ocs]$ virtctl console vm1-rhel86-ocs
Successfully connected to vm1-rhel86-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.6 (Ootpa)
Kernel 4.18.0-372.32.1.el8_6.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm1-rhel86-ocs login: cloud-user
Password:
[cloud-user@vm1-rhel86-ocs ~]$ lspci -nnk | grep NVIDIA
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:1649]
[cloud-user@vm1-rhel86-ocs ~]$
[kbidarka@localhost ocs]$
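The lspci outputs above (host and guest) can be tallied with a small filter to confirm how many NVIDIA GPU functions are visible; the helper name and the grep pattern are illustrative assumptions:

```shell
# Hypothetical helper: count NVIDIA GPU functions (physical and virtual) in
# `lspci -nn` output piped on stdin, matching both "3D controller [0302]" and
# "VGA compatible controller [0300]" NVIDIA entries.
count_gpu_functions() {
  grep -c 'controller \[03..\]: NVIDIA Corporation'
}
```

Example: `lspci -nnk | count_gpu_functions` on the host above would report the PF plus its 16 virtual functions; inside the VM it would report 1.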
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408