Description of problem:
After configuring the NVIDIA GPU Operator on Ampere-based GPU cards, the following pods were not found:

nvidia-sandbox-device-plugin-daemonset-5rsv9
nvidia-sandbox-device-plugin-daemonset-q225z
nvidia-sandbox-validator-996wt
nvidia-sandbox-validator-shwj9

This is most likely because the "lspci" command is not available in the "openshift-driver-toolkit-ctr" container of pod "nvidia-vgpu-manager-daemonset-411.86.202208031059-0":

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -c openshift-driver-toolkit-ctr -f nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh | grep -A 5 "sriov-manage"
+ /usr/lib/nvidia/sriov-manage -e ALL
/usr/lib/nvidia/sriov-manage: line 259: lspci: command not found
+ return 0
Done, now waiting for signal
+ echo 'Done, now waiting for signal'
+ trap 'echo '\''Caught signal'\''; _shutdown; trap - EXIT; exit' HUP INT QUIT PIPE TERM
+ true

Version-Release number of selected component (if applicable):

How reproducible:
Always, when installing the NVIDIA GPU Operator on Ampere GPU architecture.

Steps to Reproduce:
1. Install the NVIDIA GPU Operator on a cluster with Ampere-based GPU cards.
2. Check the logs of the "openshift-driver-toolkit-ctr" container in the nvidia-vgpu-manager-daemonset pods.
3. Observe that the nvidia-sandbox-device-plugin-daemonset and nvidia-sandbox-validator pods are not created.

Actual results:
+ /usr/lib/nvidia/sriov-manage -e ALL
/usr/lib/nvidia/sriov-manage: line 259: lspci: command not found

Expected results:
+ /usr/lib/nvidia/sriov-manage -e ALL
The above command should complete without errors.

Additional info:
Workaround:
1) Install the pciutils package in the "openshift-driver-toolkit-ctr" container:

oc -n nvidia-gpu-operator exec pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'
oc -n nvidia-gpu-operator exec pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-vk7pw -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'

2) Then label the nodes with "nvidia.com/vgpu.config=<MDEV-TYPE>":

oc label node node32.redhat.com --overwrite nvidia.com/vgpu.config=A2-2Q
oc label node node33.redhat.com --overwrite nvidia.com/vgpu.config=A2-2Q
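The per-pod workaround above can be wrapped in a small helper that applies the pciutils install to each vGPU manager pod. This is a hypothetical convenience sketch, not part of the GPU Operator; the helper name, the pod-name arguments, and the DRY_RUN toggle are assumptions for illustration.

```shell
# Hypothetical helper: apply the pciutils workaround to each vGPU manager
# pod passed as an argument. Setting DRY_RUN=1 only prints the oc commands
# instead of executing them against a cluster.
apply_lspci_workaround() {
  ns="nvidia-gpu-operator"
  for pod in "$@"; do
    cmd="oc -n $ns exec pod/$pod -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'"
    if [ -n "${DRY_RUN:-}" ]; then
      echo "$cmd"      # dry run: show what would be executed
    else
      eval "$cmd"      # real run: install pciutils and re-run sriov-manage
    fi
  done
}
```

Example dry run: `DRY_RUN=1 apply_lspci_workaround nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh` prints the oc exec command without touching the cluster.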
Tested with the following:
OpenShift: v4.11.7
OpenShift Virtualization (CNV): v4.11.0
NVIDIA GPU Operator: v22.9.0
NVIDIA GPU hardware: Ampere A2 cards
This was fixed in the merge request below, by installing pciutils in the DTK container:
https://gitlab.com/nvidia/container-images/driver/-/merge_requests/199
Getting access to a cluster with Ampere GPU cards will take time, but we do plan to verify this during 4.12.0 itself. Also, moving this bug to ON_QA so that we can track it.
[kbidarka@localhost nvidia-gpu-operator]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
  mediatedDevicesConfiguration:
    mediatedDevicesTypes:
    - nvidia-745
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: NVIDIA A2-2Q
      resourceName: nvidia.com/GRID_A2_2Q

]$ oc describe node cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com
Capacity:
  cpu:                            80
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              584963052Ki
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         65419676Ki
  nvidia.com/GRID_A2_2Q:          8
  pods:                           250
Allocatable:
  cpu:                            79500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              538028206007
  hugepages-1Gi:                  4Gi
  hugepages-2Mi:                  512Mi
  memory:                         59550108Ki
  nvidia.com/GRID_A2_2Q:          8
  pods:                          250

[kbidarka@localhost nvidia-gpu-operator]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-zcxhg                               1/1     Running   0          34m
nvidia-sandbox-validator-nbnhz                              1/1     Running   0          25m
nvidia-sandbox-validator-ntcld                              1/1     Running   0          25m
nvidia-vgpu-manager-daemonset-412.86.202211290909-0-bcp5z   2/2     Running   0          26m
nvidia-vgpu-manager-daemonset-412.86.202211290909-0-z6dd5   2/2     Running   0          26m

[kbidarka@localhost nvidia-gpu-operator]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-rc.3   True        False         4h47m   Cluster version is 4.12.0-rc.3

[kbidarka@localhost nvidia-gpu-operator]$ oc get csv -n openshift-cnv
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
...
kubevirt-hyperconverged-operator.v4.12.0   OpenShift Virtualization   4.12.0    kubevirt-hyperconverged-operator.v4.11.0   Succeeded
...

]$ oc debug node/cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com
Starting pod/cnv-qe-infra-32cnvqe2labengrdu2redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.1.156.40
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# lspci -nnk | NVIDIA
sh: NVIDIA: command not found
sh-4.4# lspci -nnk | grep NVIDIA
d8:00.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:157e]
d8:00.4 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.5 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.6 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.7 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.1 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.2 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.4 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.5 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.6 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.7 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.1 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.2 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]

Summary: The sriov-manage command now runs successfully and enables the SR-IOV virtual functions on the Ampere GPU cards, as seen in the output above.
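As a quick sanity check, the advertised vGPU resource count shown by `oc describe node` above can be extracted with a small filter. The helper name and the hard-coded resource name nvidia.com/GRID_A2_2Q (matching the A2-2Q profile used here) are illustrative assumptions, not part of any tooling:

```shell
# Hypothetical helper: read `oc describe node` output on stdin and print the
# advertised nvidia.com/GRID_A2_2Q resource count (first match, i.e. Capacity).
vgpu_count() {
  awk '/nvidia.com\/GRID_A2_2Q:/ { print $2; exit }'
}
```

Example: `oc describe node cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com | vgpu_count` would print 8 for the node above.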
[kbidarka@localhost ocs]$ virtctl console vm1-rhel86-ocs
Successfully connected to vm1-rhel86-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.6 (Ootpa)
Kernel 4.18.0-372.32.1.el8_6.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm1-rhel86-ocs login: cloud-user
Password:
[cloud-user@vm1-rhel86-ocs ~]$ lspci -nnk | grep NVIDIA
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:1649]
[cloud-user@vm1-rhel86-ocs ~]$
[kbidarka@localhost ocs]$
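The lspci outputs above (host and guest) can be tallied with a small filter to confirm how many NVIDIA GPU functions are visible; the helper name and the grep pattern are illustrative assumptions:

```shell
# Hypothetical helper: count NVIDIA GPU functions (physical and virtual) in
# `lspci -nn` output piped on stdin, matching both "3D controller [0302]" and
# "VGA compatible controller [0300]" NVIDIA entries.
count_gpu_functions() {
  grep -c 'controller \[03..\]: NVIDIA Corporation'
}
```

Example: `lspci -nnk | count_gpu_functions` on the host above would report the PF plus its 16 virtual functions; inside the VM it would report 1.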
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408