Bug 2128107
Summary: | sriov-manage command fails to enable SRIOV Virtual functions on the Ampere GPU Cards | | |
---|---|---|---|
Product: | Container Native Virtualization (CNV) | Reporter: | Kedar Bidarkar <kbidarka> |
Component: | Virtualization | Assignee: | sgott |
Status: | CLOSED ERRATA | QA Contact: | Kedar Bidarkar <kbidarka> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | | |
Version: | 4.11.0 | Keywords: | Reopened |
Target Milestone: | --- | | |
Target Release: | 4.12.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2023-01-24 13:40:50 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Kedar Bidarkar
2022-09-19 21:30:41 UTC
Tested with the following:
OpenShift: v4.11.7
OpenShift Virt (CNV): v4.11.0
NVIDIA GPU Operator: v22.9.0
NVIDIA GPU H/W: Ampere A2 cards

This was fixed in the PR below by installing pciutils in the DTK container:
https://gitlab.com/nvidia/container-images/driver/-/merge_requests/199

Getting access to a cluster with Ampere GPU cards will take time, but we do plan to verify this during 4.12.0 itself. Also, moving this bug to ON_QA so that we can track this issue.

[kbidarka@localhost nvidia-gpu-operator]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
mediatedDevicesConfiguration:
  mediatedDevicesTypes:
  - nvidia-745
permittedHostDevices:
  mediatedDevices:
  - mdevNameSelector: NVIDIA A2-2Q
    resourceName: nvidia.com/GRID_A2_2Q

]$ oc describe node cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com
Capacity:
  cpu: 80
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 584963052Ki
  hugepages-1Gi: 4Gi
  hugepages-2Mi: 512Mi
  memory: 65419676Ki
  nvidia.com/GRID_A2_2Q: 8
  pods: 250
Allocatable:
  cpu: 79500m
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 538028206007
  hugepages-1Gi: 4Gi
  hugepages-2Mi: 512Mi
  memory: 59550108Ki
  nvidia.com/GRID_A2_2Q: 8
  pods: 250

[kbidarka@localhost nvidia-gpu-operator]$ oc get pods -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-6d67796f46-zcxhg                               1/1     Running   0          34m
nvidia-sandbox-validator-nbnhz                              1/1     Running   0          25m
nvidia-sandbox-validator-ntcld                              1/1     Running   0          25m
nvidia-vgpu-manager-daemonset-412.86.202211290909-0-bcp5z   2/2     Running   0          26m
nvidia-vgpu-manager-daemonset-412.86.202211290909-0-z6dd5   2/2     Running   0          26m
[kbidarka@localhost nvidia-gpu-operator]$

[kbidarka@localhost nvidia-gpu-operator]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-rc.3   True        False         4h47m   Cluster version is 4.12.0-rc.3

[kbidarka@localhost nvidia-gpu-operator]$ oc get csv -n openshift-cnv
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
...
kubevirt-hyperconverged-operator.v4.12.0   OpenShift Virtualization   4.12.0    kubevirt-hyperconverged-operator.v4.11.0   Succeeded
...

]$ oc debug node/cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com
Starting pod/cnv-qe-infra-32cnvqe2labengrdu2redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.1.156.40
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# lspci -nnk | NVIDIA
sh: NVIDIA: command not found
sh-4.4# lspci -nnk | grep NVIDIA
d8:00.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:157e]
d8:00.4 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.5 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.6 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.7 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.1 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.2 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.4 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.5 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.6 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.7 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.1 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.2 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:0000]

Summary: The sriov-manage command now works and enables the SR-IOV virtual functions on the Ampere GPU cards, as seen in the output above.

[kbidarka@localhost ocs]$ virtctl console vm1-rhel86-ocs
Successfully connected to vm1-rhel86-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.6 (Ootpa)
Kernel 4.18.0-372.32.1.el8_6.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm1-rhel86-ocs login: cloud-user
Password:
[cloud-user@vm1-rhel86-ocs ~]$ lspci -nnk | grep NVIDIA
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:1649]
[cloud-user@vm1-rhel86-ocs ~]$
[kbidarka@localhost ocs]$

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:0408
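The verification in this report relies on reading `lspci -nnk | grep NVIDIA` by eye on the node. As a rough sketch only, the same check could be scripted as below. The function names `count_nvidia_functions` and `has_virtual_functions` are hypothetical and not part of any NVIDIA or OpenShift tooling. The second helper assumes a node with a single physical GPU, as in the session above, so that more than one NVIDIA PCI function implies virtual functions are present; that assumption does not hold on multi-GPU nodes.

```shell
# Hypothetical helper (not part of any NVIDIA or OpenShift tool).
# Reads `lspci -nnk` output on stdin and prints how many PCI functions
# belong to NVIDIA. Only lines that start with a PCI address are counted,
# so the indented "Subsystem:" and "Kernel driver" lines are ignored.
count_nvidia_functions() {
  grep -E '^([0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7] ' \
    | grep -c 'NVIDIA Corporation' || true   # grep -c still prints 0 on no match
}

# Hypothetical helper. Succeeds when more than one NVIDIA function is listed.
# ASSUMPTION: the node has exactly one physical GPU, so any extra functions
# are SR-IOV virtual functions.
has_virtual_functions() {
  total=$(count_nvidia_functions)
  [ "$total" -gt 1 ]
}
```

Fed the node output shown in this report, `count_nvidia_functions` would print 17: the physical function at d8:00.0 plus 16 further functions. On a live cluster the input could come from something like `oc debug node/<node-name> -- chroot /host lspci -nnk`, where `<node-name>` is a placeholder.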