Bug 2128107
| Summary: | sriov-manage command fails to enable SRIOV Virtual functions on the Ampere GPU Cards | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Kedar Bidarkar <kbidarka> |
| Component: | Virtualization | Assignee: | sgott |
| Status: | CLOSED ERRATA | QA Contact: | Kedar Bidarkar <kbidarka> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11.0 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-24 13:40:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Kedar Bidarkar
2022-09-19 21:30:41 UTC
Tested with the following:
OpenShift: v4.11.7
OpenShift Virt (CNV): v4.11.0
NVIDIA GPU Operator: v22.9.0
NVIDIA GPU H/W: Ampere A2 cards

This was fixed in the merge request below by installing pciutils in the DTK container:
https://gitlab.com/nvidia/container-images/driver/-/merge_requests/199

Getting access to a cluster with Ampere GPU cards will take time, but we do plan to verify this during 4.12.0 itself. Also moving this bug to ON_QA so that we can track the issue.

[kbidarka@localhost nvidia-gpu-operator]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml
mediatedDevicesConfiguration:
  mediatedDevicesTypes:
  - nvidia-745
permittedHostDevices:
  mediatedDevices:
  - mdevNameSelector: NVIDIA A2-2Q
    resourceName: nvidia.com/GRID_A2_2Q
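
For anyone reproducing this without the interactive `oc edit`, the same configuration could probably be applied as a merge patch. The sketch below only builds the JSON document. The helper name `hco_mdev_patch` is hypothetical, and it assumes the two stanzas shown above sit under `spec` in the HyperConverged resource (the excerpt above does not show the parent key). Values are inserted verbatim, so they must not contain double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a JSON merge patch for the HyperConverged CR.
# Arguments: <mdev type> <mdevNameSelector> <resourceName>
# Assumption: both stanzas live under .spec; values are not JSON-escaped.
hco_mdev_patch() {
    printf '{"spec":{"mediatedDevicesConfiguration":{"mediatedDevicesTypes":["%s"]},"permittedHostDevices":{"mediatedDevices":[{"mdevNameSelector":"%s","resourceName":"%s"}]}}}\n' \
        "$1" "$2" "$3"
}
```

It could then be used as `oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge -p "$(hco_mdev_patch nvidia-745 'NVIDIA A2-2Q' nvidia.com/GRID_A2_2Q)"`. Note that a merge patch replaces each list wholesale, so any mediated device types or permitted devices already configured would need to be included in the patch as well.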
]$ oc describe node cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com
Capacity:
  cpu: 80
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 584963052Ki
  hugepages-1Gi: 4Gi
  hugepages-2Mi: 512Mi
  memory: 65419676Ki
  nvidia.com/GRID_A2_2Q: 8
  pods: 250
Allocatable:
  cpu: 79500m
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 538028206007
  hugepages-1Gi: 4Gi
  hugepages-2Mi: 512Mi
  memory: 59550108Ki
  nvidia.com/GRID_A2_2Q: 8
  pods: 250
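
To check the vGPU resource without reading the whole node description, the Allocatable value can be filtered out of the `oc describe node` output. This is a minimal sketch. The function name `allocatable_of` is hypothetical, and it assumes the `name: value` layout shown above, with the resource appearing after the `Allocatable:` header.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `oc describe node` output on stdin and print the
# Allocatable value of the resource named in $1 (prints nothing if absent).
allocatable_of() {
    awk -v res="$1" '
        $1 == "Allocatable:"    { in_alloc = 1; next }   # section header seen
        in_alloc && $1 == res ":" { print $2; exit }     # first match after header
    '
}
```

Example: `oc describe node cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com | allocatable_of nvidia.com/GRID_A2_2Q` should print `8` for the node above. A JSONPath query such as `oc get node <node> -o jsonpath='{.status.allocatable.nvidia\.com/GRID_A2_2Q}'` is likely a sturdier alternative, since it does not depend on the text layout.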
[kbidarka@localhost nvidia-gpu-operator]$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-6d67796f46-zcxhg 1/1 Running 0 34m
nvidia-sandbox-validator-nbnhz 1/1 Running 0 25m
nvidia-sandbox-validator-ntcld 1/1 Running 0 25m
nvidia-vgpu-manager-daemonset-412.86.202211290909-0-bcp5z 2/2 Running 0 26m
nvidia-vgpu-manager-daemonset-412.86.202211290909-0-z6dd5 2/2 Running 0 26m
[kbidarka@localhost nvidia-gpu-operator]$
[kbidarka@localhost nvidia-gpu-operator]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-rc.3 True False 4h47m Cluster version is 4.12.0-rc.3
[kbidarka@localhost nvidia-gpu-operator]$ oc get csv -n openshift-cnv
NAME DISPLAY VERSION REPLACES PHASE
...
kubevirt-hyperconverged-operator.v4.12.0 OpenShift Virtualization 4.12.0 kubevirt-hyperconverged-operator.v4.11.0 Succeeded
...
]$ oc debug node/cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com
Starting pod/cnv-qe-infra-32cnvqe2labengrdu2redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.1.156.40
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# lspci -nnk | NVIDIA
sh: NVIDIA: command not found
sh-4.4# lspci -nnk | grep NVIDIA
d8:00.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:157e]
d8:00.4 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.5 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.6 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:00.7 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.1 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.2 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.4 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.5 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.6 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:01.7 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.0 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.1 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.2 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
d8:02.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
Summary: The sriov-manage command now works and enables the SR-IOV virtual functions on the Ampere GPU cards. The lspci output above shows the physical function at d8:00.0 plus 16 additional functions, d8:00.4 through d8:02.3.
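
The same check can be scripted by counting the NVIDIA PCI functions in the `lspci -nnk` output. This is only a rough sketch. The function name `count_nvidia_functions` is hypothetical, and it treats every device line mentioning "NVIDIA Corporation" as one function without telling physical and virtual functions apart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -nnk` output on stdin and print how many
# PCI functions belong to NVIDIA. Device lines start with a PCI address such
# as d8:00.4; "Subsystem:" lines are skipped because their first field is
# not an address.
count_nvidia_functions() {
    awk '$1 ~ /^[0-9a-f:]+\.[0-7]$/ && /NVIDIA Corporation/ { n++ }
         END { print n + 0 }'
}
```

On the node above, `lspci -nnk | count_nvidia_functions` would be expected to print `17`: one physical function and 16 virtual functions. A count of `1` per card would suggest the virtual functions were not enabled. Reading `/sys/bus/pci/devices/0000:d8:00.0/sriov_numvfs` on the host (assuming PCI domain `0000`) is another way to see how many virtual functions the kernel has enabled.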
[kbidarka@localhost ocs]$ virtctl console vm1-rhel86-ocs
Successfully connected to vm1-rhel86-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.6 (Ootpa)
Kernel 4.18.0-372.32.1.el8_6.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm1-rhel86-ocs login: cloud-user
Password:
[cloud-user@vm1-rhel86-ocs ~]$ lspci -nnk | grep NVIDIA
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1649]
[cloud-user@vm1-rhel86-ocs ~]$
[kbidarka@localhost ocs]$

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408