Bug 2184435
| Summary: | [cnv-4.12] virt-handler should not delete any pre-configured mediated devices if these are provided by an external provider | ||
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Kedar Bidarkar <kbidarka> |
| Component: | Virtualization | Assignee: | Antonio Cardace <acardace> |
| Status: | VERIFIED --- | QA Contact: | Kedar Bidarkar <kbidarka> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.12.0 | CC: | egallen, sgott |
| Target Milestone: | --- | ||
| Target Release: | 4.12.3 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | hco-bundle-registry-container-v4.12.3-70 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2184440 | ||
Description
Kedar Bidarkar
2023-04-04 16:51:50 UTC
Created manual backport at https://github.com/kubevirt/kubevirt/pull/9690.

[kbidarka@localhost nvidia-gpu-operator]$ oc get pods
NAME READY STATUS RESTARTS AGE
...
virt-handler-8frk4 1/1 Running 0 30m
virt-handler-f8nrq 1/1 Running 0 30m
virt-handler-fbnbj 1/1 Running 0 30m
...
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-8frk4 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-f8nrq | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-fbnbj | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
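Because `oc logs -f` follows the log stream until interrupted (hence the `^C` after each command above), a non-following variant can check every virt-handler pod in one pass. A minimal sketch, assuming the virt-handler pods live in `openshift-cnv` and carry the label `kubevirt.io=virt-handler` (both assumptions to verify on your cluster); `count_mdev_removals` and `check_virt_handlers` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: count "Successfully removed mdev" lines in virt-handler
# logs without following them, so no ^C is needed.

# Read log text on stdin and print the number of matching lines (0 if none).
count_mdev_removals() {
  # grep -c prints 0 but exits 1 on no match; "|| true" keeps the exit status clean.
  grep -c "Successfully removed mdev" || true
}

# Print "<pod>: <count>" for every virt-handler pod.
# Assumption: pods are labelled kubevirt.io=virt-handler; namespace defaults to openshift-cnv.
check_virt_handlers() {
  local ns="${1:-openshift-cnv}"
  local pod
  for pod in $(oc -n "$ns" get pods -l kubevirt.io=virt-handler -o name); do
    printf '%s: %s\n' "$pod" \
      "$(oc -n "$ns" logs "$pod" -c virt-handler | count_mdev_removals)"
  done
}

# Example (requires a logged-in oc session):
#   check_virt_handlers openshift-cnv
```

With the fix in place, every pod would be expected to report 0.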
[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node3.redhat.com
Temporary namespace openshift-debug-9xf2l is created for debugging node...
Starting pod/node3redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.5
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 f51d8e5d-158f-4eac-88c5-43e6cf353cd9 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.7/f51d8e5d-158f-4eac-88c5-43e6cf353cd9
lrwxrwxrwx. 1 root root 0 May 16 11:51 e13804af-d995-4e91-992c-c26250270d23 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.5/e13804af-d995-4e91-992c-c26250270d23
lrwxrwxrwx. 1 root root 0 May 16 11:51 ac3ed710-e370-4868-9a74-f89f9dca195f -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.4/ac3ed710-e370-4868-9a74-f89f9dca195f
lrwxrwxrwx. 1 root root 0 May 16 11:51 098c800c-a8c0-4973-8f98-3b713b6b385a -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.6/098c800c-a8c0-4973-8f98-3b713b6b385a
lrwxrwxrwx. 1 root root 0 May 16 11:51 f363a09b-cf41-446c-99f5-2c121d2c9558 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.0/f363a09b-cf41-446c-99f5-2c121d2c9558
lrwxrwxrwx. 1 root root 0 May 16 11:51 d358956e-2ae4-42fd-b937-9812b9c98512 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.1/d358956e-2ae4-42fd-b937-9812b9c98512
lrwxrwxrwx. 1 root root 0 May 16 11:51 bdd162c6-4c17-46d1-a757-8844503e16d4 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.3/bdd162c6-4c17-46d1-a757-8844503e16d4
lrwxrwxrwx. 1 root root 0 May 16 11:51 18600cc8-d85f-4c0e-b14f-bac37fbce62b -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.2/18600cc8-d85f-4c0e-b14f-bac37fbce62b
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
Temporary namespace openshift-debug-9xf2l was removed.
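Each entry under `/sys/bus/mdev/devices/` is a symlink to the mdev beneath its parent PCI function, and in the kernel's VFIO mediated-device sysfs layout each mdev has an `mdev_type` link whose (optional) `name` attribute holds the human-readable type name, in the same form as the `mdevNameSelector: NVIDIA A2-2Q` value in the configuration. A minimal sketch, meant to run on the node after `chroot /host`, that prints UUID, parent PCI address and type name; `list_mdevs` is a hypothetical helper, and its optional root argument only exists so it can be pointed at a test tree:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<uuid> <parent PCI address> <type name>" per mdev.
list_mdevs() {
  local root="${1:-/sys/bus/mdev/devices}"
  local dev uuid parent name
  for dev in "$root"/*; do
    [ -e "$dev" ] || continue   # unmatched glob: no mdevs present
    uuid=$(basename "$dev")
    # The resolved symlink ends in .../<parent PCI function>/<uuid>.
    parent=$(basename "$(dirname "$(readlink -f "$dev")")")
    # "name" is optional in the sysfs ABI, so fall back to "unknown".
    name=$(cat "$dev/mdev_type/name" 2>/dev/null) || name="unknown"
    printf '%s %s %s\n' "$uuid" "$parent" "$name"
  done
}

# Example on the node:
#   list_mdevs
# Per-type counts, to compare with the node capacity reported by oc describe node:
#   list_mdevs | cut -d' ' -f3- | sort | uniq -c
```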
[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node4.redhat.com
Temporary namespace openshift-debug-k76hh is created for debugging node...
Starting pod/node4redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 9e508fa5-0656-4a1c-9aad-59adfaa1cd01 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.4/9e508fa5-0656-4a1c-9aad-59adfaa1cd01
lrwxrwxrwx. 1 root root 0 May 16 11:51 927d2525-16aa-4da3-a2a2-dce06c6c9e22 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.5/927d2525-16aa-4da3-a2a2-dce06c6c9e22
lrwxrwxrwx. 1 root root 0 May 16 11:51 5423342f-f87f-45bf-9a87-2fde8de914b8 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.7/5423342f-f87f-45bf-9a87-2fde8de914b8
lrwxrwxrwx. 1 root root 0 May 16 11:51 090780e8-3648-4e57-a95d-f10ab6dfcc5c -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.6/090780e8-3648-4e57-a95d-f10ab6dfcc5c
lrwxrwxrwx. 1 root root 0 May 16 11:51 c7c9eb01-0b77-4e67-abb7-1be881bcb16b -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.3/c7c9eb01-0b77-4e67-abb7-1be881bcb16b
lrwxrwxrwx. 1 root root 0 May 16 11:51 c3467c3b-10e6-41f4-a5e3-79703fed18da -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.2/c3467c3b-10e6-41f4-a5e3-79703fed18da
lrwxrwxrwx. 1 root root 0 May 16 11:51 4e33d704-d839-4df8-b7f5-fec6926f3917 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.0/4e33d704-d839-4df8-b7f5-fec6926f3917
lrwxrwxrwx. 1 root root 0 May 16 11:51 37c54020-e74f-4d91-9c17-edfd2f59dace -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.1/37c54020-e74f-4d91-9c17-edfd2f59dace
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
Temporary namespace openshift-debug-k76hh was removed.
[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node2.redhat.com
Temporary namespace openshift-debug-sthjh is created for debugging node...
Starting pod/node2redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 a52a1351-dc8b-49ce-8861-2d1625fbde64 -> ../../../devices/pci0000:c9/0000:c9:02.0/0000:ca:00.4/a52a1351-dc8b-49ce-8861-2d1625fbde64
lrwxrwxrwx. 1 root root 0 May 16 11:51 92f13c93-6a8c-492a-863f-da31a57d18ce -> ../../../devices/pci0000:c9/0000:c9:02.0/0000:ca:00.5/92f13c93-6a8c-492a-863f-da31a57d18ce
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
Temporary namespace openshift-debug-sthjh was removed.
[kbidarka@localhost nvidia-gpu-operator]$ oc -n openshift-cnv get kubevirt kubevirt-kubevirt-hyperconverged -o yaml | grep -A 7 permittedHostDevices
    permittedHostDevices:
      mediatedDevices:
      - externalResourceProvider: true
        mdevNameSelector: NVIDIA A2-2Q
        resourceName: nvidia.com/GRID_A2_2Q
[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node2.redhat.com | grep nvidia
nvidia.com/NVIDIA_A30-12C: 2
nvidia.com/NVIDIA_A30-12C: 2
nvidia.com/NVIDIA_A30-12C 0 0
[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node3.redhat.com | grep nvidia
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q 0 0
[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node4.redhat.com | grep nvidia
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q 0 0
[kbidarka@localhost nvidia-gpu-operator]$
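There was a typo in the configured resourceName (nvidia.com/GRID_A2_2Q instead of nvidia.com/NVIDIA_A2-2Q, the name the nodes advertise), which I fixed: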
[kbidarka@localhost nvidia-gpu-operator]$ oc -n openshift-cnv get kubevirt kubevirt-kubevirt-hyperconverged -o yaml | grep -A 7 permittedHostDevices
    permittedHostDevices:
      mediatedDevices:
      - externalResourceProvider: true
        mdevNameSelector: NVIDIA A2-2Q
        resourceName: nvidia.com/NVIDIA_A2-2Q
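The typo was a mismatch between the configured `resourceName` and the resource name the nodes advertise (`nvidia.com/NVIDIA_A2-2Q` in the `oc describe node` output). A minimal sketch of that comparison: the hypothetical helper `missing_resources` takes configured names as arguments and advertised names on stdin, and prints each configured name that no node advertises. The `jq` pipeline in the example comment is an assumption about the tooling available on the client:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print configured resource names that no node advertises.
# Arguments: configured names. Stdin: advertised names, one per line.
missing_resources() {
  local advertised name
  advertised=$(cat)
  for name in "$@"; do
    # -x: whole-line match, -F: fixed string (names contain "." and "/").
    if ! grep -qxF -- "$name" <<<"$advertised"; then
      printf '%s\n' "$name"
    fi
  done
}

# Example (assumes oc and jq are available):
#   configured=$(oc -n openshift-cnv get kubevirt kubevirt-kubevirt-hyperconverged \
#     -o jsonpath='{.spec.configuration.permittedHostDevices.mediatedDevices[*].resourceName}')
#   oc get nodes -o json | jq -r '.items[].status.allocatable | keys[]' | sort -u \
#     | missing_resources $configured
```

Against the original configuration this would print `nvidia.com/GRID_A2_2Q`; after the fix it would print nothing.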
[kbidarka@localhost watchdog]$ oc get vmi
NAME AGE PHASE IP NODENAME READY
vm2-rhel87 32s Running 10.xx.xx.xx node3.redhat.com True
[kbidarka@localhost watchdog]$ virtctl console vm2-rhel87
Successfully connected to vm2-rhel87 console. The escape sequence is ^]
Red Hat Enterprise Linux 8.7 (Ootpa)
Kernel 4.18.0-425.13.1.el8_7.x86_64 on an x86_64
Activate the web console with: systemctl enable --now cockpit.socket
vm2-rhel87 login: cloud-user
Password:
[cloud-user@vm2-rhel87 ~]$ lspci -nnv | grep NVIDIA
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:1649]
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-8frk4 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-f8nrq | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-fbnbj | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost watchdog]$ oc get vmi
NAME AGE PHASE IP NODENAME READY
vm2-rhel87 11m Running 10.xx.xx.xx node03.redhat.com True
---
We no longer see the message "Successfully removed mdev" in any of the virt-handler pod logs, even with a VM using an NVIDIA A2-2Q mdev running. Moving this bug to VERIFIED state.
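The verification above relies on the absence of a log message. A complementary check, sketched here under the assumption that it runs on the node (after `chroot /host`), is to snapshot the mdev UUIDs before and after restarting virt-handler and report any that disappeared. `snapshot_mdevs` and `removed_mdevs` are hypothetical helpers:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: detect mdevs that disappeared between two snapshots.

# Print one mdev UUID per line (sorted) for the given sysfs directory.
snapshot_mdevs() {
  local root="${1:-/sys/bus/mdev/devices}"
  ls -1 "$root" 2>/dev/null | sort
}

# Print UUIDs present in the first snapshot file but missing from the second.
removed_mdevs() {
  # comm -23 keeps only lines unique to the first input; both inputs are sorted.
  comm -23 <(sort "$1") <(sort "$2")
}

# Example on the node:
#   snapshot_mdevs > /tmp/mdevs.before
#   (delete the virt-handler pod for this node and wait until it is Ready again)
#   snapshot_mdevs > /tmp/mdevs.after
#   removed_mdevs /tmp/mdevs.before /tmp/mdevs.after   # no output: nothing removed
```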