This bug was initially created as a copy of Bug #2169880

I am copying this bug because:

Description of problem:

Virt-handler deletes any pre-configured mediated device even if nothing is configured under spec.configuration.mediatedDevicesConfiguration.

On a default installation of OCP Virt 4.12, the virt-handler pod is deleting any mdev device that is created on the system. This is reproducible with an empty permittedHostDevices configuration:

permittedHostDevices: {}

and with the following config where externalResourceProvider=true is explicitly set for the mdev device.

permittedHostDevices:
  mediatedDevices:
  - externalResourceProvider: true
    mdevNameSelector: NVIDIA A10-24Q
    resourceName: nvidia.com/NVIDIA_A10-24Q

Consider the following pre-configured mdev (vGPU) devices:

[core@cnt-a100-bm ~]$ ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 Feb 8 18:46 63b0b313-a62f-4475-b274-c26dd7defbcd -> ../../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.4/63b0b313-a62f-4475-b274-c26dd7defbcd
lrwxrwxrwx. 1 root root 0 Feb 8 18:46 203276d5-ac06-4585-baf3-ff16e119d634 -> ../../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.5/203276d5-ac06-4585-baf3-ff16e119d634
lrwxrwxrwx. 1 root root 0 Feb 8 18:46 f4fd3e66-f062-45dd-8ec0-39a7a2201490 -> ../../../devices/pci0000:d7/0000:d7:00.0/0000:d8:00.5/f4fd3e66-f062-45dd-8ec0-39a7a2201490
lrwxrwxrwx. 1 root root 0 Feb 8 18:46 2bd56759-c812-4784-9aea-3d9df23d15d3 -> ../../../devices/pci0000:d7/0000:d7:00.0/0000:d8:00.4/2bd56759-c812-4784-9aea-3d9df23d15d3

The devices get deleted by virt-handler shortly after:

[core@cnt-a100-bm ~]$ ls -ltr /sys/bus/mdev/devices/
total 0

[core@cnt-a100-bm ~]$ oc logs -n openshift-cnv virt-handler-fp426
. . .
{"component":"virt-handler","level":"info","msg":"resyncing virt-launcher domains","pos":"cache.go:385","timestamp":"2023-02-08T18:44:44.363329Z"}
{"component":"virt-handler","level":"info","msg":"refreshed device plugins for permitted/forbidden host devices","pos":"device_controller.go:320","timestamp":"2023-02-08T18:45:07.702002Z"}
{"component":"virt-handler","level":"info","msg":"enabled device-plugins for: []","pos":"device_controller.go:321","timestamp":"2023-02-08T18:45:07.702053Z"}
{"component":"virt-handler","level":"info","msg":"disabled device-plugins for: []","pos":"device_controller.go:322","timestamp":"2023-02-08T18:45:07.702064Z"}
{"component":"virt-handler","level":"info","msg":"Successfully removed mdev 203276d5-ac06-4585-baf3-ff16e119d634","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.327329Z"}
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 203276d5-ac06-4585-baf3-ff16e119d634","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.327396Z"}
{"component":"virt-handler","level":"info","msg":"Successfully removed mdev 2bd56759-c812-4784-9aea-3d9df23d15d3","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.375248Z"}
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 2bd56759-c812-4784-9aea-3d9df23d15d3","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.375311Z"}
{"component":"virt-handler","level":"info","msg":"Successfully removed mdev 63b0b313-a62f-4475-b274-c26dd7defbcd","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.411902Z"}
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 63b0b313-a62f-4475-b274-c26dd7defbcd","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.411965Z"}
{"component":"virt-handler","level":"info","msg":"Successfully removed mdev f4fd3e66-f062-45dd-8ec0-39a7a2201490","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.444875Z"}
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: f4fd3e66-f062-45dd-8ec0-39a7a2201490","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.444917Z"}

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
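When triaging a node, it can help to pull the removed device UUIDs out of the virt-handler log instead of reading the JSON by eye. The following is a minimal sketch, not part of KubeVirt: the function name `removed_mdevs` and the `SAMPLE_LOG` list are hypothetical, and the message prefix is taken from the log excerpt above, so it is an assumption that other versions log the same text.

```python
import json

# Message prefix as it appears in the virt-handler log excerpt in this report.
# Assumption: other KubeVirt versions may word this message differently.
REMOVED_PREFIX = "Successfully removed mdev "


def removed_mdevs(log_lines):
    """Return the UUIDs named in 'Successfully removed mdev <uuid>' entries, in log order."""
    uuids = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Non-JSON output such as the 'Defaulted container ...' notice from oc.
            continue
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        if entry.get("component") == "virt-handler" and msg.startswith(REMOVED_PREFIX):
            uuids.append(msg[len(REMOVED_PREFIX):])
    return uuids


# Hypothetical sample built from lines quoted in this report.
SAMPLE_LOG = [
    'Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)',
    '{"component":"virt-handler","level":"info","msg":"Successfully removed mdev 203276d5-ac06-4585-baf3-ff16e119d634","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.327329Z"}',
    '{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 203276d5-ac06-4585-baf3-ff16e119d634","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.327396Z"}',
    '{"component":"virt-handler","level":"info","msg":"Successfully removed mdev 2bd56759-c812-4784-9aea-3d9df23d15d3","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.375248Z"}',
]

print(removed_mdevs(SAMPLE_LOG))
```

On the sample, only the two "Successfully removed mdev" entries are reported; the warning about the mdev type and the non-JSON notice are skipped. An empty result for a pod's log is the same signal the later `grep "Successfully removed mdev"` checks look for.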
Created manual backport at https://github.com/kubevirt/kubevirt/pull/9690.
[kbidarka@localhost nvidia-gpu-operator]$ oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
...
virt-handler-8frk4   1/1     Running   0          30m
virt-handler-f8nrq   1/1     Running   0          30m
virt-handler-fbnbj   1/1     Running   0          30m
...

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-8frk4 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-f8nrq | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-fbnbj | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C

[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node3.redhat.com
Temporary namespace openshift-debug-9xf2l is created for debugging node...
Starting pod/node3redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.5
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 f51d8e5d-158f-4eac-88c5-43e6cf353cd9 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.7/f51d8e5d-158f-4eac-88c5-43e6cf353cd9
lrwxrwxrwx. 1 root root 0 May 16 11:51 e13804af-d995-4e91-992c-c26250270d23 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.5/e13804af-d995-4e91-992c-c26250270d23
lrwxrwxrwx. 1 root root 0 May 16 11:51 ac3ed710-e370-4868-9a74-f89f9dca195f -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.4/ac3ed710-e370-4868-9a74-f89f9dca195f
lrwxrwxrwx. 1 root root 0 May 16 11:51 098c800c-a8c0-4973-8f98-3b713b6b385a -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.6/098c800c-a8c0-4973-8f98-3b713b6b385a
lrwxrwxrwx. 1 root root 0 May 16 11:51 f363a09b-cf41-446c-99f5-2c121d2c9558 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.0/f363a09b-cf41-446c-99f5-2c121d2c9558
lrwxrwxrwx. 1 root root 0 May 16 11:51 d358956e-2ae4-42fd-b937-9812b9c98512 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.1/d358956e-2ae4-42fd-b937-9812b9c98512
lrwxrwxrwx. 1 root root 0 May 16 11:51 bdd162c6-4c17-46d1-a757-8844503e16d4 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.3/bdd162c6-4c17-46d1-a757-8844503e16d4
lrwxrwxrwx. 1 root root 0 May 16 11:51 18600cc8-d85f-4c0e-b14f-bac37fbce62b -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.2/18600cc8-d85f-4c0e-b14f-bac37fbce62b
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
Temporary namespace openshift-debug-9xf2l was removed.

[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node4.redhat.com
Temporary namespace openshift-debug-k76hh is created for debugging node...
Starting pod/node4redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 9e508fa5-0656-4a1c-9aad-59adfaa1cd01 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.4/9e508fa5-0656-4a1c-9aad-59adfaa1cd01
lrwxrwxrwx. 1 root root 0 May 16 11:51 927d2525-16aa-4da3-a2a2-dce06c6c9e22 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.5/927d2525-16aa-4da3-a2a2-dce06c6c9e22
lrwxrwxrwx. 1 root root 0 May 16 11:51 5423342f-f87f-45bf-9a87-2fde8de914b8 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.7/5423342f-f87f-45bf-9a87-2fde8de914b8
lrwxrwxrwx. 1 root root 0 May 16 11:51 090780e8-3648-4e57-a95d-f10ab6dfcc5c -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.6/090780e8-3648-4e57-a95d-f10ab6dfcc5c
lrwxrwxrwx. 1 root root 0 May 16 11:51 c7c9eb01-0b77-4e67-abb7-1be881bcb16b -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.3/c7c9eb01-0b77-4e67-abb7-1be881bcb16b
lrwxrwxrwx. 1 root root 0 May 16 11:51 c3467c3b-10e6-41f4-a5e3-79703fed18da -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.2/c3467c3b-10e6-41f4-a5e3-79703fed18da
lrwxrwxrwx. 1 root root 0 May 16 11:51 4e33d704-d839-4df8-b7f5-fec6926f3917 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.0/4e33d704-d839-4df8-b7f5-fec6926f3917
lrwxrwxrwx. 1 root root 0 May 16 11:51 37c54020-e74f-4d91-9c17-edfd2f59dace -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.1/37c54020-e74f-4d91-9c17-edfd2f59dace
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
Temporary namespace openshift-debug-k76hh was removed.

[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node2.redhat.com
Temporary namespace openshift-debug-sthjh is created for debugging node...
Starting pod/node2redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 a52a1351-dc8b-49ce-8861-2d1625fbde64 -> ../../../devices/pci0000:c9/0000:c9:02.0/0000:ca:00.4/a52a1351-dc8b-49ce-8861-2d1625fbde64
lrwxrwxrwx. 1 root root 0 May 16 11:51 92f13c93-6a8c-492a-863f-da31a57d18ce -> ../../../devices/pci0000:c9/0000:c9:02.0/0000:ca:00.5/92f13c93-6a8c-492a-863f-da31a57d18ce
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
Temporary namespace openshift-debug-sthjh was removed.

[kbidarka@localhost nvidia-gpu-operator]$ oc -n openshift-cnv get kubevirt kubevirt-kubevirt-hyperconverged -o yaml | grep -A 7 permittedHostDevices
permittedHostDevices:
  mediatedDevices:
  - externalResourceProvider: true
    mdevNameSelector: NVIDIA A2-2Q
    resourceName: nvidia.com/GRID_A2_2Q

[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node2.redhat.com | grep nvidia
nvidia.com/NVIDIA_A30-12C: 2
nvidia.com/NVIDIA_A30-12C: 2
nvidia.com/NVIDIA_A30-12C 0 0
[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node3.redhat.com | grep nvidia
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q 0 0
[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node4.redhat.com | grep nvidia
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q 0 0
[kbidarka@localhost nvidia-gpu-operator]$
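The per-node check above amounts to counting the surviving mdev links and comparing that count with the capacity each node advertises (8, 8 and 2 here). A small sketch of the counting step follows, assuming the `ls -ltr /sys/bus/mdev/devices/` output has been captured as text; `parse_mdev_listing` and `NODE2_LISTING` are hypothetical names used only for illustration.

```python
import re

# Matches "<uuid> -> <symlink target>" as printed by `ls -l` for mdev device links.
LINK_RE = re.compile(
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}) -> (?P<target>\S+)"
)


def parse_mdev_listing(ls_output):
    """Map each mdev UUID to the PCI address of its parent device."""
    parents = {}
    for match in LINK_RE.finditer(ls_output):
        # The target ends in .../<parent PCI address>/<uuid>.
        components = match.group("target").split("/")
        parents[match.group("uuid")] = components[-2]
    return parents


# Listing captured on node2.redhat.com, copied from the transcript in this comment.
NODE2_LISTING = """\
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 a52a1351-dc8b-49ce-8861-2d1625fbde64 -> ../../../devices/pci0000:c9/0000:c9:02.0/0000:ca:00.4/a52a1351-dc8b-49ce-8861-2d1625fbde64
lrwxrwxrwx. 1 root root 0 May 16 11:51 92f13c93-6a8c-492a-863f-da31a57d18ce -> ../../../devices/pci0000:c9/0000:c9:02.0/0000:ca:00.5/92f13c93-6a8c-492a-863f-da31a57d18ce
"""

mdevs = parse_mdev_listing(NODE2_LISTING)
print(len(mdevs), mdevs)
```

For node2 this finds two devices, which agrees with the `nvidia.com/NVIDIA_A30-12C: 2` capacity reported by `oc describe node`. Because the pattern scans the whole text rather than line by line, it also tolerates listings whose lines were wrapped when pasted.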
There was a typo in resourceName, which I fixed:

[kbidarka@localhost nvidia-gpu-operator]$ oc -n openshift-cnv get kubevirt kubevirt-kubevirt-hyperconverged -o yaml | grep -A 7 permittedHostDevices
permittedHostDevices:
  mediatedDevices:
  - externalResourceProvider: true
    mdevNameSelector: NVIDIA A2-2Q
    resourceName: nvidia.com/NVIDIA_A2-2Q

[kbidarka@localhost watchdog]$ oc get vmi
NAME         AGE   PHASE     IP            NODENAME           READY
vm2-rhel87   32s   Running   10.xx.xx.xx   node3.redhat.com   True

[kbidarka@localhost watchdog]$ virtctl console vm2-rhel87
Successfully connected to vm2-rhel87 console. The escape sequence is ^]

Red Hat Enterprise Linux 8.7 (Ootpa)
Kernel 4.18.0-425.13.1.el8_7.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel87 login: cloud-user
Password:
[cloud-user@vm2-rhel87 ~]$ lspci -nnv | grep NVIDIA
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device [10de:1649]
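The typo was a mismatch between the configured resourceName (nvidia.com/GRID_A2_2Q) and the name the nodes actually advertise (nvidia.com/NVIDIA_A2-2Q). A check like the one below could catch this before starting a VM. It is only a sketch: `unadvertised_resources` is a hypothetical helper, and it assumes the three-line shape of the `oc describe node ... | grep nvidia` output shown in this report.

```python
def unadvertised_resources(configured, describe_output, prefix="nvidia.com/"):
    """Return configured resource names that the node output does not advertise.

    configured: resourceName values from permittedHostDevices.mediatedDevices.
    describe_output: text of `oc describe node <node> | grep nvidia`.
    """
    advertised = set()
    for line in describe_output.splitlines():
        parts = line.split()
        if parts and parts[0].startswith(prefix):
            # Capacity/Allocatable lines end the name with ':'; the
            # allocated-resources line does not.
            advertised.add(parts[0].rstrip(":"))
    return [name for name in configured if name not in advertised]


# Output for node3.redhat.com, copied from the earlier comment in this report.
NODE3_DESCRIBE = """\
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q: 8
nvidia.com/NVIDIA_A2-2Q 0 0
"""

print(unadvertised_resources(["nvidia.com/GRID_A2_2Q"], NODE3_DESCRIBE))
print(unadvertised_resources(["nvidia.com/NVIDIA_A2-2Q"], NODE3_DESCRIBE))
```

The first call reports the mistyped name as missing; the second, using the corrected name, returns an empty list.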
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-8frk4 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-f8nrq | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-fbnbj | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C

[kbidarka@localhost watchdog]$ oc get vmi
NAME         AGE   PHASE     IP            NODENAME            READY
vm2-rhel87   11m   Running   10.xx.xx.xx   node03.redhat.com   True

---

The "Successfully removed mdev" message no longer appears in the logs of any virt-handler pod, and the VMI is still running after 11 minutes.
Moving this bug to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.12.6 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:4982