Bug 2184435 - [cnv-4.12] virt-handler should not delete any pre-configured mediated devices if these are provided by an external provider
Summary: [cnv-4.12] virt-handler should not delete any pre-configured mediated devices...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.12.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.3
Assignee: Antonio Cardace
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks: 2184440
Reported: 2023-04-04 16:51 UTC by Kedar Bidarkar
Modified: 2023-09-05 16:30 UTC (History)

Fixed In Version: hco-bundle-registry-container-v4.12.3-70
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-05 16:29:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 9690 0 None open [release-0.58] device manager: externally provided mediated devices should not be removed 2023-05-03 15:11:14 UTC
Red Hat Issue Tracker CNV-27888 0 None None None 2023-04-04 16:54:05 UTC
Red Hat Product Errata RHSA-2023:4982 0 None None None 2023-09-05 16:30:24 UTC

Description Kedar Bidarkar 2023-04-04 16:51:50 UTC
This bug was initially created as a copy of Bug #2169880

I am copying this bug because: 



Description of problem:


Virt-handler deletes any pre-configured mediated device even if nothing is configured under spec.configuration.mediatedDevicesConfiguration.

On a default installation of OCP Virt 4.12, the virt-handler pod is deleting any mdev device that is created on the system.

This is reproducible with an empty permittedHostDevices configuration:
permittedHostDevices: {}
and with the following config where externalResourceProvider=true is explicitly set for the mdev device.

permittedHostDevices:
  mediatedDevices:
  - externalResourceProvider: true
    mdevNameSelector: NVIDIA A10-24Q
    resourceName: nvidia.com/NVIDIA_A10-24Q


Consider the following pre-configured mdev (vGPU) devices:

 

[core@cnt-a100-bm ~]$ ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 Feb  8 18:46 63b0b313-a62f-4475-b274-c26dd7defbcd -> ../../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.4/63b0b313-a62f-4475-b274-c26dd7defbcd
lrwxrwxrwx. 1 root root 0 Feb  8 18:46 203276d5-ac06-4585-baf3-ff16e119d634 -> ../../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.5/203276d5-ac06-4585-baf3-ff16e119d634
lrwxrwxrwx. 1 root root 0 Feb  8 18:46 f4fd3e66-f062-45dd-8ec0-39a7a2201490 -> ../../../devices/pci0000:d7/0000:d7:00.0/0000:d8:00.5/f4fd3e66-f062-45dd-8ec0-39a7a2201490
lrwxrwxrwx. 1 root root 0 Feb  8 18:46 2bd56759-c812-4784-9aea-3d9df23d15d3 -> ../../../devices/pci0000:d7/0000:d7:00.0/0000:d8:00.4/2bd56759-c812-4784-9aea-3d9df23d15d3
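
To compare the number of mediated devices before and after virt-handler refreshes its device plugins, the listing above can be reduced to a count. This is a minimal shell sketch: `count_mdevs` is a hypothetical helper name, not part of any tool mentioned in this report, and it assumes it runs on the node itself (for example after `chroot /host` in a debug pod).

```shell
# Hypothetical helper: count entries under the mdev sysfs directory.
# The optional first argument overrides the directory (useful for testing).
count_mdevs() {
  dir=${1:-/sys/bus/mdev/devices}
  n=0
  for entry in "$dir"/*; do
    # Each mdev appears as a symlink named after its UUID. An unmatched
    # glob stays literal and fails both tests, so an empty directory counts 0.
    if [ -e "$entry" ] || [ -L "$entry" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

On an affected node, running `count_mdevs` before and after the refresh would be expected to show the count dropping to 0.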

 

The devices get deleted by virt-handler shortly after:

 

[core@cnt-a100-bm ~]$ ls -ltr /sys/bus/mdev/devices/
total 0
 
[core@cnt-a100-bm ~]$ oc logs -n openshift-cnv virt-handler-fp426
. . .
{"component":"virt-handler","level":"info","msg":"resyncing virt-launcher domains","pos":"cache.go:385","timestamp":"2023-02-08T18:44:44.363329Z"}
{"component":"virt-handler","level":"info","msg":"refreshed device plugins for permitted/forbidden host devices","pos":"device_controller.go:320","timestamp":"2023-02-08T18:45:07.702002Z"}
{"component":"virt-handler","level":"info","msg":"enabled device-plugins for: []","pos":"device_controller.go:321","timestamp":"2023-02-08T18:45:07.702053Z"}
{"component":"virt-handler","level":"info","msg":"disabled device-plugins for: []","pos":"device_controller.go:322","timestamp":"2023-02-08T18:45:07.702064Z"}
{"component":"virt-handler","level":"info","msg":"Successfully removed mdev 203276d5-ac06-4585-baf3-ff16e119d634","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.327329Z"}
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 203276d5-ac06-4585-baf3-ff16e119d634","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.327396Z"}
{"component":"virt-handler","level":"info","msg":"Successfully removed mdev 2bd56759-c812-4784-9aea-3d9df23d15d3","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.375248Z"}
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 2bd56759-c812-4784-9aea-3d9df23d15d3","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.375311Z"}
{"component":"virt-handler","level":"info","msg":"Successfully removed mdev 63b0b313-a62f-4475-b274-c26dd7defbcd","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.411902Z"}
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 63b0b313-a62f-4475-b274-c26dd7defbcd","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.411965Z"}
{"component":"virt-handler","level":"info","msg":"Successfully removed mdev f4fd3e66-f062-45dd-8ec0-39a7a2201490","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.444875Z"}
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: f4fd3e66-f062-45dd-8ec0-39a7a2201490","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.444917Z"}
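
The removed devices can be pulled out of the log stream mechanically. This is a minimal sketch that assumes the JSON log format shown above; `removed_mdevs` is a hypothetical helper name.

```shell
# Hypothetical helper: print the UUID from every
# "Successfully removed mdev <uuid>" message read on stdin.
removed_mdevs() {
  awk 'match($0, /Successfully removed mdev [0-9a-f-]+/) {
    # Skip the 26-character "Successfully removed mdev " prefix.
    print substr($0, RSTART + 26, RLENGTH - 26)
  }'
}

# Example (pod name taken from the log excerpt above):
#   oc logs -n openshift-cnv virt-handler-fp426 | removed_mdevs
```

Against the excerpt above this prints the four removed device UUIDs, one per line; the "failed to remove mdev type" warnings are ignored because they do not match the pattern.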
 




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Antonio Cardace 2023-05-03 15:11:14 UTC
Created manual backport at https://github.com/kubevirt/kubevirt/pull/9690.

Comment 3 Kedar Bidarkar 2023-05-16 12:19:06 UTC
[kbidarka@localhost nvidia-gpu-operator]$ oc get pods
NAME                                                   READY   STATUS    RESTARTS        AGE
...
virt-handler-8frk4                                     1/1     Running   0               30m
virt-handler-f8nrq                                     1/1     Running   0               30m
virt-handler-fbnbj                                     1/1     Running   0               30m
...
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-8frk4 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-f8nrq | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-fbnbj | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node3.redhat.com
Temporary namespace openshift-debug-9xf2l is created for debugging node...
Starting pod/node3redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.5
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 f51d8e5d-158f-4eac-88c5-43e6cf353cd9 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.7/f51d8e5d-158f-4eac-88c5-43e6cf353cd9
lrwxrwxrwx. 1 root root 0 May 16 11:51 e13804af-d995-4e91-992c-c26250270d23 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.5/e13804af-d995-4e91-992c-c26250270d23
lrwxrwxrwx. 1 root root 0 May 16 11:51 ac3ed710-e370-4868-9a74-f89f9dca195f -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.4/ac3ed710-e370-4868-9a74-f89f9dca195f
lrwxrwxrwx. 1 root root 0 May 16 11:51 098c800c-a8c0-4973-8f98-3b713b6b385a -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.6/098c800c-a8c0-4973-8f98-3b713b6b385a
lrwxrwxrwx. 1 root root 0 May 16 11:51 f363a09b-cf41-446c-99f5-2c121d2c9558 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.0/f363a09b-cf41-446c-99f5-2c121d2c9558
lrwxrwxrwx. 1 root root 0 May 16 11:51 d358956e-2ae4-42fd-b937-9812b9c98512 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.1/d358956e-2ae4-42fd-b937-9812b9c98512
lrwxrwxrwx. 1 root root 0 May 16 11:51 bdd162c6-4c17-46d1-a757-8844503e16d4 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.3/bdd162c6-4c17-46d1-a757-8844503e16d4
lrwxrwxrwx. 1 root root 0 May 16 11:51 18600cc8-d85f-4c0e-b14f-bac37fbce62b -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.2/18600cc8-d85f-4c0e-b14f-bac37fbce62b
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
Temporary namespace openshift-debug-9xf2l was removed.
[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node4.redhat.com
Temporary namespace openshift-debug-k76hh is created for debugging node...
Starting pod/node4redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 9e508fa5-0656-4a1c-9aad-59adfaa1cd01 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.4/9e508fa5-0656-4a1c-9aad-59adfaa1cd01
lrwxrwxrwx. 1 root root 0 May 16 11:51 927d2525-16aa-4da3-a2a2-dce06c6c9e22 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.5/927d2525-16aa-4da3-a2a2-dce06c6c9e22
lrwxrwxrwx. 1 root root 0 May 16 11:51 5423342f-f87f-45bf-9a87-2fde8de914b8 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.7/5423342f-f87f-45bf-9a87-2fde8de914b8
lrwxrwxrwx. 1 root root 0 May 16 11:51 090780e8-3648-4e57-a95d-f10ab6dfcc5c -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:00.6/090780e8-3648-4e57-a95d-f10ab6dfcc5c
lrwxrwxrwx. 1 root root 0 May 16 11:51 c7c9eb01-0b77-4e67-abb7-1be881bcb16b -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.3/c7c9eb01-0b77-4e67-abb7-1be881bcb16b
lrwxrwxrwx. 1 root root 0 May 16 11:51 c3467c3b-10e6-41f4-a5e3-79703fed18da -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.2/c3467c3b-10e6-41f4-a5e3-79703fed18da
lrwxrwxrwx. 1 root root 0 May 16 11:51 4e33d704-d839-4df8-b7f5-fec6926f3917 -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.0/4e33d704-d839-4df8-b7f5-fec6926f3917
lrwxrwxrwx. 1 root root 0 May 16 11:51 37c54020-e74f-4d91-9c17-edfd2f59dace -> ../../../devices/pci0000:4a/0000:4a:02.0/0000:4b:01.1/37c54020-e74f-4d91-9c17-edfd2f59dace
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
Temporary namespace openshift-debug-k76hh was removed.
[kbidarka@localhost nvidia-gpu-operator]$ oc debug node/node2.redhat.com
Temporary namespace openshift-debug-sthjh is created for debugging node...
Starting pod/node2redhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.10.133.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 May 16 11:51 a52a1351-dc8b-49ce-8861-2d1625fbde64 -> ../../../devices/pci0000:c9/0000:c9:02.0/0000:ca:00.4/a52a1351-dc8b-49ce-8861-2d1625fbde64
lrwxrwxrwx. 1 root root 0 May 16 11:51 92f13c93-6a8c-492a-863f-da31a57d18ce -> ../../../devices/pci0000:c9/0000:c9:02.0/0000:ca:00.5/92f13c93-6a8c-492a-863f-da31a57d18ce
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
Temporary namespace openshift-debug-sthjh was removed.
[kbidarka@localhost nvidia-gpu-operator]$ oc -n openshift-cnv get kubevirt kubevirt-kubevirt-hyperconverged -o yaml | grep -A 7 permittedHostDevices
    permittedHostDevices:
      mediatedDevices:
      - externalResourceProvider: true
        mdevNameSelector: NVIDIA A2-2Q
        resourceName: nvidia.com/GRID_A2_2Q

[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node2.redhat.com | grep nvidia 
                    
  nvidia.com/NVIDIA_A30-12C:      2
  nvidia.com/NVIDIA_A30-12C:      2
  nvidia.com/NVIDIA_A30-12C      0              0
[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node3.redhat.com | grep nvidia 
                    
  nvidia.com/NVIDIA_A2-2Q:        8
  nvidia.com/NVIDIA_A2-2Q:        8
 
  nvidia.com/NVIDIA_A2-2Q        0              0
[kbidarka@localhost nvidia-gpu-operator]$ oc describe node node4.redhat.com | grep nvidia 
                    
  nvidia.com/NVIDIA_A2-2Q:        8
  nvidia.com/NVIDIA_A2-2Q:        8
  
  nvidia.com/NVIDIA_A2-2Q        0              0
[kbidarka@localhost nvidia-gpu-operator]$

Comment 4 Kedar Bidarkar 2023-05-16 12:47:12 UTC
There was a typo in resourceName (nvidia.com/GRID_A2_2Q instead of nvidia.com/NVIDIA_A2-2Q), which I fixed:

[kbidarka@localhost nvidia-gpu-operator]$ oc -n openshift-cnv get kubevirt kubevirt-kubevirt-hyperconverged -o yaml | grep -A 7 permittedHostDevices
    permittedHostDevices:
      mediatedDevices:
      - externalResourceProvider: true
        mdevNameSelector: NVIDIA A2-2Q
        resourceName: nvidia.com/NVIDIA_A2-2Q
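
A mismatch between the configured resourceName and the resource the node actually advertises can be checked against the `oc describe node` output. This is a minimal sketch: `resource_advertised` is a hypothetical helper name, and it does a plain substring match rather than parsing the node object.

```shell
# Hypothetical helper: succeed if the resource name given as $1 appears
# as a "<name>:" entry in `oc describe node` output read on stdin.
# This is a fixed-string substring check, not a full parse.
resource_advertised() {
  grep -F -q -- "$1:"
}

# Example (node and resource names taken from the comments above):
#   oc describe node node3.redhat.com | resource_advertised nvidia.com/NVIDIA_A2-2Q
```

The function returns a non-zero exit status when the name is not advertised, so it can be used directly in an `if` condition.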


[kbidarka@localhost watchdog]$ oc get vmi 
NAME         AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel87   32s   Running   10.xx.xx.xx   node3.redhat.com   True

[kbidarka@localhost watchdog]$ virtctl console vm2-rhel87
Successfully connected to vm2-rhel87 console. The escape sequence is ^]

Red Hat Enterprise Linux 8.7 (Ootpa)
Kernel 4.18.0-425.13.1.el8_7.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel87 login: cloud-user
Password:

[cloud-user@vm2-rhel87 ~]$ lspci -nnv | grep NVIDIA 
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device [10de:1649]

Comment 5 Kedar Bidarkar 2023-05-16 12:53:59 UTC
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-8frk4 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-f8nrq | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost watchdog]$ oc logs -f virt-handler-fbnbj | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
^C
[kbidarka@localhost watchdog]$ oc get vmi 
NAME         AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel87   11m   Running   10.xx.xx.xx   node3.redhat.com   True

---

We no longer see the "Successfully removed mdev" message in the virt-handler pod logs.

Moving this bug to VERIFIED state.

Comment 11 errata-xmlrpc 2023-09-05 16:29:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.12.6 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:4982

