Description of problem:

Virt-handler deletes any pre-configured mediated device, even if nothing is configured under spec.configuration.mediatedDevicesConfiguration. On a default installation of OCP Virt 4.12, the virt-handler pod deletes any mdev device that is created on the system.

This is reproducible with an empty permittedHostDevices configuration:

  permittedHostDevices: {}

and with the following config, where externalResourceProvider: true is explicitly set for the mdev device:

  permittedHostDevices:
    mediatedDevices:
    - externalResourceProvider: true
      mdevNameSelector: NVIDIA A10-24Q
      resourceName: nvidia.com/NVIDIA_A10-24Q

Consider the following pre-configured mdev (vGPU) devices:

[core@cnt-a100-bm ~]$ ls -ltr /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 Feb 8 18:46 63b0b313-a62f-4475-b274-c26dd7defbcd -> ../../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.4/63b0b313-a62f-4475-b274-c26dd7defbcd
lrwxrwxrwx. 1 root root 0 Feb 8 18:46 203276d5-ac06-4585-baf3-ff16e119d634 -> ../../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.5/203276d5-ac06-4585-baf3-ff16e119d634
lrwxrwxrwx. 1 root root 0 Feb 8 18:46 f4fd3e66-f062-45dd-8ec0-39a7a2201490 -> ../../../devices/pci0000:d7/0000:d7:00.0/0000:d8:00.5/f4fd3e66-f062-45dd-8ec0-39a7a2201490
lrwxrwxrwx. 1 root root 0 Feb 8 18:46 2bd56759-c812-4784-9aea-3d9df23d15d3 -> ../../../devices/pci0000:d7/0000:d7:00.0/0000:d8:00.4/2bd56759-c812-4784-9aea-3d9df23d15d3

The devices get deleted by virt-handler shortly after:

[core@cnt-a100-bm ~]$ ls -ltr /sys/bus/mdev/devices/
total 0

[core@cnt-a100-bm ~]$ oc logs -n openshift-cnv virt-handler-fp426
. . .
{"component":"virt-handler","level":"info","msg":"resyncing virt-launcher domains","pos":"cache.go:385","timestamp":"2023-02-08T18:44:44.363329Z"} {"component":"virt-handler","level":"info","msg":"refreshed device plugins for permitted/forbidden host devices","pos":"device_controller.go:320","timestamp":"2023-02-08T18:45:07.702002Z"} {"component":"virt-handler","level":"info","msg":"enabled device-plugins for: []","pos":"device_controller.go:321","timestamp":"2023-02-08T18:45:07.702053Z"} {"component":"virt-handler","level":"info","msg":"disabled device-plugins for: []","pos":"device_controller.go:322","timestamp":"2023-02-08T18:45:07.702064Z"} {"component":"virt-handler","level":"info","msg":"Successfully removed mdev 203276d5-ac06-4585-baf3-ff16e119d634","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.327329Z"} {"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 203276d5-ac06-4585-baf3-ff16e119d634","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.327396Z"} {"component":"virt-handler","level":"info","msg":"Successfully removed mdev 2bd56759-c812-4784-9aea-3d9df23d15d3","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.375248Z"} {"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 2bd56759-c812-4784-9aea-3d9df23d15d3","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.375311Z"} {"component":"virt-handler","level":"info","msg":"Successfully removed mdev 63b0b313-a62f-4475-b274-c26dd7defbcd","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.411902Z"} {"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: 63b0b313-a62f-4475-b274-c26dd7defbcd","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.411965Z"} {"component":"virt-handler","level":"info","msg":"Successfully removed mdev f4fd3e66-f062-45dd-8ec0-39a7a2201490","pos":"common.go:168","timestamp":"2023-02-08T18:47:06.444875Z"} 
{"component":"virt-handler","level":"warning","msg":"failed to remove mdev type: f4fd3e66-f062-45dd-8ec0-39a7a2201490","pos":"mediated_devices_types.go:270","timestamp":"2023-02-08T18:47:06.444917Z"} Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
1) Updated the FG:

]$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{ "op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration" }]'

2) Tried configuring the NVIDIA GPU Operator, but hit the issue below.

[kbidarka@localhost nvidia-gpu-operator]$ oc get pods
gpu-operator-db6888c55-qsz64                                1/1   Running    0   4m50s
nvidia-sandbox-device-plugin-daemonset-grtjn                1/1   Running    0   3m57s
nvidia-sandbox-device-plugin-daemonset-jhnn8                0/1   Init:1/2   0   68s
nvidia-sandbox-device-plugin-daemonset-zsmxr                0/1   Init:1/2   0   60s
nvidia-sandbox-validator-8fvp7                              0/1   Init:2/3   0   68s
nvidia-sandbox-validator-kvsx4                              1/1   Running    0   3m57s
nvidia-sandbox-validator-p5r7c                              0/1   Init:2/3   0   60s
nvidia-vfio-manager-k6ltm                                   1/1   Running    0   4m32s
nvidia-vgpu-device-manager-f8992                            1/1   Running    0   60s
nvidia-vgpu-device-manager-kjhhf                            1/1   Running    0   68s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-gzpkh   2/2   Running    0   102s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-hr82g   2/2   Running    0   94s

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f nvidia-vgpu-device-manager-f8992
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0404 15:20:47.570022 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-04-04T15:20:47Z" level=info msg="Updating to vGPU config: A2-2Q" time="2023-04-04T15:20:47Z" level=info msg="Asserting that the requested configuration is present in the configuration file" time="2023-04-04T15:20:47Z" level=fatal msg="error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal string into Go value of type map[string]json.RawMessage" time="2023-04-04T15:20:47Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'" time="2023-04-04T15:20:47Z" level=error msg="ERROR: Unable to validate the selected vGPU configuration" time="2023-04-04T15:20:47Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label" 3) We are seeing the above issue when trying to configure the vGPU with the below config. sandboxDevicePlugin: enabled: true vgpuDeviceManager: enabled: true vfioManager: enabled: true 4) Have also added the below to HCO CR permittedHostDevices: mediatedDevices: - externalResourceProvider: true mdevNameSelector: NVIDIA A2-2Q resourceName: nvidia.com/NVIDIA_A2-2Q 5) see the below label set to the nodes nvidia.com/vgpu.config.state=failed
What version of gpu-operator and vgpu-device-manager are being used?
gpu-operator: gpu-operator-certified.v23.3.0

Assuming the vgpu-device-manager version corresponds to the container image ("NVIDIA vGPU Device Manager Image"):
nvcr.io/nvidia/cloud-native/vgpu-device-manager@sha256:2d7e32e3d30c2415b4eb0b48ff4ce5a4ccabaf69ede0486305ed51d26cab7713
Can you get the contents of the "default-vgpu-devices-config" ConfigMap in the nvidia-gpu-operator namespace? This is a large config, so just a short snippet will suffice.
Created attachment 1956587 [details] the default vgpu configMap
Thanks, I got the clue from https://github.com/NVIDIA/vgpu-device-manager/blob/main/examples/config-example.yaml — I see the change "Update example config with new config file format". Will test with this new format.
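For anyone hitting the same unmarshal error: the fatal message ("cannot unmarshal string into Go value of type map[string]json.RawMessage") indicates the parser now expects each entry under a named vGPU config to be a mapping rather than a plain string. A rough sketch of the new shape, inferred from the error and the upstream example (the keys, selector, and count below are illustrative; take the authoritative layout from examples/config-example.yaml):

```yaml
version: v1
vgpu-configs:
  # New format (sketch): each list item under a named config is a mapping,
  # not a bare string like "- A2-2Q" (which triggers the unmarshal error).
  A2-2Q:
    - devices: all        # illustrative device selector
      vgpu-devices:
        A2-2Q: 1          # illustrative vGPU type -> count
```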
I have now updated the vgpu-config as per the new config file format, and now all the pods are in Running state as seen below.

]$ oc get pods
gpu-operator-79cfb96c95-j455p                               1/1   Running   0   6m24s
nvidia-sandbox-device-plugin-daemonset-2l8nl                1/1   Running   0   4m27s
nvidia-sandbox-device-plugin-daemonset-6wnwf                1/1   Running   0   4m33s
nvidia-sandbox-device-plugin-daemonset-ww9dp                1/1   Running   0   4m27s
nvidia-sandbox-validator-2f2dg                              1/1   Running   0   4m27s
nvidia-sandbox-validator-bjknt                              1/1   Running   0   4m27s
nvidia-sandbox-validator-d4n92                              1/1   Running   0   4m33s
nvidia-vgpu-device-manager-czw76                            1/1   Running   0   5m37s
nvidia-vgpu-device-manager-rhkpw                            1/1   Running   0   5m36s
nvidia-vgpu-device-manager-zv6xl                            1/1   Running   0   5m37s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-5nlrf   2/2   Running   0   6m12s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-nprpx   2/2   Running   0   6m12s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-r5gvg   2/2   Running   0   6m12s

Will continue with the testing.
Created attachment 1956622 [details] updated the config map to the new config format
1) Updated the FG:

]$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{ "op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration" }]'

2) Mdev devices are no longer removed, as seen from the virt-handler logs:

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-94vw2 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-xgvdf | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-nzhtd | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

---

3) The mdev devices persist over a long period.

Summary: virt-handler no longer deletes any pre-configured mediated device.
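The per-pod greps above can be run across all virt-handler pods in one go. A sketch (assumes the standard kubevirt.io=virt-handler pod label used by KubeVirt):

```shell
# Check every virt-handler pod for mdev removal messages.
for pod in $(oc get pods -n openshift-cnv -l kubevirt.io=virt-handler -o name); do
  echo "== $pod"
  oc logs -n openshift-cnv "$pod" -c virt-handler \
    | grep "Successfully removed mdev" || echo "   no mdev removals"
done
```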
Thanks, Chris, for the help with verifying this bug; we were able to quickly identify the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:3205