Bug 2169880
Summary: | virt-handler should not delete any pre-configured mediated devices if these are provided by an external provider | |
---|---|---|---
Product: | Container Native Virtualization (CNV) | Reporter: | Vladik Romanovsky <vromanso>
Component: | Virtualization | Assignee: | Vladik Romanovsky <vromanso>
Status: | CLOSED ERRATA | QA Contact: | Kedar Bidarkar <kbidarka>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.12.0 | CC: | acardace, cdesiniotis, fdeutsch
Target Milestone: | --- | |
Target Release: | 4.13.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | hco-bundle-registry-container-v4.13.0.rhel9-1671 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-05-18 02:57:49 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | the default vgpu configMap (attachment 1956587); updated the config map to the new config format (attachment 1956622) | |
Description
Vladik Romanovsky
2023-02-14 23:07:16 UTC
1) Updated the feature gate (FG):

]$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{ "op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration" }]'

2) Tried configuring the NVIDIA GPU Operator, but hit an issue:

[kbidarka@localhost nvidia-gpu-operator]$ oc get pods
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-db6888c55-qsz64                                1/1     Running    0          4m50s
nvidia-sandbox-device-plugin-daemonset-grtjn                1/1     Running    0          3m57s
nvidia-sandbox-device-plugin-daemonset-jhnn8                0/1     Init:1/2   0          68s
nvidia-sandbox-device-plugin-daemonset-zsmxr                0/1     Init:1/2   0          60s
nvidia-sandbox-validator-8fvp7                              0/1     Init:2/3   0          68s
nvidia-sandbox-validator-kvsx4                              1/1     Running    0          3m57s
nvidia-sandbox-validator-p5r7c                              0/1     Init:2/3   0          60s
nvidia-vfio-manager-k6ltm                                   1/1     Running    0          4m32s
nvidia-vgpu-device-manager-f8992                            1/1     Running    0          60s
nvidia-vgpu-device-manager-kjhhf                            1/1     Running    0          68s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-gzpkh   2/2     Running    0          102s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-hr82g   2/2     Running    0          94s

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f nvidia-vgpu-device-manager-f8992
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0404 15:20:47.570022 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-04-04T15:20:47Z" level=info msg="Updating to vGPU config: A2-2Q"
time="2023-04-04T15:20:47Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-04-04T15:20:47Z" level=fatal msg="error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal string into Go value of type map[string]json.RawMessage"
time="2023-04-04T15:20:47Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-04-04T15:20:47Z" level=error msg="ERROR: Unable to validate the selected vGPU configuration"
time="2023-04-04T15:20:47Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"

3) We see the above issue when trying to configure the vGPU with the below config:

sandboxDevicePlugin:
  enabled: true
vgpuDeviceManager:
  enabled: true
vfioManager:
  enabled: true

4) Have also added the below to the HCO CR (see the full HyperConverged CR sketch below):

permittedHostDevices:
  mediatedDevices:
  - externalResourceProvider: true
    mdevNameSelector: NVIDIA A2-2Q
    resourceName: nvidia.com/NVIDIA_A2-2Q

5) See the below label set on the nodes:

nvidia.com/vgpu.config.state=failed

---

What version of gpu-operator and vgpu-device-manager are being used?

---

gpu-operator: gpu-operator-certified.v23.3.0

Assuming the version of vgpu-device-manager is the container image info, "NVIDIA vGPU Device Manager Image": nvcr.io/nvidia/cloud-native/vgpu-device-manager@sha256:2d7e32e3d30c2415b4eb0b48ff4ce5a4ccabaf69ede0486305ed51d26cab7713

---

Can you get the contents of the "default-vgpu-devices-config" ConfigMap in the nvidia-gpu-operator namespace? This is a large config, so just a short snippet will suffice.

---

Created attachment 1956587 [details]
the default vgpu configMap
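For reference, here is a minimal sketch of how the permittedHostDevices section from step 4 above sits in the HyperConverged CR. The resource name and namespace (kubevirt-hyperconverged in openshift-cnv) are taken from the oc annotate command above; the apiVersion shown is the usual hco.kubevirt.io/v1beta1 and should be confirmed against the cluster. The key setting is externalResourceProvider: true, which marks the mdevs as owned by an external provider (here the NVIDIA GPU Operator) so that virt-handler must not delete them.

```yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "NVIDIA A2-2Q"        # mdev type created by the NVIDIA vGPU device manager
      resourceName: nvidia.com/NVIDIA_A2-2Q   # resource advertised by the NVIDIA sandbox device plugin
      externalResourceProvider: true          # mdevs are managed externally; virt-handler must not remove them
```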
Thanks, I got the clue looking at https://github.com/NVIDIA/vgpu-device-manager/blob/main/examples/config-example.yaml. I see "Update example config with new config file format"; will test with this new format.

I have now updated the vgpu-config as per the new config file format, and all the pods are now in the Running state, as seen below:

]$ oc get pods
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-79cfb96c95-j455p                               1/1     Running   0          6m24s
nvidia-sandbox-device-plugin-daemonset-2l8nl                1/1     Running   0          4m27s
nvidia-sandbox-device-plugin-daemonset-6wnwf                1/1     Running   0          4m33s
nvidia-sandbox-device-plugin-daemonset-ww9dp                1/1     Running   0          4m27s
nvidia-sandbox-validator-2f2dg                              1/1     Running   0          4m27s
nvidia-sandbox-validator-bjknt                              1/1     Running   0          4m27s
nvidia-sandbox-validator-d4n92                              1/1     Running   0          4m33s
nvidia-vgpu-device-manager-czw76                            1/1     Running   0          5m37s
nvidia-vgpu-device-manager-rhkpw                            1/1     Running   0          5m36s
nvidia-vgpu-device-manager-zv6xl                            1/1     Running   0          5m37s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-5nlrf   2/2     Running   0          6m12s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-nprpx   2/2     Running   0          6m12s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-r5gvg   2/2     Running   0          6m12s

Will continue with the testing.

Created attachment 1956622 [details]
updated the config map to the new config format
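For context, the shape of the new vgpu-device-manager config file format referenced above is roughly as follows. This is a sketch based on the linked config-example.yaml; the field names should be checked against that example, and the A2-2Q entry and count are illustrative assumptions, not taken from the attached ConfigMap. The change that matters for the unmarshal error above is that each entry under a named config is now a mapping (devices plus vgpu-devices) rather than a plain string.

```yaml
version: v1
vgpu-configs:
  A2-2Q:                 # named config, selected via the nvidia.com/vgpu.config node label
  - devices: all         # apply to all supported physical GPUs on the node
    vgpu-devices:
      "A2-2Q": 8         # number of A2-2Q vGPU devices to create per GPU (count is illustrative)
```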
1) Updated the feature gate (FG):

]$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{ "op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration" }]'

2) Mdev devices are no longer removed, as seen from the virt-handler logs (the grep below finds no removal messages in any of the three virt-handler pods):

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-94vw2 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-xgvdf | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-nzhtd | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

---

3) The mdev devices persist over a long period (an optional node-status spot-check is sketched at the end of this report).

Summary: virt-handler no longer deletes any pre-configured mediated device.

Thanks, Chris, for the help on verifying this bug; it helped us quickly identify the issue.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205
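As an optional spot-check that is not part of the report above, the externally provided mdevs can also be confirmed from the node status: a node that still holds the devices keeps advertising the resource name configured in the HCO CR. The snippet below is an illustrative excerpt of a node's status; the node name is omitted and the counts are hypothetical.

```yaml
# Excerpt from the node object (for example via `oc get node <node> -o yaml`); values are illustrative
status:
  capacity:
    nvidia.com/NVIDIA_A2-2Q: "8"
  allocatable:
    nvidia.com/NVIDIA_A2-2Q: "8"
```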