Bug 2169880
Summary: | virt-handler should not delete any pre-configured mediated devices if these are provided by an external provider | |
---|---|---|---
Product: | Container Native Virtualization (CNV) | Reporter: | Vladik Romanovsky <vromanso>
Component: | Virtualization | Assignee: | Vladik Romanovsky <vromanso>
Status: | CLOSED ERRATA | QA Contact: | Kedar Bidarkar <kbidarka>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.12.0 | CC: | acardace, cdesiniotis, fdeutsch
Target Milestone: | --- | |
Target Release: | 4.13.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | hco-bundle-registry-container-v4.13.0.rhel9-1671 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-05-18 02:57:49 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | the default vgpu configMap (attachment 1956587); updated the config map to the new config format (attachment 1956622) | |
Description
Vladik Romanovsky
2023-02-14 23:07:16 UTC
1) Updated the feature gate (FG):

]$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{ "op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration" }]'

2) Tried configuring the NVIDIA GPU Operator, but hit an issue:

[kbidarka@localhost nvidia-gpu-operator]$ oc get pods
NAME                                                        READY   STATUS     RESTARTS   AGE
gpu-operator-db6888c55-qsz64                                1/1     Running    0          4m50s
nvidia-sandbox-device-plugin-daemonset-grtjn                1/1     Running    0          3m57s
nvidia-sandbox-device-plugin-daemonset-jhnn8                0/1     Init:1/2   0          68s
nvidia-sandbox-device-plugin-daemonset-zsmxr                0/1     Init:1/2   0          60s
nvidia-sandbox-validator-8fvp7                              0/1     Init:2/3   0          68s
nvidia-sandbox-validator-kvsx4                              1/1     Running    0          3m57s
nvidia-sandbox-validator-p5r7c                              0/1     Init:2/3   0          60s
nvidia-vfio-manager-k6ltm                                   1/1     Running    0          4m32s
nvidia-vgpu-device-manager-f8992                            1/1     Running    0          60s
nvidia-vgpu-device-manager-kjhhf                            1/1     Running    0          68s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-gzpkh   2/2     Running    0          102s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-hr82g   2/2     Running    0          94s

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f nvidia-vgpu-device-manager-f8992
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0404 15:20:47.570022 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-04-04T15:20:47Z" level=info msg="Updating to vGPU config: A2-2Q"
time="2023-04-04T15:20:47Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-04-04T15:20:47Z" level=fatal msg="error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal string into Go value of type map[string]json.RawMessage"
time="2023-04-04T15:20:47Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-04-04T15:20:47Z" level=error msg="ERROR: Unable to validate the selected vGPU configuration"
time="2023-04-04T15:20:47Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"

3) We see the above issue when trying to configure the vGPU with the below config:

sandboxDevicePlugin:
  enabled: true
vgpuDeviceManager:
  enabled: true
vfioManager:
  enabled: true

4) Have also added the below to the HCO CR (see the full HyperConverged CR sketch below):

permittedHostDevices:
  mediatedDevices:
  - externalResourceProvider: true
    mdevNameSelector: NVIDIA A2-2Q
    resourceName: nvidia.com/NVIDIA_A2-2Q

5) See the below label set on the nodes:

nvidia.com/vgpu.config.state=failed

---

What version of gpu-operator and vgpu-device-manager are being used?

---

gpu-operator: gpu-operator-certified.v23.3.0

Assuming the version of vgpu-device-manager is the container image info, "NVIDIA vGPU Device Manager Image": nvcr.io/nvidia/cloud-native/vgpu-device-manager@sha256:2d7e32e3d30c2415b4eb0b48ff4ce5a4ccabaf69ede0486305ed51d26cab7713

---

Can you get the contents of the "default-vgpu-devices-config" ConfigMap in the nvidia-gpu-operator namespace? This is a large config, so just a short snippet will suffice.

---

Created attachment 1956587 [details]
the default vgpu configMap
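For reference, here is a minimal sketch of how the permittedHostDevices section from step 4 above sits in the HyperConverged CR. The resource name and namespace (kubevirt-hyperconverged in openshift-cnv) are taken from the oc annotate command above; the apiVersion shown is the usual hco.kubevirt.io/v1beta1 and should be confirmed against the cluster. The key setting is externalResourceProvider: true, which marks the mdevs as owned by an external provider (here the NVIDIA GPU Operator) so that virt-handler must not delete them.

```yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "NVIDIA A2-2Q"        # mdev type created by the NVIDIA vGPU device manager
      resourceName: nvidia.com/NVIDIA_A2-2Q   # resource advertised by the NVIDIA sandbox device plugin
      externalResourceProvider: true          # mdevs are managed externally; virt-handler must not remove them
```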
Thanks, I got the clue looking at https://github.com/NVIDIA/vgpu-device-manager/blob/main/examples/config-example.yaml. I see "Update example config with new config file format"; will test with this new format.

I have now updated the vgpu-config as per the new config file format, and all the pods are now in the Running state, as seen below:

]$ oc get pods
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-79cfb96c95-j455p                               1/1     Running   0          6m24s
nvidia-sandbox-device-plugin-daemonset-2l8nl                1/1     Running   0          4m27s
nvidia-sandbox-device-plugin-daemonset-6wnwf                1/1     Running   0          4m33s
nvidia-sandbox-device-plugin-daemonset-ww9dp                1/1     Running   0          4m27s
nvidia-sandbox-validator-2f2dg                              1/1     Running   0          4m27s
nvidia-sandbox-validator-bjknt                              1/1     Running   0          4m27s
nvidia-sandbox-validator-d4n92                              1/1     Running   0          4m33s
nvidia-vgpu-device-manager-czw76                            1/1     Running   0          5m37s
nvidia-vgpu-device-manager-rhkpw                            1/1     Running   0          5m36s
nvidia-vgpu-device-manager-zv6xl                            1/1     Running   0          5m37s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-5nlrf   2/2     Running   0          6m12s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-nprpx   2/2     Running   0          6m12s
nvidia-vgpu-manager-daemonset-413.92.202303281804-0-r5gvg   2/2     Running   0          6m12s

Will continue with the testing.

Created attachment 1956622 [details]
updated the config map to the new config format
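For context, the shape of the new vgpu-device-manager config file format referenced above is roughly as follows. This is a sketch based on the linked config-example.yaml; the field names should be checked against that example, and the A2-2Q entry and count are illustrative assumptions, not taken from the attached ConfigMap. The change that matters for the unmarshal error above is that each entry under a named config is now a mapping (devices plus vgpu-devices) rather than a plain string.

```yaml
version: v1
vgpu-configs:
  A2-2Q:                 # named config, selected via the nvidia.com/vgpu.config node label
  - devices: all         # apply to all supported physical GPUs on the node
    vgpu-devices:
      "A2-2Q": 8         # number of A2-2Q vGPU devices to create per GPU (count is illustrative)
```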
1) Updated the feature gate (FG):

]$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{ "op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration" }]'

2) Mdev devices are no longer removed, as seen from the virt-handler logs (the grep below finds no removal messages in any of the three virt-handler pods):

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-94vw2 | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-xgvdf | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

[kbidarka@localhost nvidia-gpu-operator]$ oc logs -f virt-handler-nzhtd | grep "Successfully removed mdev"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)

---

3) The mdev devices persist over a long period (an optional node-status spot-check is sketched at the end of this report).

Summary: virt-handler no longer deletes any pre-configured mediated device.

Thanks, Chris, for the help on verifying this bug; it helped us quickly identify the issue.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205
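As an optional spot-check that is not part of the report above, the externally provided mdevs can also be confirmed from the node status: a node that still holds the devices keeps advertising the resource name configured in the HCO CR. The snippet below is an illustrative excerpt of a node's status; the node name is omitted and the counts are hypothetical.

```yaml
# Excerpt from the node object (for example via `oc get node <node> -o yaml`); values are illustrative
status:
  capacity:
    nvidia.com/NVIDIA_A2-2Q: "8"
  allocatable:
    nvidia.com/NVIDIA_A2-2Q: "8"
```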