Bug 2142768

Summary: Failed to create VM with vGPU - Hit error "badly formed hexadecimal UUID string"
Product: Red Hat OpenStack
Reporter: chhu
Component: openstack-nova
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED DUPLICATE
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: high
Docs Contact:
Priority: unspecified
Version: 17.0 (Wallaby)
CC: dasmith, eglynn, jhakimra, kchamart, sbauza, sgordon, vromanso
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: libvirt_OSP_INT
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-16 16:22:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description chhu 2022-11-15 08:35:55 UTC
Description of problem:
Creating a VM with a vGPU fails. nova/virt/libvirt/utils.py raises:
"ValueError: badly formed hexadecimal UUID string"

Version-Release number of selected component (if applicable):
rhosp17-openstack-nova-compute:17.0_20220908.1
python3-nova-23.2.2-0.20220720130412.7074ac0.el9ost.noarch
python3-novaclient-17.4.0-0.20210812172018.54d4da1.el9ost.noarch
openstack-nova-common-23.2.2-0.20220720130412.7074ac0.el9ost.noarch
openstack-nova-compute-23.2.2-0.20220720130412.7074ac0.el9ost.noarch
openstack-nova-migration-23.2.2-0.20220720130412.7074ac0.el9ost.noarch

The libvirt build in use includes the fix for Bug 2109450 - libvirt doesn't catch mdevs created thru sysfs

How reproducible:
100%

Steps to Reproduce:
1. Prepare the vGPU environment on OSP17.0
(undercloud) [stack@dell-per740-66 ~]$ ssh heat-admin.24.10
[heat-admin@compute-0 ~]$ lspci|grep VGA
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3d:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
3e:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)

[heat-admin@compute-0 ~]$ cat /sys/class/mdev_bus/0000\:3d\:00.0/mdev_supported_types/nvidia-22/name 
GRID M60-8Q
[heat-admin@compute-0 ~]$ uuid=$(uuidgen)
[heat-admin@compute-0 ~]$ cd /sys/class/mdev_bus/0000:3d:00.0/mdev_supported_types/nvidia-22
[heat-admin@compute-0 nvidia-22]$ sudo chmod 666 create
[heat-admin@compute-0 nvidia-22]$ sudo echo $uuid
b81a2fb4-1bcf-45b0-b61e-efba7f35b161
[heat-admin@compute-0 nvidia-22]$ sudo echo $uuid > create
[heat-admin@compute-0 nvidia-22]$ cd ../../
[heat-admin@compute-0 0000:3d:00.0]$ ls
b81a2fb4-1bcf-45b0-b61e-efba7f35b161  d3cold_allowed            iommu            mdev_supported_types  rescan        resource3_wc

2. In the nova_virtqemud container, check that the mdev is present in the list of node devices
[heat-admin@compute-0 ~]$ sudo podman exec -it nova_virtqemud python3
Python 3.9.10 (main, Feb  9 2022, 00:00:00) 
[GCC 11.2.1 20220127 (Red Hat 11.2.1-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import libvirt
>>> conn = libvirt.open('qemu:///system')
>>> conn.listDevices('mdev')
['mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161_0000_3d_00_0']
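
The name returned above appears to follow the pattern mdev_<uuid with underscores>_<parent PCI address with underscores>. The following is a minimal sketch of splitting such a name into its parts, assuming that pattern holds (it is inferred from this host's output only). The helper name split_mdev_name and the regular expression are hypothetical and are not part of nova or libvirt.

```python
import re
import uuid
from typing import Optional, Tuple

# Assumed layout: "mdev_" + UUID with underscores, optionally followed by
# "_<domain>_<bus>_<slot>_<function>" for the parent PCI device.
_MDEV_NAME_RE = re.compile(
    r"^mdev_"
    r"(?P<uuid>[0-9a-fA-F]{8}(?:_[0-9a-fA-F]{4}){3}_[0-9a-fA-F]{12})"
    r"(?:_(?P<pci>[0-9a-fA-F]{4}_[0-9a-fA-F]{2}_[0-9a-fA-F]{2}_[0-7]))?$"
)


def split_mdev_name(name: str) -> Tuple[str, Optional[str]]:
    """Return (uuid, parent PCI address or None) for a node device name."""
    match = _MDEV_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized mdev node device name: {name!r}")
    # Validate and normalize the UUID part to the 8-4-4-4-12 form.
    mdev_uuid = str(uuid.UUID(match.group("uuid").replace("_", "-")))
    pci = match.group("pci")
    if pci is not None:
        domain, bus, slot, function = pci.split("_")
        pci = f"{domain}:{bus}:{slot}.{function}"
    return mdev_uuid, pci


print(split_mdev_name("mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161_0000_3d_00_0"))
```

For the name listed in this step, the sketch yields the UUID written to the sysfs create file in step 1 and the PCI address 0000:3d:00.0 of the parent device.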

3. Try to start a VM with the vGPU device; the server ends up in ERROR status
(overcloud) [stack@dell-per740-66 ~]$ openstack flavor create --vcpus 6 --ram 4196 --disk 20 m2
(overcloud) [stack@dell-per740-66 ~]$ openstack flavor set m2 --property "resources:VGPU=1"
(overcloud) [stack@dell-per740-66 ~]$ openstack network create default
(overcloud) [stack@dell-per740-66 ~]$ openstack network list
+--------------------------------------+---------+--------------------------------------+
| ID                                   | Name    | Subnets                              |
+--------------------------------------+---------+--------------------------------------+
| 1dba36ee-d473-4354-a5ab-d6c7b6e0e666 | default | 2d6a822b-a511-47c2-918e-37ee947c0a8d |
+--------------------------------------+---------+--------------------------------------+

(overcloud) [stack@dell-per740-66 ~]$ openstack subnet create default --network default --gateway 192.168.32.1 --subnet-range 192.168.32.0/24
(overcloud) [stack@dell-per740-66 ~]$ openstack image create r9-qcow2 --disk-format qcow2 --container-format bare --file RHEL-9.0-x86_64-latest.qcow2
(overcloud) [stack@dell-per740-66 ~]$ openstack volume create r9-qcow2-vol --size 20 --image r9-qcow2
(overcloud) [stack@dell-per740-66 ~]$ openstack volume list
+--------------------------------------+--------------+-----------+------+-------------+
| ID                                   | Name         | Status    | Size | Attached to |
+--------------------------------------+--------------+-----------+------+-------------+
| bc0c5b4d-40ff-490d-a186-a7450b53c85e | r9-qcow2-vol | available |   20 |             |
+--------------------------------------+--------------+-----------+------+-------------+
(overcloud) [stack@dell-per740-66 ~]$ openstack server create --flavor m2 --volume r9-qcow2-vol --nic net-id=1dba36ee-d473-4354-a5ab-d6c7b6e0e666 vm-r9-vol

(overcloud) [stack@dell-per740-66 ~]$ openstack server list
+--------------------------------------+-----------+--------+----------+--------------------------+--------+
| ID                                   | Name      | Status | Networks | Image                    | Flavor |
+--------------------------------------+-----------+--------+----------+--------------------------+--------+
| ff53ee99-69cc-4633-9434-fdf426a45929 | vm-r9-vol | ERROR  |          | N/A (booted from volume) | m2     |
+--------------------------------------+-----------+--------+----------+--------------------------+--------+

4. Check the error in nova-conductor.log on controller node
  File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7500, in _count_mediated_devices
    mediated_devices = self._get_mediated_devices(types=enabled_vgpu_types)
  File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7750, in _get_mediated_devices
    device = self._get_mediated_device_information(name)
  File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7731, in _get_mediated_device_information
    "uuid": libvirt_utils.mdev_name2uuid(cfgdev.name),
  File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/utils.py", line 583, in mdev_name2uuid
    return str(uuid.UUID(mdev_name[5:].replace('_', '-')))
  File "/usr/lib64/python3.9/uuid.py", line 177, in __init__
    raise ValueError('badly formed hexadecimal UUID string')
ValueError: badly formed hexadecimal UUID string

5. Check the code
nova/virt/libvirt/utils.py:
def mdev_name2uuid(mdev_name: str) -> str:
    """Convert an mdev name (of the form mdev_<uuid_with_underscores>) to a
    uuid (of the form 8-4-4-4-12).
    """
    return str(uuid.UUID(mdev_name[5:].replace('_', '-')))
=> This line needs to change so that the trailing PCI address is not included in the string passed to uuid.UUID().

More details:
The mdev_name value comes from libvirt's list of node devices:
- driver.py: _get_mediated_devices runs dev_names = self._host.list_mediated_devices() or [] and passes each name to _get_mediated_device_information
- host.py: list_mediated_devices calls _list_devices("mdev", flags=flags), and _list_devices calls self.get_connection().listDevices(cap, flags)

[heat-admin@compute-0 0000:3d:00.0]$ sudo podman exec -it nova_virtqemud python3
Python 3.9.10 (main, Feb  9 2022, 00:00:00) 
[GCC 11.2.1 20220127 (Red Hat 11.2.1-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import libvirt
>>> conn = libvirt.open('qemu:///system')
>>> conn.listDevices('mdev')
['mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161_0000_3d_00_0']

>>> import uuid
>>> uuid.UUID("b81a2fb4-1bcf-45b0-b61e-efba7f35b161").version
4
>>> uuid.UUID("b81a2fb4-1bcf-45b0-b61e-efba7f35b161-0000-3d-00-0").version
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.9/uuid.py", line 177, in __init__
    raise ValueError('badly formed hexadecimal UUID string')
ValueError: badly formed hexadecimal UUID string
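
The failure in the session above suggests one possible way for the conversion to accept both name formats: keep only the first 36 characters (the 8-4-4-4-12 UUID) after the mdev_ prefix and ignore whatever follows. The following is an illustrative sketch under that assumption, not necessarily the change made in nova. The name mdev_name2uuid_tolerant and the constant UUID_TEXT_LENGTH are hypothetical.

```python
import uuid

# Length of a UUID in 8-4-4-4-12 text form: 32 hex digits plus 4 hyphens.
UUID_TEXT_LENGTH = 36


def mdev_name2uuid_tolerant(mdev_name: str) -> str:
    """Convert mdev_<uuid_with_underscores>[_<pci_address>] to a UUID string.

    Hypothetical variant of mdev_name2uuid: any suffix after the UUID
    (assumed to be the parent PCI address) is discarded before parsing.
    """
    prefix = "mdev_"
    if not mdev_name.startswith(prefix):
        raise ValueError(f"not an mdev node device name: {mdev_name!r}")
    candidate = mdev_name[len(prefix):].replace("_", "-")
    # uuid.UUID still raises ValueError if the remaining text is malformed.
    return str(uuid.UUID(candidate[:UUID_TEXT_LENGTH]))


old_style = "mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161"
new_style = "mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161_0000_3d_00_0"
print(mdev_name2uuid_tolerant(old_style))
print(mdev_name2uuid_tolerant(new_style))
```

Both calls print b81a2fb4-1bcf-45b0-b61e-efba7f35b161. Truncating by length avoids parsing the PCI address at all, at the cost of silently accepting any trailing text after the UUID.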


Actual results:
Creating the VM with a vGPU fails. nova/virt/libvirt/utils.py raises:
"ValueError: badly formed hexadecimal UUID string"

Expected results:
The VM is created with the vGPU device successfully.

Additional info:
- nova-conductor.log

Comment 3 Sylvain Bauza 2022-11-16 16:21:55 UTC
This is a known issue caused by a new libvirtd release (7.7) that changed the mdev names. Since we now ship this version with RHEL9 on OSP17.x, we are hit by the behavioural change without having seen it upstream first.

The tracking BZ is https://bugzilla.redhat.com/show_bug.cgi?id=2109616 and we're planning to backport the upstream changes down to 17.1 as soon as they're merged upstream in the Antelope release, hopefully in the coming weeks.

Comment 4 Sylvain Bauza 2022-11-16 16:22:14 UTC

*** This bug has been marked as a duplicate of bug 2109616 ***