Bug 2142768 - Failed to create VM with vGPU - Hit error "badly formed hexadecimal UUID string"
Summary: Failed to create VM with vGPU - Hit error "badly formed hexadecimal UUID string"
Keywords:
Status: CLOSED DUPLICATE of bug 2109616
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard: libvirt_OSP_INT
Depends On:
Blocks:
Reported: 2022-11-15 08:35 UTC by chhu
Modified: 2023-03-21 20:00 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-16 16:22:14 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-20178 0 None None None 2022-11-15 08:38:16 UTC

Description chhu 2022-11-15 08:35:55 UTC
Description of problem:
Creating a VM with a vGPU fails with an error raised from nova/virt/libvirt/utils.py:
"ValueError: badly formed hexadecimal UUID string"

Version-Release number of selected component (if applicable):
rhosp17-openstack-nova-compute:17.0_20220908.1
python3-nova-23.2.2-0.20220720130412.7074ac0.el9ost.noarch
python3-novaclient-17.4.0-0.20210812172018.54d4da1.el9ost.noarch
openstack-nova-common-23.2.2-0.20220720130412.7074ac0.el9ost.noarch
openstack-nova-compute-23.2.2-0.20220720130412.7074ac0.el9ost.noarch
openstack-nova-migration-23.2.2-0.20220720130412.7074ac0.el9ost.noarch

Use libvirt with the fix for Bug 2109450 - libvirt doesn't catch mdevs created thru sysfs

How reproducible:
100%

Steps to Reproduce:
1. Prepare the vGPU environment on OSP17.0
(undercloud) [stack@dell-per740-66 ~]$ ssh heat-admin.24.10
[heat-admin@compute-0 ~]$ lspci|grep VGA
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3d:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
3e:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)

[heat-admin@compute-0 ~]$ cat /sys/class/mdev_bus/0000\:3d\:00.0/mdev_supported_types/nvidia-22/name 
GRID M60-8Q
[heat-admin@compute-0 ~]$ uuid=$(uuidgen)
[heat-admin@compute-0 ~]$ cd /sys/class/mdev_bus/0000:3d:00.0/mdev_supported_types/nvidia-22
[heat-admin@compute-0 nvidia-22]$ sudo chmod 666 create
[heat-admin@compute-0 nvidia-22]$ sudo echo $uuid
b81a2fb4-1bcf-45b0-b61e-efba7f35b161
[heat-admin@compute-0 nvidia-22]$ sudo echo $uuid > create
[heat-admin@compute-0 nvidia-22]$ cd ../../
[heat-admin@compute-0 0000:3d:00.0]$ ls
b81a2fb4-1bcf-45b0-b61e-efba7f35b161  d3cold_allowed            iommu            mdev_supported_types  rescan        resource3_wc

2. In the nova_virtqemud container, check that the mdev is present in the list of node devices
[heat-admin@compute-0 ~]$ sudo podman exec -it nova_virtqemud python3
Python 3.9.10 (main, Feb  9 2022, 00:00:00) 
[GCC 11.2.1 20220127 (Red Hat 11.2.1-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import libvirt
>>> conn = libvirt.open('qemu:///system')
>>> conn.listDevices('mdev')
['mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161_0000_3d_00_0']

3. Try to start a VM with the vGPU device; the VM goes to ERROR
(overcloud) [stack@dell-per740-66 ~]$ openstack flavor create --vcpus 6 --ram 4196 --disk 20 m2
(overcloud) [stack@dell-per740-66 ~]$ openstack flavor set m2 --property "resources:VGPU=1"
(overcloud) [stack@dell-per740-66 ~]$ openstack network create default
(overcloud) [stack@dell-per740-66 ~]$ openstack network list
+--------------------------------------+---------+--------------------------------------+
| ID                                   | Name    | Subnets                              |
+--------------------------------------+---------+--------------------------------------+
| 1dba36ee-d473-4354-a5ab-d6c7b6e0e666 | default | 2d6a822b-a511-47c2-918e-37ee947c0a8d |
+--------------------------------------+---------+--------------------------------------+

(overcloud) [stack@dell-per740-66 ~]$ openstack subnet create default --network default --gateway 192.168.32.1 --subnet-range 192.168.32.0/24
(overcloud) [stack@dell-per740-66 ~]$ openstack image create r9-qcow2 --disk-format qcow2 --container-format bare --file RHEL-9.0-x86_64-latest.qcow2
(overcloud) [stack@dell-per740-66 ~]$ openstack volume create r9-qcow2-vol --size 20 --image r9-qcow2
(overcloud) [stack@dell-per740-66 ~]$ openstack volume list
+--------------------------------------+--------------+-----------+------+-------------+
| ID                                   | Name         | Status    | Size | Attached to |
+--------------------------------------+--------------+-----------+------+-------------+
| bc0c5b4d-40ff-490d-a186-a7450b53c85e | r9-qcow2-vol | available |   20 |             |
+--------------------------------------+--------------+-----------+------+-------------+
(overcloud) [stack@dell-per740-66 ~]$ openstack server create --flavor m2 --volume r9-qcow2-vol --nic net-id=1dba36ee-d473-4354-a5ab-d6c7b6e0e666 vm-r9-vol

(overcloud) [stack@dell-per740-66 ~]$ openstack server list
+--------------------------------------+-----------+--------+----------+--------------------------+--------+
| ID                                   | Name      | Status | Networks | Image                    | Flavor |
+--------------------------------------+-----------+--------+----------+--------------------------+--------+
| ff53ee99-69cc-4633-9434-fdf426a45929 | vm-r9-vol | ERROR  |          | N/A (booted from volume) | m2     |
+--------------------------------------+-----------+--------+----------+--------------------------+--------+

4. Check the error in nova-conductor.log on the controller node
  File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7500, in _count_mediated_devices
    mediated_devices = self._get_mediated_devices(types=enabled_vgpu_types)
  File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7750, in _get_mediated_devices
    device = self._get_mediated_device_information(name)
  File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7731, in _get_mediated_device_information
    "uuid": libvirt_utils.mdev_name2uuid(cfgdev.name),
  File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/utils.py", line 583, in mdev_name2uuid
    return str(uuid.UUID(mdev_name[5:].replace('_', '-')))
  File "/usr/lib64/python3.9/uuid.py", line 177, in __init__
    raise ValueError('badly formed hexadecimal UUID string')
ValueError: badly formed hexadecimal UUID string

5. Check the code
nova/virt/libvirt/utils.py:
def mdev_name2uuid(mdev_name: str) -> str:
    """Convert an mdev name (of the form mdev_<uuid_with_underscores>) to a
    uuid (of the form 8-4-4-4-12).
    """
    return str(uuid.UUID(mdev_name[5:].replace('_', '-')))
=> This line needs to change so that the PCI address suffix in the mdev name is not passed to uuid.UUID.

More details:
mdev_name originates in driver.py, where _get_mediated_devices passes each device name to _get_mediated_device_information:
dev_names = self._host.list_mediated_devices() or []
In host.py, list_mediated_devices calls _list_devices("mdev", flags=flags), which in turn calls self.get_connection().listDevices(cap, flags).

[heat-admin@compute-0 0000:3d:00.0]$ sudo podman exec -it nova_virtqemud python3
Python 3.9.10 (main, Feb  9 2022, 00:00:00) 
[GCC 11.2.1 20220127 (Red Hat 11.2.1-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import libvirt
>>> conn = libvirt.open('qemu:///system')
>>> conn.listDevices('mdev')
['mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161_0000_3d_00_0']

>>> import uuid
>>> uuid.UUID("b81a2fb4-1bcf-45b0-b61e-efba7f35b161").version
4
>>> uuid.UUID("b81a2fb4-1bcf-45b0-b61e-efba7f35b161-0000-3d-00-0").version
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.9/uuid.py", line 177, in __init__
    raise ValueError('badly formed hexadecimal UUID string')
ValueError: badly formed hexadecimal UUID string
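
The session above shows that the conversion succeeds once the trailing PCI address is left out. A minimal sketch of that idea follows, assuming the UUID always occupies the first 36 characters after the "mdev_" prefix. The function name mdev_name2uuid_tolerant and the prefix check are hypothetical and for illustration only; this is not necessarily the change that was merged in nova.

```python
import uuid

# Length of the canonical 8-4-4-4-12 form, including the four hyphens.
UUID_LENGTH = 36


def mdev_name2uuid_tolerant(mdev_name: str) -> str:
    """Convert a libvirt mdev node-device name to a canonical UUID string.

    Hypothetical helper. Accepts both mdev_<uuid_with_underscores> and
    mdev_<uuid_with_underscores>_<pci_address_with_underscores>.
    """
    prefix = "mdev_"
    if not mdev_name.startswith(prefix):
        raise ValueError(f"not an mdev device name: {mdev_name!r}")
    candidate = mdev_name[len(prefix):].replace("_", "-")
    # Keep only the UUID part; whatever follows it (the parent PCI
    # address in the name reported here) is dropped before parsing.
    return str(uuid.UUID(candidate[:UUID_LENGTH]))


old_style = "mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161"
new_style = "mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161_0000_3d_00_0"
print(mdev_name2uuid_tolerant(old_style))
print(mdev_name2uuid_tolerant(new_style))
```

Both calls print b81a2fb4-1bcf-45b0-b61e-efba7f35b161. Truncating to a fixed length keeps the helper independent of the suffix layout, at the cost of silently ignoring anything after the UUID.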


Actual results:
Creating a VM with a vGPU fails with an error raised from nova/virt/libvirt/utils.py:
"ValueError: badly formed hexadecimal UUID string"

Expected results:
The VM with the vGPU device is created successfully.

Additional info:
- nova-conductor.log

Comment 3 Sylvain Bauza 2022-11-16 16:21:55 UTC
This is a known issue caused by a new libvirt release (7.7) that changed the mdev names. Since we now ship this version with RHEL 9 on OSP 17.x, we are hit by the behavioural change without having seen it upstream first.

The tracking BZ is https://bugzilla.redhat.com/show_bug.cgi?id=2109616. We plan to backport the upstream changes down to 17.1 as soon as they are merged upstream in the Antelope release, hopefully within the next few weeks.
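
To illustrate the name change described in this comment, here is a rough sketch that tells the two name forms apart and recovers the parent PCI address from the newer one. The suffix layout (domain_bus_slot_function) is inferred only from the single device name shown in this report, and parse_mdev_name, MdevName and MDEV_NAME_RE are hypothetical names, not part of nova or libvirt.

```python
import re
import uuid
from typing import NamedTuple, Optional

# Assumed layout: mdev_<uuid with underscores>[_<domain>_<bus>_<slot>_<function>]
MDEV_NAME_RE = re.compile(
    r"^mdev_"
    r"(?P<uuid>[0-9a-f]{8}(?:_[0-9a-f]{4}){3}_[0-9a-f]{12})"
    r"(?:_(?P<domain>[0-9a-f]{4})_(?P<bus>[0-9a-f]{2})"
    r"_(?P<slot>[0-9a-f]{2})_(?P<function>[0-7]))?$",
    re.IGNORECASE,
)


class MdevName(NamedTuple):
    mdev_uuid: str
    parent_pci_address: Optional[str]  # None for the older name form


def parse_mdev_name(name: str) -> MdevName:
    """Split an mdev node-device name into its UUID and optional parent."""
    match = MDEV_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised mdev device name: {name!r}")
    mdev_uuid = str(uuid.UUID(match["uuid"].replace("_", "-")))
    parent = None
    if match["domain"] is not None:
        # Rebuild the usual domain:bus:slot.function notation.
        parent = "{domain}:{bus}:{slot}.{function}".format(**match.groupdict())
    return MdevName(mdev_uuid, parent)


print(parse_mdev_name("mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161"))
print(parse_mdev_name("mdev_b81a2fb4_1bcf_45b0_b61e_efba7f35b161_0000_3d_00_0"))
```

For the name from this report, the second call yields the UUID b81a2fb4-1bcf-45b0-b61e-efba7f35b161 with parent 0000:3d:00.0, which matches the sysfs directory used to create the device. A strict pattern like this rejects unexpected suffixes instead of ignoring them, which may or may not be desirable if libvirt changes the format again.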

Comment 4 Sylvain Bauza 2022-11-16 16:22:14 UTC

*** This bug has been marked as a duplicate of bug 2109616 ***

