Bug 2109616 - Nova fails to parse new libvirt mediated device name format
Summary: Nova fails to parse new libvirt mediated device name format
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ga
Target Release: 17.1
Assignee: Sylvain Bauza
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Duplicates: 2142768
Depends On: 2109450
Blocks: 1761861
 
Reported: 2022-07-21 15:46 UTC by Sylvain Bauza
Modified: 2023-09-07 15:16 UTC (History)
17 users

Fixed In Version: openstack-nova-23.2.3-1.20221209191244.bbf626c.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, the Compute service could not determine VGPU resource usage because the mediated device name format changed in libvirt 7.7. With this update, the Compute service can parse the new mediated device name format.
Clone Of:
Environment:
Last Closed: 2023-08-16 01:11:24 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1951656 0 None None None 2022-07-21 15:46:06 UTC
OpenStack gerrit 838976 0 None NEW Handle mdev devices in libvirt 7.7+ 2022-08-03 10:16:51 UTC
OpenStack gerrit 864418 0 None NEW Deprecate mdev creation and hardfail on reboot when missing. 2022-11-16 16:24:14 UTC
Red Hat Issue Tracker OSP-17786 0 None None None 2022-07-21 15:51:09 UTC
Red Hat Product Errata RHEA-2023:4577 0 None None None 2023-08-16 01:11:57 UTC

Description Sylvain Bauza 2022-07-21 15:46:07 UTC
Description of problem:

With libvirt 7.7, the mediated device name format changed, so Nova can no longer find the devices.
The impact is not obvious, but the resource update we run every 60 seconds now raises an exception, so we no longer know how much VGPU capacity is actually left.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create an instance with a VGPU flavor
2. Look at the n-cpu log; you will see an exception every 60 seconds:



2021-11-19 22:51:45.952 7 ERROR nova.compute.manager [req-570c7e8f-0540-49fb-b2b0-8c2ac932e4dc - - - - -] Error updating resources for node: ValueError: badly formed hexadecimal UUID string
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager Traceback (most recent call last):
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 9993, in _update_available_resource_for_node
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 895, in update_available_resource
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager self._update_available_resource(context, resources, startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager return f(*args, **kwargs)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 975, in _update_available_resource
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager self._update(context, cn, startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1227, in _update
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager self._update_to_placement(context, compute_node, startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager return Retrying(*dargs, **dkw).call(f, *args, **kw)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 206, in call
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager return attempt.get(self._wrap_exception)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 247, in get
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager six.reraise(self.value[0], self.value[1], self.value[2])
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/usr/local/lib/python3.6/site-packages/six.py", line 719, in reraise
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager raise value
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 200, in call
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1163, in _update_to_placement
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager self.driver.update_provider_tree(prov_tree, nodename)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8355, in update_provider_tree
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager provider_tree, nodename, allocations=allocations)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8757, in _update_provider_tree_for_vgpu
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager inventories_dict = self._get_gpu_inventories()
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7597, in _get_gpu_inventories
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager count_per_parent = self._count_mediated_devices(enabled_mdev_types)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7538, in _count_mediated_devices
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager mediated_devices = self._get_mediated_devices(types=enabled_mdev_types)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7788, in _get_mediated_devices
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager device = self._get_mediated_device_information(name)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7769, in _get_mediated_device_information
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager "uuid": libvirt_utils.mdev_name2uuid(cfgdev.name),
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/utils.py", line 583, in mdev_name2uuid
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager return str(uuid.UUID(mdev_name[5:].replace('_', '-')))
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/usr/lib64/python3.6/uuid.py", line 140, in __init__
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager raise ValueError('badly formed hexadecimal UUID string')
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager ValueError: badly formed hexadecimal UUID string
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager


A proposed fix against upstream master is already in flight; we need to backport it down to 17.0 as soon as it is merged:
https://review.opendev.org/c/openstack/nova/+/838976
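For illustration only (this is not the upstream patch), here is a minimal sketch of a name parser that tolerates both formats, assuming the libvirt 7.7+ name appends the parent PCI address after the UUID part (e.g. mdev_b2107403_110c_45b0_af87_32cc91597b8a_0000_41_00_0) while older libvirt used plain mdev_<uuid with underscores>:

import re
import uuid

# Illustrative sketch only, not the actual Nova change. Assumes libvirt 7.7+
# names an mdev "mdev_<uuid with underscores>_<parent PCI address>", while
# older libvirt used just "mdev_<uuid with underscores>".
_MDEV_NAME_RE = re.compile(
    r"^mdev_"
    r"([0-9a-f]{8}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{12})"
    r"(?:_.*)?$")

def mdev_name2uuid(mdev_name):
    """Convert a libvirt mdev nodedev name to its canonical UUID string."""
    match = _MDEV_NAME_RE.match(mdev_name)
    if not match:
        raise ValueError("unrecognized mdev name: %s" % mdev_name)
    # Only the UUID portion is converted; a trailing parent address is ignored.
    return str(uuid.UUID(match.group(1).replace('_', '-')))

# Old (pre-7.7) and new (7.7+) names resolve to the same UUID:
assert (mdev_name2uuid("mdev_b2107403_110c_45b0_af87_32cc91597b8a") ==
        mdev_name2uuid("mdev_b2107403_110c_45b0_af87_32cc91597b8a_0000_41_00_0"))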

Comment 3 Artom Lifshitz 2022-08-23 16:00:40 UTC
This will need a bug fix doc text because of the following known issue release note: https://bugzilla.redhat.com/show_bug.cgi?id=2120726

Comment 4 Sylvain Bauza 2022-11-16 16:22:14 UTC
*** Bug 2142768 has been marked as a duplicate of this bug. ***

Comment 5 Jonathon Jongsma 2022-11-16 16:35:55 UTC
Hi Sylvain,

If I'm reading things correctly, it looks like the proposed fix here is still attempting to parse the data from the nodedev name. From libvirt's point of view, the nodedev name is just a unique, opaque string that identifies a device. There is no guarantee that it is stable or can be parsed for information (hence the change in name). Admittedly, that was previously the only way to get the UUID for the mdev, but versions of libvirt after 7.3.0 provide the uuid in the nodedev XML, so you can use that instead of parsing the UUID from the name. (You may need to fall back to parsing the name for older versions of libvirt, though.)

Here's a similar virt-manager/virt-install issue for reference: https://github.com/virt-manager/virt-manager/pull/319
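A minimal sketch of that approach, assuming the nodedev XML shape libvirt >= 7.3.0 produces (a <uuid> element under <capability type='mdev'>); name_to_uuid here is a hypothetical fallback callable for older libvirt, not an existing Nova helper:

from xml.etree import ElementTree

def mdev_uuid_from_nodedev_xml(xml_str, name_to_uuid=None):
    """Prefer the <uuid> element that libvirt >= 7.3.0 exposes in the mdev
    nodedev XML; otherwise fall back to parsing the nodedev name.

    name_to_uuid is a hypothetical fallback callable (e.g. a name parser
    such as the sketch above) used when the XML carries no <uuid> element.
    """
    dev = ElementTree.fromstring(xml_str)
    uuid_elem = dev.find("./capability[@type='mdev']/uuid")
    if uuid_elem is not None and uuid_elem.text:
        return uuid_elem.text.strip()
    if name_to_uuid is not None:
        return name_to_uuid(dev.find("name").text)
    raise ValueError("mdev UUID not found in nodedev XML")

example_xml = """
<device>
  <name>mdev_b2107403_110c_45b0_af87_32cc91597b8a_0000_41_00_0</name>
  <parent>pci_0000_41_00_0</parent>
  <capability type='mdev'>
    <type id='nvidia-35'/>
    <uuid>b2107403-110c-45b0-af87-32cc91597b8a</uuid>
  </capability>
</device>
"""
print(mdev_uuid_from_nodedev_xml(example_xml))
# -> b2107403-110c-45b0-af87-32cc91597b8a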

Comment 6 spower 2022-11-22 16:25:13 UTC
We are not at the blocker stage for 17.1 yet, so there is no need to propose a blocker for OSP 17.1. I will remove the flag.

Comment 7 Sylvain Bauza 2022-11-30 08:27:10 UTC
(In reply to Jonathon Jongsma from comment #5)
> Admittedly, that was
> previously the only way to get the UUID for the mdev, but versions of
> libvirt after 7.3.0 provide the uuid in the nodedev xml, so you can use that
> instead of parsing the UUID from the name. (You may need to fall back to
> name parsing the name for older versions of libvirt though).  
> 
> Here's a similar virt-manager/virt-install issue for reference:
> https://github.com/virt-manager/virt-manager/pull/319

Hi Jonathon, thanks a lot for helping us resolve this issue. Indeed, we found that using the mdev name was a fragile API, so the change now uses the uuid from the XML:
https://code.engineering.redhat.com/gerrit/c/nova/+/436530/1/nova/virt/libvirt/driver.py#7768

Anyway, thanks ;-)

Btw, I just backported the master changes to the 17.1 branch:
https://code.engineering.redhat.com/gerrit/c/nova/+/436529
https://code.engineering.redhat.com/gerrit/c/nova/+/436530


Putting the BZ to POST, then.

Comment 8 Sylvain Bauza 2022-12-06 17:09:04 UTC
We had some issues we needed to resolve, so the two patches are now https://code.engineering.redhat.com/gerrit/c/nova/+/437157 and https://code.engineering.redhat.com/gerrit/c/nova/+/436530/

Putting to POST again.

Comment 25 errata-xmlrpc 2023-08-16 01:11:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577

