Bug 2109616

Summary: Nova fails to parse new libvirt mediated device name format
Product: Red Hat OpenStack
Reporter: Sylvain Bauza <sbauza>
Component: openstack-nova
Assignee: Sylvain Bauza <sbauza>
Status: CLOSED ERRATA
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: medium
Docs Contact:
Priority: medium
Version: 17.0 (Wallaby)
CC: alifshit, chhu, dasmith, eglynn, igallagh, jamsmith, jhakimra, jjongsma, joflynn, jparker, juzhou, kchamart, sbauza, sgordon, smooney, spower, vromanso
Target Milestone: ga
Keywords: Patch, Regression, Triaged, UpgradeBlocker
Target Release: 17.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-nova-23.2.3-1.20221209191244.bbf626c.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, the Compute service was unable to determine the VGPU resource use because the mediated device name format changed in libvirt 7.7. With this update, the Compute service can now parse the new mediated device name format.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-16 01:11:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Embargoed:
Bug Depends On: 2109450    
Bug Blocks: 1761861    

Description Sylvain Bauza 2022-07-21 15:46:07 UTC
Description of problem:

With libvirt 7.7, mediated device names changed format, so Nova is no longer able to find them.
The impact is not obvious at first glance, but the resource update we run every 60 seconds now raises an exception, so we no longer know the remaining VGPU capacity.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create an instance with a VGPU flavor
2. Look at the n-cpu log; you'll see an exception every 60 seconds



2021-11-19 22:51:45.952 7 ERROR nova.compute.manager [req-570c7e8f-0540-49fb-b2b0-8c2ac932e4dc - - - - -] Error updating resources for node: ValueError: badly formed hexadecimal UUID string
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager Traceback (most recent call last):
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 9993, in _update_available_resource_for_node
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 895, in update_available_resource
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager self._update_available_resource(context, resources, startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager return f(*args, **kwargs)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 975, in _update_available_resource
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager self._update(context, cn, startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1227, in _update
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager self._update_to_placement(context, compute_node, startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager return Retrying(*dargs, **dkw).call(f, *args, **kw)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 206, in call
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager return attempt.get(self._wrap_exception)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 247, in get
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager six.reraise(self.value[0], self.value[1], self.value[2])
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/usr/local/lib/python3.6/site-packages/six.py", line 719, in reraise
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager raise value
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 200, in call
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1163, in _update_to_placement
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager self.driver.update_provider_tree(prov_tree, nodename)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8355, in update_provider_tree
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager provider_tree, nodename, allocations=allocations)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8757, in _update_provider_tree_for_vgpu
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager inventories_dict = self._get_gpu_inventories()
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7597, in _get_gpu_inventories
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager count_per_parent = self._count_mediated_devices(enabled_mdev_types)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7538, in _count_mediated_devices
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager mediated_devices = self._get_mediated_devices(types=enabled_mdev_types)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7788, in _get_mediated_devices
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager device = self._get_mediated_device_information(name)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7769, in _get_mediated_device_information
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager "uuid": libvirt_utils.mdev_name2uuid(cfgdev.name),
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/utils.py", line 583, in mdev_name2uuid
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager return str(uuid.UUID(mdev_name[5:].replace('_', '-')))
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager File "/usr/lib64/python3.6/uuid.py", line 140, in __init__
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager raise ValueError('badly formed hexadecimal UUID string')
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager ValueError: badly formed hexadecimal UUID string
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager


A proposed fix against upstream master is already in flight; we need to backport it down to 17.0 as soon as it merges.
https://review.opendev.org/c/openstack/nova/+/838976
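For context, the failure comes from converting the nodedev name back into a UUID: since libvirt 7.7 the parent PCI address is appended to the name, so the naive underscore-to-hyphen conversion no longer yields a valid 36-character UUID. A minimal sketch of a tolerant parser (illustrative only, not the exact Nova code; the sample names below are made up):

```python
import uuid


def mdev_name2uuid(mdev_name):
    """Derive the mdev UUID from a libvirt nodedev name.

    Before libvirt 7.7 the name was 'mdev_<uuid_with_underscores>';
    since 7.7 the parent PCI address is appended, e.g.
    'mdev_<uuid_with_underscores>_0000_06_00_0'.
    A canonical UUID is always 36 characters long, so truncating
    after the 'mdev_' prefix handles both formats.
    """
    mdev_uuid = mdev_name[5:].replace('_', '-')
    return str(uuid.UUID(mdev_uuid[:36]))
```

With this change, both the old and the new name formats resolve to the same UUID, whereas the original `uuid.UUID(mdev_name[5:].replace('_', '-'))` raises `ValueError` on the suffixed form.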

Comment 3 Artom Lifshitz 2022-08-23 16:00:40 UTC
This will need a bug fix doc text because of the following known issue release note: https://bugzilla.redhat.com/show_bug.cgi?id=2120726

Comment 4 Sylvain Bauza 2022-11-16 16:22:14 UTC
*** Bug 2142768 has been marked as a duplicate of this bug. ***

Comment 5 Jonathon Jongsma 2022-11-16 16:35:55 UTC
Hi Sylvain,

If I'm reading things correctly, it looks like the proposed fix here is still attempting to parse the data from the nodedev name. From libvirt's point of view, the nodedev name is just a unique opaque string that identifies a device. There is no guarantee that it is stable or can be parsed for information (thus, the change in name). Admittedly, that was previously the only way to get the UUID for the mdev, but versions of libvirt after 7.3.0 provide the uuid in the nodedev xml, so you can use that instead of parsing the UUID from the name. (You may need to fall back to parsing the name for older versions of libvirt, though.)

Here's a similar virt-manager/virt-install issue for reference: https://github.com/virt-manager/virt-manager/pull/319
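The suggestion above can be sketched as follows. The sample XML, field values, and helper name are hypothetical; the `<uuid>` element inside the mdev capability is what libvirt >= 7.3.0 exposes, with name parsing kept only as a fallback for older versions:

```python
import uuid
import xml.etree.ElementTree as ET

# Hypothetical nodedev XML as returned by virNodeDevice.XMLDesc()
# on libvirt >= 7.3.0; values are illustrative.
NODEDEV_XML = """
<device>
  <name>mdev_c60a7ec4_d6bd_46b5_a1b8_1b58dcc53b51_0000_06_00_0</name>
  <parent>pci_0000_06_00_0</parent>
  <capability type='mdev'>
    <type id='nvidia-319'/>
    <uuid>c60a7ec4-d6bd-46b5-a1b8-1b58dcc53b51</uuid>
  </capability>
</device>
"""


def mdev_uuid_from_xmldesc(xmldesc):
    """Prefer the <uuid> element in the mdev capability; fall back to
    parsing the opaque name on libvirt < 7.3.0 where it is absent."""
    root = ET.fromstring(xmldesc)
    uuid_elem = root.find("./capability[@type='mdev']/uuid")
    if uuid_elem is not None:
        return uuid_elem.text
    # Fallback: derive the UUID from the first 36 characters of the
    # underscore-converted name (handles the post-7.7 PCI suffix too).
    name = root.findtext('name')
    return str(uuid.UUID(name[5:].replace('_', '-')[:36]))
```

Using the XML element avoids depending on the name format entirely, which is the robust long-term approach.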

Comment 6 spower 2022-11-22 16:25:13 UTC
We are not at the blocker stage for 17.1 yet, so there is no need to propose a blocker for OSP 17.1. I will remove the flag.

Comment 7 Sylvain Bauza 2022-11-30 08:27:10 UTC
(In reply to Jonathon Jongsma from comment #5)
> Admittedly, that was
> previously the only way to get the UUID for the mdev, but versions of
> libvirt after 7.3.0 provide the uuid in the nodedev xml, so you can use that
> instead of parsing the UUID from the name. (You may need to fall back to
> parsing the name for older versions of libvirt, though.)
> 
> Here's a similar virt-manager/virt-install issue for reference:
> https://github.com/virt-manager/virt-manager/pull/319

Hi Jonathon, thanks a lot for helping us resolve this issue. Indeed, we found that using the mdev name was a fragile API, so the change now uses the uuid from the XML:
https://code.engineering.redhat.com/gerrit/c/nova/+/436530/1/nova/virt/libvirt/driver.py#7768

Anyway, thanks ;-)

Btw, I just backported the master changes to the 17.1 branch:
https://code.engineering.redhat.com/gerrit/c/nova/+/436529
https://code.engineering.redhat.com/gerrit/c/nova/+/436530


Putting the BZ to POST, then.

Comment 8 Sylvain Bauza 2022-12-06 17:09:04 UTC
We had some issues to resolve, so the two patches are now https://code.engineering.redhat.com/gerrit/c/nova/+/437157 and https://code.engineering.redhat.com/gerrit/c/nova/+/436530/

Putting to POST again.

Comment 25 errata-xmlrpc 2023-08-16 01:11:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577