Description of problem:
With libvirt 7.7, mediated device names changed, so Nova can no longer find them. The impact is not obvious, but the resource update that runs every 60 seconds now raises an exception, so we no longer know how much VGPU capacity is left.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create an instance with a VGPU flavor.
2. Look at the n-cpu log; you'll see an exception every 60 seconds:

2021-11-19 22:51:45.952 7 ERROR nova.compute.manager [req-570c7e8f-0540-49fb-b2b0-8c2ac932e4dc - - - - -] Error updating resources for node: ValueError: badly formed hexadecimal UUID string
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager Traceback (most recent call last):
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 9993, in _update_available_resource_for_node
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 895, in update_available_resource
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     self._update_available_resource(context, resources, startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     return f(*args, **kwargs)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 975, in _update_available_resource
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     self._update(context, cn, startup=startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1227, in _update
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     self._update_to_placement(context, compute_node, startup)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     return Retrying(*dargs, **dkw).call(f, *args, **kw)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 206, in call
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     return attempt.get(self._wrap_exception)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 247, in get
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     six.reraise(self.value[0], self.value[1], self.value[2])
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/usr/local/lib/python3.6/site-packages/six.py", line 719, in reraise
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     raise value
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/retrying.py", line 200, in call
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1163, in _update_to_placement
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     self.driver.update_provider_tree(prov_tree, nodename)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8355, in update_provider_tree
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     provider_tree, nodename, allocations=allocations)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8757, in _update_provider_tree_for_vgpu
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     inventories_dict = self._get_gpu_inventories()
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7597, in _get_gpu_inventories
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     count_per_parent = self._count_mediated_devices(enabled_mdev_types)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7538, in _count_mediated_devices
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     mediated_devices = self._get_mediated_devices(types=enabled_mdev_types)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7788, in _get_mediated_devices
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     device = self._get_mediated_device_information(name)
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7769, in _get_mediated_device_information
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     "uuid": libvirt_utils.mdev_name2uuid(cfgdev.name),
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/utils.py", line 583, in mdev_name2uuid
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     return str(uuid.UUID(mdev_name[5:].replace('_', '-')))
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager   File "/usr/lib64/python3.6/uuid.py", line 140, in __init__
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager     raise ValueError('badly formed hexadecimal UUID string')
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager ValueError: badly formed hexadecimal UUID string
2021-11-19 22:51:45.952 7 ERROR nova.compute.manager

A proposed fix against upstream master is already in flight; we need to backport it down to 17.0 ASAP once it's merged.
https://review.opendev.org/c/openstack/nova/+/838976
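For readers following the traceback, here is a minimal sketch of why the last Nova frame fails. The one-line function body is copied from the `mdev_name2uuid` frame in the log above; the function name `mdev_name2uuid_old`, the helper `try_parse`, and both sample device names are hypothetical. In particular, the new-style name (the UUID followed by a parent PCI address) is an assumption about what libvirt 7.7 reports and should be checked against a real host.

```python
import uuid


def mdev_name2uuid_old(mdev_name):
    # Same expression as the nova/virt/libvirt/utils.py frame in the
    # traceback: drop the "mdev_" prefix and treat the rest as a UUID.
    return str(uuid.UUID(mdev_name[5:].replace('_', '-')))


# Hypothetical sample names, for illustration only.
OLD_STYLE_NAME = "mdev_b2107403_110c_45b8_af87_3cc4b3a31e51"
# Assumed new-style name: the old name plus a parent-address suffix.
NEW_STYLE_NAME = OLD_STYLE_NAME + "_0000_41_00_0"


def try_parse(name):
    # Return the parsed UUID, or a description of the failure.
    try:
        return mdev_name2uuid_old(name)
    except ValueError as exc:
        return "ValueError: %s" % exc


if __name__ == "__main__":
    print(try_parse(OLD_STYLE_NAME))
    print(try_parse(NEW_STYLE_NAME))
```

Anything appended after the UUID makes the string longer than 32 hex digits once the separators are removed, so `uuid.UUID` rejects it, and the periodic resource update fails with it.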
Moving to 17.1, see https://bugzilla.redhat.com/show_bug.cgi?id=2116979 and https://bugzilla.redhat.com/show_bug.cgi?id=2116980.
This will need a bug fix doctext because of the following known issue release note: https://bugzilla.redhat.com/show_bug.cgi?id=2120726
*** Bug 2142768 has been marked as a duplicate of this bug. ***
Hi Sylvain,

If I'm reading things correctly, it looks like the proposed fix here is still attempting to parse the data from the nodedev name. From libvirt's point of view, the nodedev name is just a unique opaque string that identifies a device. There is no guarantee that it is stable or can be parsed for information (thus, the change in name).

Admittedly, that was previously the only way to get the UUID for the mdev, but versions of libvirt after 7.3.0 provide the uuid in the nodedev xml, so you can use that instead of parsing the UUID from the name. (You may need to fall back to parsing the name for older versions of libvirt, though.)

Here's a similar virt-manager/virt-install issue for reference:
https://github.com/virt-manager/virt-manager/pull/319
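The suggestion above (prefer the uuid from the nodedev XML, fall back to the name on older libvirt) could be sketched roughly as follows. This is not the Nova patch: the function names, the sample XML documents, and the assumption that the uuid sits in a `<uuid>` element under `<capability type='mdev'>` are illustrative and should be verified against the libvirt nodedev XML format documentation.

```python
import uuid
import xml.etree.ElementTree as ET


def mdev_uuid_from_name(mdev_name):
    # Fallback for older libvirt: take only the 36 characters after the
    # "mdev_" prefix, so any trailing suffix in the name is ignored.
    return str(uuid.UUID(mdev_name[5:41].replace('_', '-')))


def mdev_uuid_from_xml(nodedev_xml):
    root = ET.fromstring(nodedev_xml)
    # Assumed location of the uuid in the nodedev XML.
    uuid_text = root.findtext("capability[@type='mdev']/uuid")
    if uuid_text:
        return str(uuid.UUID(uuid_text.strip()))
    name = root.findtext("name")
    if name is None:
        raise ValueError("nodedev XML has neither a uuid nor a name")
    return mdev_uuid_from_name(name)


# Hypothetical nodedev XML documents, for illustration only.
XML_WITH_UUID = """
<device>
  <name>mdev_b2107403_110c_45b8_af87_3cc4b3a31e51_0000_41_00_0</name>
  <capability type='mdev'>
    <uuid>b2107403-110c-45b8-af87-3cc4b3a31e51</uuid>
  </capability>
</device>
"""

XML_WITHOUT_UUID = """
<device>
  <name>mdev_b2107403_110c_45b8_af87_3cc4b3a31e51</name>
  <capability type='mdev'/>
</device>
"""

if __name__ == "__main__":
    print(mdev_uuid_from_xml(XML_WITH_UUID))
    print(mdev_uuid_from_xml(XML_WITHOUT_UUID))
```

The fallback slices a fixed-width UUID out of the name rather than parsing everything after the prefix, which is only safe as long as the UUID immediately follows `mdev_`; since the name is opaque to libvirt, the XML path should remain the primary source.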
We are not in a blockers stage for 17.1 yet so no need to propose a blocker for OSP 17.1. I will remove the flag.
(In reply to Jonathon Jongsma from comment #5)
> Admittedly, that was
> previously the only way to get the UUID for the mdev, but versions of
> libvirt after 7.3.0 provide the uuid in the nodedev xml, so you can use that
> instead of parsing the UUID from the name. (You may need to fall back to
> name parsing the name for older versions of libvirt though).
>
> Here's a similar virt-manager/virt-install issue for reference:
> https://github.com/virt-manager/virt-manager/pull/319

Hi Jonathon,

Thanks a lot for helping us resolve this issue. Indeed, we found that relying on the mdev name was fragile, so the change now uses the uuid from the XML:
https://code.engineering.redhat.com/gerrit/c/nova/+/436530/1/nova/virt/libvirt/driver.py#7768

Anyway, thanks ;-)

Btw, I just backported the master changes to the 17.1 branch:
https://code.engineering.redhat.com/gerrit/c/nova/+/436529
https://code.engineering.redhat.com/gerrit/c/nova/+/436530

Putting the BZ to POST.
We had some issues we needed to resolve, so now the two patches are:
https://code.engineering.redhat.com/gerrit/c/nova/+/437157
https://code.engineering.redhat.com/gerrit/c/nova/+/436530/

Putting to POST again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2023:4577