Bug 2259641 - [17.1] [QE Tracker] Nova errors out due to libvirt failing to parse PCI device VPD (virtual private data)
Summary: [17.1] [QE Tracker] Nova errors out due to libvirt failing to parse PCI devic...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 17.1 (Wallaby)
Hardware: All
OS: All
Priority: low
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-01-22 14:10 UTC by Alex Stupnikov
Modified: 2024-04-30 17:42 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-04-30 17:42:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-31234 0 None None None 2024-01-22 14:14:52 UTC

Description Alex Stupnikov 2024-01-22 14:10:35 UTC
Description of problem:

update_available_resource fails because libvirt returns malformed XML for some PCI devices [1]. The affected devices are NICs that expose VPD: their dumpxml output contains the following field, in which the '<' character is not escaped:
        <vendor_field index='Z'>6<1</vendor_field>
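
For illustration, here is a minimal sketch of why that field breaks parsing. It is not the Nova code path: Nova uses lxml (see the traceback in [1]), while this sketch uses the standard-library xml.etree parser so that it runs without extra packages. The helper names field_xml and parses are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Raw VPD value as it appears in the problematic dumpxml output.
RAW_VALUE = "6<1"


def field_xml(value: str) -> str:
    """Wrap a value in the vendor_field element from the report (hypothetical helper)."""
    return f"<vendor_field index='Z'>{value}</vendor_field>"


def parses(xml_text: str) -> bool:
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The unescaped '<' starts what the parser treats as a new tag named "1",
# which is not a valid element name, so parsing fails.
broken = field_xml(RAW_VALUE)

# Escaping the value ('<' becomes '&lt;') yields well-formed XML.
fixed = field_xml(escape(RAW_VALUE))

print(parses(broken), parses(fixed))
```

The escaped form round-trips: parsing it gives back the original text value of the field.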

This looks like a bug in libvirt, so I am reporting it to engineering to get help ASAP: many more customers may be affected because many upgrades are currently in progress.

[1]
2024-01-19 10:06:52.927 2 DEBUG nova.compute.resource_tracker [req-49fd3ccf-12b6-405d-8c9d-985ae1184e27 - - - - -] Auditing locally available compute resources for compute.example.com (node: compute.example.com) update_available_resource /usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py:880
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager [req-49fd3ccf-12b6-405d-8c9d-985ae1184e27 - - - - -] Error updating resources for node compute.example.com.:   File "<string>", line 40

2024-01-19 10:06:54.733 2 ERROR nova.compute.manager Traceback (most recent call last):
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 10008, in _update_available_resource_for_node
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager     self.rt.update_available_resource(context, nodename,
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 884, in update_available_resource
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 9163, in get_available_resource
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager     data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7816, in _get_pci_passthrough_devices
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager     pci_info = [
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7817, in <listcomp>
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager     self._host._get_pcidev_info(name, dev, net_devs, vdpa_devs)
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/host.py", line 1319, in _get_pcidev_info
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager     cfgdev.parse_str(xmlstr)
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/config.py", line 73, in parse_str
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager     self.parse_dom(etree.fromstring(xmlstr))
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "src/lxml/parser.pxi", line 1899, in lxml.etree._parseMemoryDocument
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "src/lxml/parser.pxi", line 1780, in lxml.etree._parseDoc
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager   File "<string>", line 40
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager lxml.etree.XMLSyntaxError: StartTag: invalid element name, line 40, column 35
2024-01-19 10:06:54.733 2 ERROR nova.compute.manager

Version-Release number of selected component (if applicable):
RHOSP 17.1
rhosp-rhel9/openstack-nova-libvirt 17.1 sha256:bf5310f5839bd8648fc9a26d17e2d57230ca25baa27bd86c9315688102d4ad95  207e3ea7ed6e  7 weeks ago   1.54 GB


How reproducible:
This problem is hardware-dependent: it reproduces without any extra steps when nova-compute is started on the problematic hardware.


Actual results:
Libvirt XML dumps are malformed, and Nova's update_available_resource fails as a result.

Expected results:
Libvirt XML dumps are valid (bug in libvirt). If an XML dump is malformed, Nova reports an error but its operations are not blocked (room for improvement on the Nova side).
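
One possible shape of the Nova-side improvement mentioned above is sketched here. This is an assumption about how it could look, not Nova's actual code or a proposed patch: the function parse_device_xmls and the device names in the usage example are hypothetical, and the standard-library parser stands in for lxml. The idea is to log and skip a device whose XML cannot be parsed instead of letting one bad device abort the whole resource update.

```python
import logging
import xml.etree.ElementTree as ET

LOG = logging.getLogger(__name__)


def parse_device_xmls(xml_by_name):
    """Parse per-device XML strings, skipping devices with malformed XML.

    Returns (parsed, skipped): parsed maps device name to its parsed
    element, skipped lists the names of devices that failed to parse.
    """
    parsed = {}
    skipped = []
    for name, xmlstr in xml_by_name.items():
        try:
            parsed[name] = ET.fromstring(xmlstr)
        except ET.ParseError as exc:
            # Report the problem, but keep processing the other devices.
            LOG.error("Skipping PCI device %s: malformed XML from libvirt: %s",
                      name, exc)
            skipped.append(name)
    return parsed, skipped


# Usage with made-up device names: the second device carries the bad field.
devices = {
    "pci_0000_01_00_0": "<device><name>pci_0000_01_00_0</name></device>",
    "pci_0000_02_00_0": "<device><vendor_field index='Z'>6<1</vendor_field></device>",
}
parsed, skipped = parse_device_xmls(devices)
```

With this approach the good device is still reported while the malformed one is logged and left out.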

Additional info:
Bug #2259636 was reported to request improved logging for problematic XML dumps.

Comment 7 Kashyap Chamarthy 2024-01-25 15:42:34 UTC
I've changed the bug title to mark this as a "QE Tracker", as we depend on the libvirt bug here. Here's the publicly accessible libvirt issue:


https://issues.redhat.com/browse/RHEL-22314 — libvirt failing to parse PCI device VPD (virtual private data) for some hardware

Comment 8 smooney 2024-04-30 17:42:23 UTC
this was backported all the way to RHEL 9.0 in January
https://issues.redhat.com/browse/RHEL-22398
and it was released in 9.2 as well, on 2024/03/05 https://issues.redhat.com/browse/RHEL-22399

the fixed in build was libvirt-9.0.0-10.4.el9_2 


the current latest 17.1.2 tag is 17.1.2-5.1712881171 and it contains libvirt-daemon-9.0.0-10.5.el9_2.x86_64

so closing this as CURRENTRELEASE, as the fix was shipped to the CDN by the automatic container rebuilds when the RHEL release was published to the CDN.
it's also available in 17.1.2-5.1709836652 and 17.1.2-5.1709628728
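
The check behind this conclusion (is the libvirt build in the container at least the fixed-in build?) can be sketched as below. This is a deliberate simplification and not RPM's real version comparison algorithm: it compares only the numeric parts of VERSION-RELEASE and refuses to compare builds with different dist tags. The function names parse_vr and has_fix are hypothetical.

```python
import re

# Fixed-in build from this comment, without the package name.
FIXED_IN = "9.0.0-10.4.el9_2"


def parse_vr(vr: str):
    """Split 'VERSION-RELEASE' into (version tuple, release tuple, dist tag).

    Simplified: assumes a release of the form '<numbers>.<dist>', for
    example '10.4.el9_2'.
    """
    version, release = vr.split("-", 1)
    match = re.match(r"(\d+(?:\.\d+)*)\.(el\S+)$", release)
    if match is None:
        raise ValueError(f"unsupported release format: {release!r}")
    rel_nums, dist = match.groups()
    return (
        tuple(int(part) for part in version.split(".")),
        tuple(int(part) for part in rel_nums.split(".")),
        dist,
    )


def has_fix(installed: str, fixed: str = FIXED_IN) -> bool:
    """Return True if the installed build is at or above the fixed-in build."""
    inst_ver, inst_rel, inst_dist = parse_vr(installed)
    fix_ver, fix_rel, fix_dist = parse_vr(fixed)
    if inst_dist != fix_dist:
        # Builds from different dist streams are not comparable this way.
        raise ValueError("dist tags differ; use a real RPM comparison")
    return (inst_ver, inst_rel) >= (fix_ver, fix_rel)


# The build found in the 17.1.2-5.1712881171 tag.
print(has_fix("9.0.0-10.5.el9_2"))
```

For anything beyond a quick sanity check, a real RPM version comparison should be used instead of this tuple comparison.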

