Bug 1984549 - Mediated devices are not recreated upon reboot
Summary: Mediated devices are not recreated upon reboot
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-21 15:24 UTC by Sylvain Bauza
Modified: 2023-07-19 17:06 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-19 17:06:36 UTC
Target Upstream Version:
Embargoed:




Links
Launchpad 1900800 (last updated 2021-07-21 15:24:29 UTC)
Red Hat Issue Tracker OSP-6361 (last updated 2021-11-15 13:05:29 UTC)

Description Sylvain Bauza 2021-07-21 15:24:30 UTC
Description of problem:

Nova tries to recreate mediated devices when the nova-compute service restarts, but because it does not know which parent PCI device was used for each mdev, it hits an exception instead.
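
For context, below is a minimal sketch (not Nova's actual code) of the lookup that fails, reconstructed from the traceback under "Actual results": on init_host, Nova derives the libvirt node-device name from each mdev UUID assigned to its guests and looks it up, but after a host reboot the sysfs-backed mdev no longer exists. The helper name and the UUID list are illustrative assumptions; the libvirt-python calls are real.

# Minimal sketch of the failing lookup path; not Nova's actual code.
import libvirt

def lookup_assigned_mdevs(conn, assigned_mdev_uuids):
    for mdev_uuid in assigned_mdev_uuids:
        # libvirt names mdev node devices "mdev_<uuid, dashes as underscores>".
        dev_name = 'mdev_' + mdev_uuid.replace('-', '_')
        # After a host reboot the mdev is gone from sysfs, so this raises
        # libvirt.libvirtError: "Node device not found".
        conn.nodeDeviceLookupByName(dev_name)

conn = libvirt.open('qemu:///system')
lookup_assigned_mdevs(conn, ['2265ec83-fa30-4a6c-82aa-e323e84745f3'])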


Version-Release number of selected component (if applicable):
From OSP 16 onwards

How reproducible:


Steps to Reproduce:
1. create an instance with a VGPU resource
2. reboot the host
3. see the nova-compute exception

Actual results:
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service [-] Error starting thread.: libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_2265ec83_fa30_4a6c_82aa_e323e84745f3'
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service Traceback (most recent call last):
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/oslo_service/service.py", line 810, in run_service
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     service.start()
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/service.py", line 172, in start
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     self.manager.init_host()
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 1397, in init_host
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     self.driver.init_host(host=self.host)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 725, in init_host
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     self._recreate_assigned_mediated_devices()
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 799, in _recreate_assigned_mediated_devices
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     dev_info = self._get_mediated_device_information(dev_name)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7104, in _get_mediated_device_information
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     virtdev = self._host.device_lookup_by_name(devname)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/host.py", line 1143, in device_lookup_by_name
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     return self.get_connection().nodeDeviceLookupByName(name)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 190, in doit
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     result = proxy_call(self._autowrap, f, *args, **kwargs)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 148, in proxy_call
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     rv = execute(f, *args, **kwargs)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 129, in execute
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     six.reraise(c, e, tb)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/six.py", line 703, in reraise
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     raise value
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 83, in tworker
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     rv = meth(*args, **kwargs)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib64/python3.6/site-packages/libvirt.py", line 4612, in nodeDeviceLookupByName
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     if ret is None:raise libvirtError('virNodeDeviceLookupByName() failed', conn=self)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_2265ec83_fa30_4a6c_82aa_e323e84745f3'
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service


Expected results:
Mdevs are automatically recreated
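
Recreating an mdev is mechanically simple once the parent device is known; what Nova lacks is precisely the parent. A hedged sketch of the sysfs create mechanism, where the function name, path layout, and example values are illustrative assumptions rather than Nova's code:

# Hedged sketch of mdev creation via sysfs; values are examples only.
import os

def create_mdev_via_sysfs(parent_pci_addr, mdev_type, mdev_uuid):
    # Writing a UUID into the parent's "create" node instantiates the mdev,
    # e.g. parent_pci_addr='0000:06:00.0', mdev_type='nvidia-63'.
    path = os.path.join('/sys/bus/pci/devices', parent_pci_addr,
                        'mdev_supported_types', mdev_type, 'create')
    with open(path, 'w') as f:
        f.write(mdev_uuid)

The parent_pci_addr argument is exactly the piece of information Nova does not persist across reboots, which is why automatic recreation fails.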

Additional info:

There is a related RFE for libvirt (https://bugzilla.redhat.com/show_bug.cgi?id=1699274), but taking advantage of it would require Nova to use mdevctl.
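
For illustration, a hedged sketch of how an mdev could be persisted with mdevctl so it is recreated at boot. The mdevctl subcommand and flags are real; the wrapper function and example values are assumptions, not Nova code:

# Hypothetical wrapper around mdevctl; not Nova code.
import subprocess

def define_persistent_mdev(mdev_uuid, parent_pci_addr, mdev_type):
    # 'mdevctl define --auto' writes a persistent config so the device
    # is started automatically at boot.
    subprocess.run(
        ['mdevctl', 'define', '--uuid', mdev_uuid,
         '--parent', parent_pci_addr, '--type', mdev_type, '--auto'],
        check=True)

define_persistent_mdev('2265ec83-fa30-4a6c-82aa-e323e84745f3',
                       '0000:06:00.0', 'nvidia-63')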

Comment 1 youngcheol 2022-07-05 06:49:52 UTC
Hi,

My customer hit bug 2077885:

https://bugzilla.redhat.com/show_bug.cgi?id=2077885#c8

They didn't replace the GPU card or reboot the node, so they want to know why it happened.

They recently restarted a nova_compute container and re-created nova_libvirt.


Could I tell them it is related to restarting nova_compute? I'd like some help clarifying this.
What confuses me is that the "Description of problem" says restarting the nova-compute service triggers it, but the "Steps to Reproduce" require rebooting the host to reproduce it.


Thank you for your help!
Regards,
YoungCheol.

Comment 2 youngcheol 2022-07-06 03:10:51 UTC
Hi all,

It turned out that the instances with the issue were created before the node reboot.

Regards,
YoungCheol.

Comment 4 Artom Lifshitz 2023-07-13 06:36:05 UTC
I think I can fairly confidently remove the Regression keyword, as I'm pretty sure this has been latent behaviour since the very inception of the vGPU feature. I'm also going to remove the Triaged keyword in order to feed this through our triage process another time. It sounds like we could perhaps convert this to an RFE to make Nova use `mdevctl` and take advantage of that new libvirt RFE (BZ 1699274).

Comment 5 Sylvain Bauza 2023-07-19 17:06:36 UTC
Two action items now, as agreed at the compute triage meeting:

#1: deprecate automatic creation of mdevs by merging https://review.opendev.org/c/openstack/nova/+/864418
#2: change how we create mdevs by using libvirt XML instead of sysfs: https://issues.redhat.com/browse/OSPRH-251 (see the sketch below)
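
For illustration, a hedged sketch of what #2 could look like: defining a persistent mdev through the libvirt node-device API instead of writing to sysfs. nodeDeviceDefineXML() is the libvirt-python binding for virNodeDeviceDefineXML() (libvirt >= 7.3.0); the parent name, type id, and UUID below are example values, not necessarily what Nova will use.

# Sketch of persistent mdev definition via libvirt; example values only.
import libvirt

MDEV_XML = """
<device>
  <parent>pci_0000_06_00_0</parent>
  <capability type='mdev'>
    <type id='nvidia-63'/>
    <uuid>2265ec83-fa30-4a6c-82aa-e323e84745f3</uuid>
  </capability>
</device>
"""

conn = libvirt.open('qemu:///system')
# libvirt records the parent in the persistent definition, so the device
# can be started again after a host reboot (unlike a bare sysfs create).
dev = conn.nodeDeviceDefineXML(MDEV_XML, 0)
dev.create()  # start the defined device now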

Closing the bug report accordingly, now that we have the epic.

