Bug 1984549
| Summary: | Mediated devices are not recreated upon reboot | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sylvain Bauza <sbauza> |
| Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
| Status: | CLOSED NOTABUG | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 16.1 (Train) | CC: | alifshit, dasmith, eglynn, jhakimra, kchamart, pgrist, sbauza, sgordon, vromanso, yocha |
| Last Closed: | 2023-07-19 17:06:36 UTC | Type: | Bug |
Hi,

My customer hit bug 2077885: https://bugzilla.redhat.com/show_bug.cgi?id=2077885#c8

They didn't replace the GPU card or reboot the node, so they want to know why it happened. They recently restarted a nova_compute container and re-created nova_libvirt. Could I tell them it is related to restarting nova_compute? I'd like help clarifying this. What confuses me is that the "Description of problem" says a nova-compute restart triggers it, but the "Steps to Reproduce" say the host must be rebooted to reproduce it.

Thank you for your help!

Regards, YoungCheol.

Hi all,

It turned out the affected instances were created before a node reboot.

Regards, YoungCheol.

I think I can fairly confidently remove the Regression keyword, as I'm pretty sure this has been latent behaviour since the very inception of the vGPU feature. I'm also going to remove the Triaged keyword in order to feed this through our triage process another time.

Sounds like we could perhaps convert this to an RFE to make Nova use `mdevctl` and take advantage of that new libvirt RFE (BZ 1699274).

Two action items, as agreed in the compute triage meeting:

#1: deprecate automatic creation of mdevs by merging https://review.opendev.org/c/openstack/nova/+/864418

#2: change how we create mdevs by using libvirt XML instead of sysfs: https://issues.redhat.com/browse/OSPRH-251

Closing the bug report accordingly, now that we have the epic.
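To make action item #2 more concrete, here is a minimal sketch of the kind of node device XML that libvirt documents for mediated devices. It is not Nova's implementation. The helper name `build_mdev_xml`, the parent device name `pci_0000_41_00_0` and the type `nvidia-35` are hypothetical, and whether a given libvirt release accepts the `<uuid>` element or can create mdevs at all (it relies on `mdevctl`) is an assumption to verify against the deployed version.

```python
import uuid
import xml.etree.ElementTree as ET


def build_mdev_xml(parent, mdev_type, mdev_uuid=None):
    """Return a libvirt node device XML string describing one mdev.

    parent    -- libvirt name of the parent PCI device (hypothetical value)
    mdev_type -- mdev type id exposed by the parent (hypothetical value)
    mdev_uuid -- UUID to reuse; a random one is generated when omitted
    """
    # Validate and normalise the UUID, or generate a new one.
    dev_uuid = str(uuid.UUID(mdev_uuid)) if mdev_uuid else str(uuid.uuid4())

    device = ET.Element("device")
    ET.SubElement(device, "parent").text = parent
    capability = ET.SubElement(device, "capability", type="mdev")
    ET.SubElement(capability, "type", id=mdev_type)
    ET.SubElement(capability, "uuid").text = dev_uuid
    return ET.tostring(device, encoding="unicode")


if __name__ == "__main__":
    # With a live connection, the string could be handed to libvirt-python's
    # conn.nodeDeviceCreateXML(xml, 0); that step is not exercised here.
    print(build_mdev_xml("pci_0000_41_00_0", "nvidia-35",
                         "2265ec83-fa30-4a6c-82aa-e323e84745f3"))
```

Keeping the parent device and the UUID together in one document is what addresses the gap described in this bug: the sysfs approach loses the mapping between an mdev and its PCI device across a reboot.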
Description of problem:
Nova tries to recreate mediated devices when the nova-compute service restarts, but because it doesn't know which PCI device was used for each mdev, it raises an exception.

Version-Release number of selected component (if applicable):
From OSP16

How reproducible:

Steps to Reproduce:
1. create an instance with a VGPU resource
2. reboot the host
3. see the nova-compute exception

Actual results:
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service [-] Error starting thread.: libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_2265ec83_fa30_4a6c_82aa_e323e84745f3'
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service Traceback (most recent call last):
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/oslo_service/service.py", line 810, in run_service
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     service.start()
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/service.py", line 172, in start
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     self.manager.init_host()
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 1397, in init_host
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     self.driver.init_host(host=self.host)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 725, in init_host
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     self._recreate_assigned_mediated_devices()
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 799, in _recreate_assigned_mediated_devices
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     dev_info = self._get_mediated_device_information(dev_name)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7104, in _get_mediated_device_information
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     virtdev = self._host.device_lookup_by_name(devname)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/host.py", line 1143, in device_lookup_by_name
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     return self.get_connection().nodeDeviceLookupByName(name)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 190, in doit
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     result = proxy_call(self._autowrap, f, *args, **kwargs)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 148, in proxy_call
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     rv = execute(f, *args, **kwargs)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 129, in execute
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     six.reraise(c, e, tb)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/six.py", line 703, in reraise
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     raise value
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 83, in tworker
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     rv = meth(*args, **kwargs)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service   File "/usr/lib64/python3.6/site-packages/libvirt.py", line 4612, in nodeDeviceLookupByName
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service     if ret is None:raise libvirtError('virNodeDeviceLookupByName() failed', conn=self)
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_2265ec83_fa30_4a6c_82aa_e323e84745f3'
2020-10-21 08:29:36.126 1203554 ERROR oslo_service.service

Expected results:
Mdevs are automatically recreated

Additional info:
There is a related RFE for libvirt, https://bugzilla.redhat.com/show_bug.cgi?id=1699274, but this would require Nova to use mdevctl.
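The traceback shows the lookup failing because the mdev named in the guest configuration no longer exists on the host after the reboot. As an illustration only, the sketch below converts a libvirt node device name of the form seen in the log into a UUID and checks whether the kernel still exposes that mdev under `/sys/bus/mdev/devices`. The helper names `mdev_name_to_uuid` and `mdev_exists` are hypothetical and are not part of Nova. The parsing assumes the name starts with `mdev_` followed by the five UUID groups separated by underscores, and ignores anything after them.

```python
import os
import uuid

# Directory where the kernel lists existing mediated devices by UUID.
MDEV_SYSFS_ROOT = "/sys/bus/mdev/devices"


def mdev_name_to_uuid(dev_name):
    """Convert 'mdev_2265ec83_fa30_..._e323e84745f3' to a dashed UUID."""
    prefix = "mdev_"
    if not dev_name.startswith(prefix):
        raise ValueError("not an mdev node device name: %r" % dev_name)
    parts = dev_name[len(prefix):].split("_")
    if len(parts) < 5:
        raise ValueError("mdev name too short: %r" % dev_name)
    # uuid.UUID() validates the hex groups and normalises the format.
    return str(uuid.UUID("-".join(parts[:5])))


def mdev_exists(dev_name, sysfs_root=MDEV_SYSFS_ROOT):
    """Return True if the mdev behind dev_name is present under sysfs_root."""
    return os.path.exists(os.path.join(sysfs_root, mdev_name_to_uuid(dev_name)))


if __name__ == "__main__":
    name = "mdev_2265ec83_fa30_4a6c_82aa_e323e84745f3"
    print(mdev_name_to_uuid(name), mdev_exists(name))
```

A check like this only reports that the device is gone. It cannot say which parent PCI device the mdev was created on, which is the missing information this bug is about.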