Bug 1847924
Summary: | [SRIOV - Cold Migration] some issues after a server with a VF port is migrated | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Eduardo Olivares <eolivare>
Component: | openstack-nova | Assignee: | smooney
Status: | CLOSED ERRATA | QA Contact: | OSP DFG:Compute <osp-dfg-compute>
Severity: | high | Priority: | high
Version: | 16.1 (Train) | Target Release: | 16.1 (Train on RHEL 8.2)
Target Milestone: | ga | Keywords: | Patch, Regression, Triaged
Hardware: | Unspecified | OS: | Unspecified
Fixed In Version: | openstack-nova-20.3.1-0.20200626213433.38ee1f3.el8ost | Type: | Bug
Last Closed: | 2020-07-29 07:53:29 UTC | Bug Blocks: | 1666684
CC: | atragler, berrange, dasmith, egallen, eglynn, hakhande, jhakimra, jparker, kchamart, oblaut, sbauza, sclewis, sgordon, smooney, spower, stephenfin, vromanso | |
Description
Eduardo Olivares
2020-06-17 11:05:49 UTC
I will need to look into this more, but I think this is related to the fact that you are trying to use a single interface for PF passthrough after it has been used for VF passthrough. In step 3 you would have to wait for the VM to be fully deleted before checking the DB to see the correct value, but assuming you did, the available VFs should have been reset. There is, however, a delay in udev notifying libvirt that the interface is available again, so perhaps that could be a factor. I'll take a look at the logs and we can triage this tomorrow.

The behavior in step 4 is expected and correct if the nova DB still reports one of the VFs as in use, so if there is a bug here it is only the bug in step 3; step 4 is correct. We do not allow the PF to be assigned if any of its VFs are marked as in use.

Just quickly looking at the nova log, it is littered with "libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp7s0f3v0_2a_a8_d8_cd_13_c6'" messages, so I suspect there is some issue with either the VF name changing, the VF not being bound back to the NIC driver, or us getting stale data from libvirt due to how it caches node devices and how it expects to be updated via udev.

Specifically, this is what the error looks like:

```
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager [req-74b38a2e-6269-4696-8c54-e897e7a3baa4 - - - - -] Error updating resources for node computesriov-1.localdomain.: libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp7s0f3v1_ea_60_77_1f_21_50'
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager Traceback (most recent call last):
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 8740, in _update_available_resource_for_node
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     startup=startup)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 871, in update_available_resource
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8034, in get_available_resource
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6795, in _get_pci_passthrough_devices
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     pci_info.append(self._get_pcidev_info(name))
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6756, in _get_pcidev_info
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     device.update(_get_device_capabilities(device, address))
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6727, in _get_device_capabilities
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     pcinet_info = self._get_pcinet_info(address)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6670, in _get_pcinet_info
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     virtdev = self._host.device_lookup_by_name(devname)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/host.py", line 1147, in device_lookup_by_name
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     return self.get_connection().nodeDeviceLookupByName(name)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 190, in doit
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     result = proxy_call(self._autowrap, f, *args, **kwargs)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 148, in proxy_call
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     rv = execute(f, *args, **kwargs)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 129, in execute
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     six.reraise(c, e, tb)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     raise value
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 83, in tworker
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     rv = meth(*args, **kwargs)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager   File "/usr/lib64/python3.6/site-packages/libvirt.py", line 4612, in nodeDeviceLookupByName
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager     if ret is None:raise libvirtError('virNodeDeviceLookupByName() failed', conn=self)
2020-06-17 10:36:56.563 7 ERROR nova.compute.manager libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp7s0f3v1_ea_60_77_1f_21_50'
```
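The call at the bottom of that traceback is easy to exercise in isolation. Below is a minimal standalone sketch (not nova code) using the libvirt Python bindings; it assumes a local 'qemu:///system' hypervisor, and the device name is the one from the log above:

```python
import libvirt

# Example values; substitute the hypervisor URI and the device name
# reported in nova-compute.log on the affected compute node.
conn = libvirt.open('qemu:///system')
devname = 'net_enp7s0f3v1_ea_60_77_1f_21_50'

try:
    dev = conn.nodeDeviceLookupByName(devname)
    print(dev.XMLDesc(0))
except libvirt.libvirtError as ex:
    # This is the failure mode in the traceback above: libvirt's cached
    # node-device list no longer contains a matching name, even though
    # the interface still exists on the host.
    print('lookup failed: %s' % ex)
finally:
    conn.close()
```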
"/usr/lib/python3.6/site-packages/nova/virt/libvirt/host.py", line 1147, in device_lookup_by_name 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager return self.get_connection().nodeDeviceLookupByName(name) 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 190, in doit 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager result = proxy_call(self._autowrap, f, *args, **kwargs) 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 148, in proxy_call 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager rv = execute(f, *args, **kwargs) 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 129, in execute 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager six.reraise(c, e, tb) 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager raise value 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager File "/usr/lib/python3.6/site-packages/eventlet/tpool.py", line 83, in tworker 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager rv = meth(*args, **kwargs) 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager File "/usr/lib64/python3.6/site-packages/libvirt.py", line 4612, in nodeDeviceLookupByName 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager if ret is None:raise libvirtError('virNodeDeviceLookupByName() failed', conn=self) 2020-06-17 10:36:56.563 7 ERROR nova.compute.manager libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp7s0f3v1_ea_60_77_1f_21_50' since its failing in the device lookup from libvirt we are not going to be able to update the resouce tracker properly which is likely why nova thinks the pci device is in use. can you provide a dump of the pci_devices table for both hosts so we can see the pci whitelist appears to be passthrough_whitelist={"devname":"enp6s0f3","physical_network":"datacentre","trusted":"true"} passthrough_whitelist={"devname":"enp7s0f3","physical_network":"datacentre","trusted":"true"} which makes sence based on teh the name 'net_enp7s0f3v1_ea_60_77_1f_21_50 looking at the ip link output net_enp7s0f3v1_ea_60_77_1f_21_50 is present with the name and mac libvirt is expecting 205: enp7s0f3v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ea:60:77:1f:21:50 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9702 addrgenmode none numtxqueues 4 numrxqueues 4 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 86379 451 0 0 0 0 TX: bytes packets errors dropped carrier collsns 26698 169 0 0 0 0 so it looks like libvirt is not in sync with the current system state and is returning cached data. There's a lot of information in this bug, including references to additional bugs with the SR-IOV functionality. To summarize *this* bug, nova is doing a lookup of an SR-IOV VIF using libvirt's 'device_lookup_by_name' API and this call is failing, resulting in knock on effects elsewhere. The 'device_lookup_by_name' API expects a unique device name, which for network devices takes the format 'net_{ifname}_{mac}'. For some reason, the MAC address that nova thinks the NIC has is getting out-of-sync with libvirt's cached list of devices, causing the "Node device not found" error messages described in comment 3 above. 
The solution to this is likely to be in two parts: a short-term fix to simply ignore any errors raised during the lookup, since the lookup is part of code that's non-critical (sketched at the end of this report), and a second, more expansive fix to change the API we're using and look up the device by hardware ID instead, avoiding the caching issue.

The minimal fix is now merged on master and a downstream backport is in flight. It will merge once tempest has completed and we make a minor change to the commit message later today.

Verified on RHOS-16.1-RHEL-8-20200714.n.0, with openstack-nova-common-20.3.1-0.20200626213433.38ee1f3.el8ost.noarch. Following the reproduction steps from this bug's description, the issue is not reproduced:

1) Create VM with VF port - OK
2) Migrate that VM - OK (tested both cold and live migration)
3) Remove server and VF ports previously created - OK; the pci_stats counters from the nova DB are correct, showing all the available VF and PF ports from the computes.
4) Create two VMs with PF ports - VMs are created successfully.

The following WARNING message is printed in nova-compute.log frequently:

```
2020-07-16 18:21:16.803 7 WARNING nova.virt.libvirt.driver [req-f139bfa9-21bb-423a-ba37-86ce58fe1d33 - - - - -] Node device not found: no node device with matching name 'net_enp6s0f3v1_e6_aa_8c_ef_0d_ce': libvirt.libvirtError: Node device not found: no node device with matching name 'net_enp6s0f3v1_e6_aa_8c_ef_0d_ce'
```

However, no related harmful effect has been detected, so this can be considered a minor issue that could be resolved separately.

If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148
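As a closing illustration of the short-term fix referenced above, a minimal sketch of its shape (a hypothetical standalone helper, not the actual merged nova patch): the lookup error is reduced to a warning and the periodic resource-tracker update carries on, which matches the behaviour seen in the verified build.

```python
import libvirt

def lookup_net_device(conn, devname):
    """Best-effort node-device lookup.

    Returns None instead of propagating libvirt.libvirtError when the
    device is missing from libvirt's cached node-device list, so a
    stale entry no longer aborts the whole resource update.
    """
    try:
        return conn.nodeDeviceLookupByName(devname)
    except libvirt.libvirtError as ex:
        # Log-and-continue: in the verified build this surfaces as the
        # WARNING quoted above rather than an ERROR traceback.
        print('WARNING: Node device not found: %s' % ex)
        return None
```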