Description of problem:
Cold migration fails for an SR-IOV instance when the VF port's PCI location is already in use on the destination host. It seems that the VF location is not recalculated on the destination during migration.

Error from nova log:
2020-12-29 12:27:27.921 8 ERROR nova.compute.manager [req-XXX XXX XXX - default default] [instance: XXX ] Setting instance vm_state to ERROR: PortUpdateFailed: Port update failed for port XXX: Unable to correlate PCI slot 00 00:af:1c.5

Version-Release number of selected component (if applicable):
OSP 13
openstack-neutron-ml2-12.1.1-35.1.el7ost.noarch
openstack-neutron-openvswitch-12.1.1-35.1.el7ost.noarch
openstack-neutron-12.1.1-35.1.el7ost.noarch
openstack-neutron-fwaas-12.0.2-1.el7ost.noarch
openstack-neutron-common-12.1.1-35.1.el7ost.noarch
openstack-neutron-lbaas-12.0.1-0.20190803015156.b86fcef.el7ost.noarch
openstack-nova-common-17.0.13-27.el7ost.noarch

How reproducible:
100% in this environment, where the destination VF PCI location is in use.

Steps to Reproduce:
1. Stop the VM.
2. Cold migrate the VM.

Additional info:
Perhaps related to these, but I wanted to open a new BZ to verify:
https://bugzilla.redhat.com/show_bug.cgi?id=1767797
https://bugzilla.redhat.com/show_bug.cgi?id=1852110

I'll provide environment-specific details and logs as private attachments for review.
This does look like a variant of https://bugzilla.redhat.com/show_bug.cgi?id=1767797. Would it be possible to check whether any instances have been unshelved to the affected host recently? You can list the instances on the host by running 'openstack server list --host $HOST', and then check each one for an unshelve event by running 'openstack server event list $SERVER_UUID'.
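If there are many instances on the host, the check above could be scripted. The following is only a rough sketch, not a tested tool. It assumes the openstack client is installed and authenticated, that the '-f json' output of both commands uses the column names "ID" and "Action", and that an unshelve is recorded with the action name "unshelve". The function names are hypothetical.

```python
import json
import subprocess


def unshelve_events(events):
    """Return the events whose action is 'unshelve' (case-insensitive).

    `events` is a list of dicts, assumed to look like the output of
    'openstack server event list $SERVER_UUID -f json'.
    """
    return [e for e in events if str(e.get("Action", "")).lower() == "unshelve"]


def run_json(args):
    """Run an openstack CLI command with JSON output and parse the result."""
    out = subprocess.run(
        ["openstack", *args, "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def unshelved_servers_on_host(host):
    """Map server ID -> unshelve events for every instance on `host`."""
    result = {}
    for server in run_json(["server", "list", "--host", host]):
        # "ID" is the assumed column name in the JSON server listing.
        events = run_json(["server", "event", "list", server["ID"]])
        hits = unshelve_events(events)
        if hits:
            result[server["ID"]] = hits
    return result
```

Calling unshelved_servers_on_host("$HOST") with the affected compute host name would return only the instances that have at least one unshelve event, together with the matching event rows, so the start times can be compared with the time of the failed migration.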
From comment #19:

> So as we suspected, the reshelving of the instance broke the pci device mapping. We're going to work on this bug to prevent this kind of situation from happening in the future.

This confirms that this BZ is a duplicate of 1767797, which is fixed in 17.0 (and, due to backport complexity, cannot be fixed in earlier releases).

*** This bug has been marked as a duplicate of bug 1767797 ***