Description of problem:
Cold migration fails for an SR-IOV instance when the VF port's PCI location is already in use on the destination host. It seems that the VF location is not recalculated on the destination during migration.
Error from nova log:
2020-12-29 12:27:27.921 8 ERROR nova.compute.manager [req-XXX XXX XXX - default default] [instance: XXX] Setting instance vm_state to ERROR: PortUpdateFailed: Port update failed for port XXX: Unable to correlate PCI slot 0000:af:1c.5
Version-Release number of selected component (if applicable):
OSP 13
openstack-neutron-ml2-12.1.1-35.1.el7ost.noarch
openstack-neutron-openvswitch-12.1.1-35.1.el7ost.noarch
openstack-neutron-12.1.1-35.1.el7ost.noarch
openstack-neutron-fwaas-12.0.2-1.el7ost.noarch
openstack-neutron-common-12.1.1-35.1.el7ost.noarch
openstack-neutron-lbaas-12.0.1-0.20190803015156.b86fcef.el7ost.noarch
openstack-nova-common-17.0.13-27.el7ost.noarch
How reproducible:
100% in this environment, where the destination VF PCI location is in use.
Steps to Reproduce:
1. stop VM
2. cold migrate VM
Additional info:
Perhaps related to these but wanted to open a new bz to verify:
https://bugzilla.redhat.com/show_bug.cgi?id=1767797
https://bugzilla.redhat.com/show_bug.cgi?id=1852110
I'll provide env specific details and logs as private attachments for review.
Comment 10 Stephen Finucane 2021-01-05 10:57:21 UTC
This does look like a variant of https://bugzilla.redhat.com/show_bug.cgi?id=1767797. Would it be possible to identify if any instances have been unshelved to the affected host recently? You can validate this by running 'openstack server event list $SERVER_UUID' for each instance on the host. You can identify the instances on the host by running 'openstack server list --host $HOST'.
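The two commands above can be combined into one pass over the host. The following is only a rough sketch: the function name list_unshelved_servers is made up for illustration, it assumes admin credentials are already loaded in the shell, and it assumes the event list reports the action as the literal string "unshelve". Depending on the deployment, '--all-projects' may also be needed on the server list call.

```shell
#!/bin/sh
# Sketch only: print the UUID of every server on a host that has an
# "unshelve" action in its event history.
# Assumptions: the 'openstack' CLI is on PATH, admin credentials are
# sourced, and the function name below is hypothetical.
list_unshelved_servers() {
    host="$1"
    # '-f value -c ID' prints one server UUID per line.
    openstack server list --host "$host" -f value -c ID |
    while read -r server_id; do
        # '-f value -c Action' prints one action name per line.
        # stdin is redirected so the inner call cannot consume the UUID list.
        if openstack server event list "$server_id" -f value -c Action \
            </dev/null | grep -qx 'unshelve'; then
            echo "$server_id"
        fi
    done
}

# Usage (HOST is the affected compute host):
#   list_unshelved_servers "$HOST"
```

Any UUID printed is an instance that was unshelved at some point; the event timestamps still need to be checked by hand to see whether the unshelve onto this host was recent.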
From comment #19:
> So as we suspected, the reshelving of the instance broke the pci device mapping. We're going to work on this bug to prevent this kind of situation from happening in the future.
This confirms that this BZ is a duplicate of 1767797, which is fixed in 17.0 (and due to backport complexity cannot be fixed in earlier releases).
*** This bug has been marked as a duplicate of bug 1767797 ***