Bug 1911710 - [osp 13][neutron] cold migration fails for sriov instance with: Port update failed for port <uuid>c: Unable to correlate PCI slot
Keywords:
Status: CLOSED DUPLICATE of bug 1767797
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-30 20:04 UTC by Matt Flusche
Modified: 2023-08-08 15:23 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-13 05:44:08 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-1756 0 None None None 2021-11-23 20:24:50 UTC

Description Matt Flusche 2020-12-30 20:04:35 UTC
Description of problem:

Cold migration fails for an SR-IOV instance when the VF port's PCI location is already in use on the destination host. It seems that the VF location is not recalculated on the destination during migration.

Error from nova log:

2020-12-29 12:27:27.921 8 ERROR nova.compute.manager [req-XXX XXX XXX - default default] [instance: XXX ] Setting instance vm_state to ERROR: PortUpdateFailed: Port update failed for port XXX: Unable to correlate PCI slot 0000:af:1c.5
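
When triaging several compute logs, it can help to pull the port ID and PCI address out of errors like the one above. The following is a minimal sketch, not part of nova: the function name and the regular expression are illustrative assumptions based only on the message text quoted in this report, and the expression expects the address on a single line in domain:bus:device.function form.

```python
import re

# Assumed message shape, taken from the error quoted in this report:
#   "Port update failed for port <id>: Unable to correlate PCI slot <dddd:bb:dd.f>"
PCI_ERROR_RE = re.compile(
    r"Port update failed for port (?P<port>\S+): "
    r"Unable to correlate PCI slot "
    r"(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def parse_pci_correlation_error(line):
    """Return (port_id, pci_address) if the line holds the error, else None."""
    match = PCI_ERROR_RE.search(line)
    if match is None:
        return None
    return match.group("port"), match.group("pci")


# Sample line built from the redacted log entry in this report.
SAMPLE = (
    "2020-12-29 12:27:27.921 8 ERROR nova.compute.manager [req-XXX XXX XXX - "
    "default default] [instance: XXX ] Setting instance vm_state to ERROR: "
    "PortUpdateFailed: Port update failed for port XXX: "
    "Unable to correlate PCI slot 0000:af:1c.5"
)
```

Calling parse_pci_correlation_error(SAMPLE) yields the redacted port ID and the PCI address, which can then be compared against the VFs already allocated on the destination host.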

Version-Release number of selected component (if applicable):
OSP 13

openstack-neutron-ml2-12.1.1-35.1.el7ost.noarch
openstack-neutron-openvswitch-12.1.1-35.1.el7ost.noarch
openstack-neutron-12.1.1-35.1.el7ost.noarch
openstack-neutron-fwaas-12.0.2-1.el7ost.noarch
openstack-neutron-common-12.1.1-35.1.el7ost.noarch
openstack-neutron-lbaas-12.0.1-0.20190803015156.b86fcef.el7ost.noarch

openstack-nova-common-17.0.13-27.el7ost.noarch

How reproducible:
100% in this environment, where the destination VF PCI location is in use.

Steps to Reproduce:
1. Stop the VM.
2. Cold migrate the VM.


Additional info:

Perhaps related to these but wanted to open a new bz to verify:

https://bugzilla.redhat.com/show_bug.cgi?id=1767797
https://bugzilla.redhat.com/show_bug.cgi?id=1852110

I'll provide env specific details and logs as private attachments for review.

Comment 10 Stephen Finucane 2021-01-05 10:57:21 UTC
This does look like a variant of https://bugzilla.redhat.com/show_bug.cgi?id=1767797. Would it be possible to identify if any instances have been unshelved to the affected host recently? You can validate this by running 'openstack server event list $SERVER_UUID' for each instance on the host. You can identify the instances on the host by running 'openstack server list --host $HOST'.
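
The check described in the comment above could be scripted roughly as follows. This is a sketch under stated assumptions rather than a supported tool: it assumes the openstack client is installed and admin credentials are loaded, that the JSON output of these commands uses the column names "ID" and "Action", and that an unshelve appears in the event list with the action name "unshelve". The helper names are hypothetical.

```python
import json
import subprocess


def run_openstack_json(*args):
    """Run an openstack CLI command with '-f json' and return the parsed output."""
    result = subprocess.run(
        ["openstack", *args, "-f", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def unshelve_events(events):
    """Keep only the events whose action is 'unshelve' (assumed action name)."""
    return [event for event in events if event.get("Action") == "unshelve"]


def unshelved_servers_on_host(host):
    """Map server ID to its unshelve events for every server on the given host."""
    hits = {}
    # Same command as in the comment: openstack server list --host $HOST
    for server in run_openstack_json("server", "list", "--host", host):
        server_id = server["ID"]  # assumed JSON column name
        # Same command as in the comment: openstack server event list $SERVER_UUID
        events = run_openstack_json("server", "event", "list", server_id)
        matched = unshelve_events(events)
        if matched:
            hits[server_id] = matched
    return hits
```

Running unshelved_servers_on_host with the affected compute host name returns only the instances that have an unshelve event, along with the matching event records so that their timestamps can be compared with the failure time.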

Comment 32 Artom Lifshitz 2023-07-13 05:44:08 UTC
From comment #19:

> So as we suspected, the reshelving of the instance broke the pci device mapping. We're going to work on this bug to prevent this kind of situation from happening in the future.

This confirms that this BZ is a duplicate of 1767797, which is fixed in 17.0 (and due to backport complexity cannot be fixed in earlier releases).

*** This bug has been marked as a duplicate of bug 1767797 ***

