Bug 1911710

Summary: [osp 13][neutron] cold migration fails for sriov instance with: Port update failed for port <uuid>: Unable to correlate PCI slot
Product: Red Hat OpenStack
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Reporter: Matt Flusche <mflusche>
Assignee: OSP DFG:Compute <osp-dfg-compute>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: alifshit, alink, bcafarel, bshephar, chrisw, dasmith, dhill, eglynn, hakhande, jhakimra, jparker, kchamart, osp-dfg-compute, ravsingh, sbauza, scohen, sgordon, smooney, ssigwald, stephenfin, vromanso
Keywords: Triaged, ZStream
Type: Bug
Last Closed: 2023-07-13 05:44:08 UTC

Description Matt Flusche 2020-12-30 20:04:35 UTC
Description of problem:

Cold migration fails for an SR-IOV instance when the VF port's PCI address is already in use on the destination host. It appears that the VF's PCI address is not recalculated on the destination during the migration.

Error from nova log:

2020-12-29 12:27:27.921 8 ERROR nova.compute.manager [req-XXX XXX XXX - default default] [instance: XXX] Setting instance vm_state to ERROR: PortUpdateFailed: Port update failed for port XXX: Unable to correlate PCI slot 0000:af:1c.5
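
The PCI slot in this error is the VF address recorded in the Neutron port's binding profile, which nova populated when the port was bound on the source host. A quick way to inspect the recorded slot (assuming the standard openstack client; $PORT_UUID is a placeholder for the affected port):

  openstack port show $PORT_UUID -c binding:profile
  # For an SR-IOV port this prints something like:
  #   {"pci_slot": "0000:af:1c.5", "pci_vendor_info": "...", "physical_network": "..."}

If that slot is already claimed by another instance on the destination host, the port update fails as above.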

Version-Release number of selected component (if applicable):
OSP 13

openstack-neutron-ml2-12.1.1-35.1.el7ost.noarch
openstack-neutron-openvswitch-12.1.1-35.1.el7ost.noarch
openstack-neutron-12.1.1-35.1.el7ost.noarch
openstack-neutron-fwaas-12.0.2-1.el7ost.noarch
openstack-neutron-common-12.1.1-35.1.el7ost.noarch
openstack-neutron-lbaas-12.0.1-0.20190803015156.b86fcef.el7ost.noarch

openstack-nova-common-17.0.13-27.el7ost.noarch

How reproducible:
100% in this environment, where the destination VF PCI address is already in use.

Steps to Reproduce:
1. Stop the VM.
2. Cold migrate the VM (a command sketch follows below).
3. The migration fails and the instance is set to ERROR with PortUpdateFailed.
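
A minimal command sketch of these steps (assuming the standard openstack client; $VM is a placeholder for the instance name or UUID):

  # stop the instance, then request a cold migration
  openstack server stop $VM
  openstack server migrate $VM
  # in the failing case the instance lands in ERROR, and
  # 'openstack server show $VM' carries the PortUpdateFailed fault
  openstack server show $VM -c status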


Additional info:

Perhaps related to the following, but I wanted to open a new BZ to verify:

https://bugzilla.redhat.com/show_bug.cgi?id=1767797
https://bugzilla.redhat.com/show_bug.cgi?id=1852110

I'll provide environment-specific details and logs as private attachments for review.

Comment 10 Stephen Finucane 2021-01-05 10:57:21 UTC
This does look like a variant of https://bugzilla.redhat.com/show_bug.cgi?id=1767797. Would it be possible to identify if any instances have been unshelved to the affected host recently? You can validate this by running 'openstack server event list $SERVER_UUID' for each instance on the host. You can identify the instances on the host by running 'openstack server list --host $HOST'.
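
A shell sketch combining those two checks (assuming admin credentials; $HOST is a placeholder, and --all-projects is added here so instances from every tenant are covered):

  # list instances on the suspected host, then dump each one's action history
  for uuid in $(openstack server list --host $HOST --all-projects -f value -c ID); do
      echo "=== $uuid ==="
      openstack server event list $uuid
  done
  # a recent 'unshelve' action in any listing points at this scenario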

Comment 32 Artom Lifshitz 2023-07-13 05:44:08 UTC
From comment #19:

> So as we suspected, the reshelving of the instance broke the pci device mapping. We're going to work on this bug to prevent this kind of situation from happening in the future.

This confirms that this BZ is a duplicate of 1767797, which is fixed in 17.0 (and due to backport complexity cannot be fixed in earlier releases).

*** This bug has been marked as a duplicate of bug 1767797 ***