Basically, one host was down; they ran `nova host-evacuate` a few times and hit multiple errors. This is the current error:

~~~
Error: Port update failed for port 23e37c39-9ae7-481f-9bcf-7c725f7ae3b4: Unable to correlate PCI slot 0000:3b:00.3
~~~

[1] Full traceback

Here we can see the targeted PCI addresses:

~~~
$ mysql -N -s -h 172.18.0.97 -u root --password=q1w2e3 -D nova -e "select migration_context from instances a left join instance_extra b on a.uuid = b.instance_uuid where a.uuid = '8c3dc669-2032-4d46-9011-7d6bb14fc1c8'" | sed 's/\\\\/\\/g' | jq -C '."nova_object.data"."new_pci_devices"."nova_object.data".objects[]."nova_object.data" | .address'
"0000:3b:01.6"
"0000:3b:00.6"
"0000:3b:01.5"
"0000:3b:00.5"
"0000:3b:02.1"
"0000:3b:01.1"
"0000:3b:02.0"
"0000:3b:01.0"
"0000:3b:01.7"
"0000:3b:00.7"
~~~

And these are the original PCI addresses:

~~~
$ mysql -N -s -h 172.18.0.97 -u root --password=q1w2e3 -D nova -e "select migration_context from instances a left join instance_extra b on a.uuid = b.instance_uuid where a.uuid = '8c3dc669-2032-4d46-9011-7d6bb14fc1c8'" | sed 's/\\\\/\\/g' | jq -C '."nova_object.data"."old_pci_devices"."nova_object.data".objects[]."nova_object.data" | .address'
"0000:3b:00.5"
"0000:3b:00.6"
"0000:3b:00.7"
"0000:3b:01.0"
"0000:3b:01.1"
"0000:3b:01.5"
"0000:3b:01.6"
"0000:3b:01.7"
"0000:3b:02.0"
"0000:3b:02.1"
~~~

The customer confirmed that they didn't touch that port, except for binding_host, so I'm wondering where that 0000:3b:00.3 comes from. They made multiple attempts at evacuating the host, so that PCI address might come from one of those attempts.
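For what it's worth, the two address lists can be compared mechanically to confirm the stray slot is in neither set. A minimal bash sketch, with the lists hard-coded from the jq output above (against a live deployment you would capture the mysql/jq output instead):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: check whether the failing slot appears in either
# PCI address list from the instance's migration_context. The lists are
# hard-coded here from the jq output above.
old_addrs="0000:3b:00.5 0000:3b:00.6 0000:3b:00.7 0000:3b:01.0 0000:3b:01.1
0000:3b:01.5 0000:3b:01.6 0000:3b:01.7 0000:3b:02.0 0000:3b:02.1"
new_addrs="0000:3b:01.6 0000:3b:00.6 0000:3b:01.5 0000:3b:00.5 0000:3b:02.1
0000:3b:01.1 0000:3b:02.0 0000:3b:01.0 0000:3b:01.7 0000:3b:00.7"

slot="0000:3b:00.3"
# Put one address per line, then look for an exact line match.
if printf '%s %s' "$old_addrs" "$new_addrs" | tr ' ' '\n' | grep -qx "$slot"; then
  echo "$slot is tracked in the migration_context"
else
  echo "$slot is in neither old_pci_devices nor new_pci_devices"
fi
```

Running this with the data above prints the "in neither" message, which matches what the error suggests: the slot in the port's binding profile is not tracked by the migration_context at all.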
Versions: openstack-nova-common-17.0.12-1.el7ost.noarch

[1]
~~~
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server [req-ca7443aa-1e20-49cb-a531-94f449cae6ff af0945d62ee34837b02cf42df3d7b157 40eb0ee055d343dab188003e2593beb6 - default default] Exception during message handling: PortUpdateFailed: Port update failed for port 23e37c39-9ae7-481f-9bcf-7c725f7ae3b4: Unable to correlate PCI slot 0000:3b:00.3
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 229, in inner
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 76, in wrapped
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     function_name, call_dict, binary)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     self.force_reraise()
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 67, in wrapped
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 187, in decorated_function
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     "Error: %s", e, instance=instance)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     self.force_reraise()
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 157, in decorated_function
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 1021, in decorated_function
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 215, in decorated_function
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     kwargs['instance'], e, sys.exc_info())
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     self.force_reraise()
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 203, in decorated_function
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2958, in rebuild_instance
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     migration, request_spec)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3020, in _do_rebuild_instance_with_claim
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     self._do_rebuild_instance(*args, **kwargs)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3114, in _do_rebuild_instance
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     context, instance, self.host, migration)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 2819, in setup_instance_network_on_host
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     migration)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 2890, in _update_port_binding_for_instance
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server     pci_slot)
2020-06-28 20:12:19.812 8 ERROR oslo_messaging.rpc.server PortUpdateFailed: Port update failed for port 23e37c39-9ae7-481f-9bcf-7c725f7ae3b4: Unable to correlate PCI slot 0000:3b:00.3
~~~
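To make sense of the final frame: as far as I can tell from the traceback, `_update_port_binding_for_instance` builds a mapping from the old PCI addresses in the migration_context to the new ones, then looks up the `pci_slot` stored in the port's binding profile; if the slot is not a key in that map, it raises PortUpdateFailed. A simplified sketch of that lookup (an assumption from reading the traceback, not Nova's literal code; the sample list is truncated):

```shell
#!/usr/bin/env bash
# Simplified sketch of the failing correlation step. The port's binding
# profile carries a pci_slot that must match one of the addresses in
# old_pci_devices; here it doesn't, so the lookup fails.
old_pci_devices="0000:3b:00.5 0000:3b:00.6 0000:3b:00.7"  # truncated sample
port_slot="0000:3b:00.3"  # pci_slot from the port's binding profile

case " $old_pci_devices " in
  *" $port_slot "*) echo "correlated: $port_slot" ;;
  *) echo "Unable to correlate PCI slot $port_slot" ;;
esac
```

So the error is not about the device being busy; it means the binding profile points at a slot Nova no longer has any record of claiming for this instance.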
I'm not sure how it looked before the host evacuation, but we now have multiple ports using the same pci_addresses [1][2]. They manually updated the binding_profile of one of the instances with free pci_addresses; the instance launched and everything was working, except that the ports were stuck in BUILD status [3] and the pci_devices table isn't in sync with the port bindings [4]. They updated the binding host for these ports, the ports flipped to ACTIVE immediately, and everything works for this instance.

[1]
~~~
$ mysql -t -h 172.18.0.97 -u root --password=q1w2e3 -D ovs_neutron -e "select a.id,a.status,a.admin_state_up,b.host,b.profile,vif_details,a.network_id from ports a left join ml2_port_bindings b on a.id = b.port_id where a.device_id = '8c3dc669-2032-4d46-9011-7d6bb14fc1c8' order by profile desc;"
| id                                   | status | admin_state_up | host                        | profile                                                                                                          | vif_details                            | network_id                           |
| 1afbe85e-2d9d-4dd3-8529-5f3ae90e2061 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:02.1", "physical_network": "provider2", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "2000"} | e6693c26-791e-4d84-aa4f-c2661ce92d13 |
| 384d021d-ee74-4494-901a-f3cfd7bc5e55 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:02.1", "physical_network": "provider2", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "3700"} | 42cc718c-18a4-421b-a89a-3d590a3461fd |
| 05519895-011b-407c-80ec-6c3ce5740db9 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:02.1", "physical_network": "provider2", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "2000"} | e6693c26-791e-4d84-aa4f-c2661ce92d13 |
| 2e68fa9e-b912-4161-b3d1-3c63f23c0de4 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:01.3", "physical_network": "provider2", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "2000"} | e6693c26-791e-4d84-aa4f-c2661ce92d13 |
| a5741ad6-1585-4346-88ac-7cf0d9b45045 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:01.3", "physical_network": "provider2", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "3701"} | 1e2f9790-7c47-4f35-a493-5be2af81725d |
| 2cf9de30-f198-4d68-9057-c5abc12013d2 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:01.0", "physical_network": "provider4", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "2000"} | 14dbac04-af0c-44f8-a32e-7608d4dd9ab4 |
| e92d6dde-b57f-411d-819d-56791018e990 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:01.0", "physical_network": "provider4", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "3702"} | fbafa52a-9b4e-477e-9f03-093cbd2079f3 |
| 23e37c39-9ae7-481f-9bcf-7c725f7ae3b4 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:00.3", "physical_network": "provider4", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "2000"} | 14dbac04-af0c-44f8-a32e-7608d4dd9ab4 |
| 42a05ae8-31dc-4bf3-bc14-3110c3c9e6ba | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:00.3", "physical_network": "provider4", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "3700"} | 445cb50a-61e5-407c-a78a-fc8b6b62330e |
| 42d5c179-e964-41ef-baf5-2a3571d0a875 | DOWN   | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:00.3", "physical_network": "provider4", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "2000"} | 14dbac04-af0c-44f8-a32e-7608d4dd9ab4 |
~~~

[2]
~~~
$ mysql -N -s -h 172.18.0.97 -u root --password=q1w2e3 -D nova_api -e "select spec from request_specs where instance_uuid = '8c3dc669-2032-4d46-9011-7d6bb14fc1c8'" | sed 's/\\\\/\\/g' | jq -C '."nova_object.data".pci_requests."nova_object.data".requests[] | select(."nova_object.name"=="InstancePCIRequest") | ."nova_object.data".spec[].physical_network'
"provider2"
"provider4"
"provider2"
"provider4"
"provider2"
"provider4"
"provider2"
"provider4"
"provider2"
"provider4"
~~~

[3]
~~~
$ mysql -t -h 172.18.0.97 -u root --password=q1w2e3 -D ovs_neutron -e "select a.id,a.status,a.admin_state_up,b.host,b.profile,vif_details,a.network_id from ports a left join ml2_port_bindings b on a.id = b.port_id where a.device_id = 'e1c2cd11-7acb-4147-9ec9-ce266620cf38' order by profile desc;" | sed 's/\.oss.timbrasil.com.br/.customer.com/g'
| id                                   | status | admin_state_up | host                        | profile                                                                                                          | vif_details                            | network_id                           |
| 8c18b2aa-a64f-4bfe-bc12-d75d37039f0a | BUILD  | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:02.1", "physical_network": "provider2", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "2001"} | f9934f0a-1434-4d91-aca0-54f3647e8c33 |
| 1c3ac913-820f-4272-8c2f-b68a90226d1b | BUILD  | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:02.0", "physical_network": "provider2", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "3703"} | 63da6a54-a50a-4ead-9fcb-683928110aa2 |
| e2099bd9-83f9-4a0a-95e7-90b4559f06e2 | BUILD  | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:01.7", "physical_network": "provider2", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "2001"} | f9934f0a-1434-4d91-aca0-54f3647e8c33 |
| d998f358-8a5e-4e4d-bdfa-ce6a3fc98971 | BUILD  | 1 | lab01csrkhw012.customer.com | {"pci_slot": "0000:3b:00.7", "physical_network": "provider4", "trusted": "true", "pci_vendor_info": "15b3:1016"} | {"port_filter": false, "vlan": "3703"} | 5763db35-5728-4549-839f-4ffd4563ad95 |
~~~

[4]
~~~
select * from pci_devices where instance_uuid = 'e1c2cd11-7acb-4147-9ec9-ce266620cf38';
| created_at | updated_at | deleted_at | deleted | id | compute_node_id | address | product_id | vendor_id | dev_type | dev_id | label | status | extra_info | instance_uuid | request_id | numa_node | parent_addr | uuid |
| 2020-04-01 05:12:32 | 2020-06-26 21:09:51 | NULL | 0 | 1486 | 49 | 0000:3b:00.4 | 1016 | 15b3 | type-VF | pci_0000_3b_00_4 | label_15b3_1016 | allocated | {"capabilities": "{\"network\": [\"rx\", \"tx\", \"sg\", \"tso\", \"gso\", \"gro\", \"rxvlan\", \"txvlan\", \"rxhash\", \"rdma\"]}"} | e1c2cd11-7acb-4147-9ec9-ce266620cf38 | cae21ab4-a126-48a4-be8d-d7556d274986 | 0 | 0000:3b:00.0 | 81568c2b-50ff-4d23-a9bd-e3732d569baf |
| 2020-04-01 05:12:32 | 2020-06-26 21:09:51 | NULL | 0 | 1504 | 49 | 0000:3b:01.2 | 1016 | 15b3 | type-VF | pci_0000_3b_01_2 | label_15b3_1016 | allocated | {"capabilities": "{\"network\": [\"rx\", \"tx\", \"sg\", \"tso\", \"gso\", \"gro\", \"rxvlan\", \"txvlan\", \"rxhash\", \"rdma\"]}"} | e1c2cd11-7acb-4147-9ec9-ce266620cf38 | 38a20c4e-568b-483a-8eaa-d91fc10d0365 | 0 | 0000:3b:00.1 | e340903f-2162-49ab-a15b-e99b25db44b5 |
| 2020-04-01 05:12:32 | 2020-06-26 21:09:51 | NULL | 0 | 1507 | 49 | 0000:3b:01.3 | 1016 | 15b3 | type-VF | pci_0000_3b_01_3 | label_15b3_1016 | allocated | {"capabilities": "{\"network\": [\"rx\", \"tx\", \"sg\", \"tso\", \"gso\", \"gro\", \"rxvlan\", \"txvlan\", \"rxhash\", \"rdma\"]}"} | e1c2cd11-7acb-4147-9ec9-ce266620cf38 | 0eab77ea-5ba7-4b29-9e02-1d56d1716397 | 0 | 0000:3b:00.1 | 7c8c5699-f4eb-4fdd-887c-f115475fa75c |
| 2020-04-01 05:12:32 | 2020-06-26 21:09:51 | NULL | 0 | 1510 | 49 | 0000:3b:01.4 | 1016 | 15b3 | type-VF | pci_0000_3b_01_4 | label_15b3_1016 | allocated | {"capabilities": "{\"network\": [\"rx\", \"tx\", \"sg\", \"tso\", \"gso\", \"gro\", \"rxvlan\", \"txvlan\", \"rxhash\", \"rdma\"]}"} | e1c2cd11-7acb-4147-9ec9-ce266620cf38 | f26fc071-3e8a-4438-9e58-d4dbbd01568e | 0 | 0000:3b:00.1 | 2cdfb40f-d794-4347-8af9-e6e9fe6f163a |
~~~
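The duplicated bindings in [1] can also be found mechanically: extract the pci_slot from each binding profile and look for slots claimed by more than one port on the same host. A sketch, with the port-id/pci_slot pairs abbreviated from table [1] (in a live environment they would come from the ml2_port_bindings query above):

```shell
#!/usr/bin/env bash
# Sketch: list pci_slots that appear in more than one port binding.
# "port-id pci_slot" pairs below are abbreviated from table [1].
bindings="1afbe85e 0000:3b:02.1
384d021d 0000:3b:02.1
05519895 0000:3b:02.1
2e68fa9e 0000:3b:01.3
a5741ad6 0000:3b:01.3
2cf9de30 0000:3b:01.0
e92d6dde 0000:3b:01.0
23e37c39 0000:3b:00.3
42a05ae8 0000:3b:00.3
42d5c179 0000:3b:00.3"

# Every slot printed is bound to multiple ports; an SR-IOV VF should only
# ever back one port on a given host.
echo "$bindings" | awk '{print $2}' | sort | uniq -d
```

With the data from [1] this prints four conflicting slots (0000:3b:00.3, 0000:3b:01.0, 0000:3b:01.3, 0000:3b:02.1), confirming every slot on this instance is double- or triple-booked.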
I think this has a similar root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1767797 (when unshelving an SR-IOV instance, the binding profile isn't reclaimed or rescheduled, and this might cause PCI-PT conflicts).

I'm not particularly surprised. I know that this was broken for cold migrate in the past, and host-evacuate is just doing a cold migration, so 13 probably does not have the fix. It looks quite similar to https://bugs.launchpad.net/nova/+bug/1658070, which was fixed in pike/osp 12 (https://review.opendev.org/#/c/466143/), or to https://github.com/openstack/nova/commit/b930336854bffec1bb81b6d67079a4df59e0af19, which was developed in queens/osp 13 to resolve https://bugs.launchpad.net/nova/+bug/1703629 (evacuation fails for instances with PCI devices due to missing migration), or to https://bugs.launchpad.net/nova/+bug/1630698 (nova evacuate of instances with sriov ports fails due to use of source device).

All of the above should already be fixed in 13, but I guess I'll add this to the list of ways that SR-IOV is broken. It's possible that https://bugs.launchpad.net/nova/+bug/1860555 is related, and I do think we fixed a different issue a few releases ago that may not have been backported to queens, but right now I'm not aware of a specific patch that addresses it.
This is still in our backlog, but it has also been reported upstream and a reproducer functional test has been created, so I updated the BZ with both links.
Let's stick to the newly opened BZ 2002243 for this; everything from comment #5 onward belongs there. Please stop updating this bug, it's needlessly confusing: the two situations are not identical.
*** Bug 2044754 has been marked as a duplicate of this bug. ***
This BZ is almost 2 years old, and is set as urgent/high. Can you provide a status update?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days