Bug 1678681 - REVERT_RESIZE stuck for 300s: "VirtualInterfaceCreateException: Virtual Interface creation failed"
Summary: REVERT_RESIZE stuck for 300s: "VirtualInterfaceCreateException: Virtual Interface creation failed"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 15.0 (Stein)
Assignee: Artom Lifshitz
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Duplicates: 1694629 1694957 1710043 1716564
Depends On:
Blocks: 1730728
 
Reported: 2019-02-19 11:33 UTC by Filip Hubík
Modified: 2023-09-18 00:15 UTC
CC List: 18 users

Fixed In Version: openstack-nova-19.0.2-0.20190718164748.950e044.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1730728
Environment:
Last Closed: 2019-09-21 11:20:27 UTC
Target Upstream Version:
Embargoed:


Attachments
Changed logs from /var/log from compute-1 during failed test run (54.43 KB, application/gzip)
2019-02-19 11:33 UTC, Filip Hubík
tempest log of failed testcase test_resize_server_revert (556.34 KB, text/plain)
2019-02-19 11:34 UTC, Filip Hubík
Neutron and Nova logs from all services from the upstream gate, sorted and merged with os-log-merger (12.95 MB, application/gzip)
2019-04-10 18:17 UTC, Artom Lifshitz


Links
- Launchpad 1832028 (last updated 2019-06-07 18:00:53 UTC)
- OpenStack gerrit 644881, MERGED: Revert resize: wait for events according to hybrid plug (last updated 2021-01-21 15:00:17 UTC)
- OpenStack gerrit 667177, MERGED: Revert resize: wait for events according to hybrid plug (last updated 2021-01-21 15:00:17 UTC)
- Red Hat Product Errata RHEA-2019:2811 (last updated 2019-09-21 11:20:55 UTC)

Description Filip Hubík 2019-02-19 11:33:06 UTC
Created attachment 1536290 [details]
Changed logs from /var/log from compute-1 during failed test run

Description of problem:

The Tempest test "tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert" (and also test_resize_server_revert_with_volume_attached) fails frequently during the REVERT_RESIZE operation. The instance transitions to the ERROR state after a 300s timeout, with the following error reported (visible via nova show or in /var/log/containers/nova/nova-compute.log on the compute node):

...
ERROR nova.compute.manager [req-1eee0290-b924-407b-b901-c248173c1542 c4940ef2d8e346dd91153046585e5c61 8eed26b767844e4bb03f561250fdb69d - default default] [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0] Setting instance vm_state to ERROR: VirtualInterfaceCreateException: Virtual Interface creation failed
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0] Traceback (most recent call last):
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7954, in _error_out_instance_on_exception
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]     yield
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4151, in finish_revert_resize
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]     block_device_info, power_on)
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8505, in finish_revert_migration
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]     power_on=power_on)
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5648, in _create_domain_and_network
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]     raise exception.VirtualInterfaceCreateException()
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0] VirtualInterfaceCreateException: Virtual Interface creation failed
...
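
For context, the 300s wait and the fatal exception correspond to Nova's VIF plugging options; a reference nova.conf excerpt with the upstream defaults (not taken from this deployment's configuration):

[DEFAULT]
# How long nova-compute waits for Neutron's network-vif-plugged event
vif_plugging_timeout = 300
# If true, a timeout raises VirtualInterfaceCreateException and the instance
# is put into the ERROR state instead of continuing
vif_plugging_is_fatal = true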

Neutron reports the port as ACTIVE until the timeout is reached; then the port is reported as DOWN as the VM reaches the ERROR state:

| 2abbba96-5239-4b26-810e-d949345472be |                                                 | fa:16:3e:cf:31:03 | ip_address='10.100.0.5', subnet_id='2061db16-c336-4243-be4f-e9435e521225'     | DOWN   |

$ openstack port show 2abbba96-5239-4b26-810e-d949345472be
+-----------------------+---------------------------------------------------------------------------+
| Field                 | Value                                                                     |
+-----------------------+---------------------------------------------------------------------------+
| admin_state_up        | UP                                                                        |
| allowed_address_pairs |                                                                           |
| binding_host_id       | compute-1.localdomain                                                     |
| binding_profile       |                                                                           |
| binding_vif_details   | datapath_type='system', ovs_hybrid_plug='True', port_filter='True'        |
| binding_vif_type      | ovs                                                                       |
| binding_vnic_type     | normal                                                                    |
| created_at            | 2019-02-19T11:09:56Z                                                      |
| data_plane_status     | None                                                                      |
| description           |                                                                           |
| device_id             | 2ab7a9d0-e99e-40da-a21e-4dc43928b842                                      |
| device_owner          | compute:nova                                                              |
| dns_assignment        | None                                                                      |
| dns_domain            | None                                                                      |
| dns_name              | None                                                                      |
| extra_dhcp_opts       |                                                                           |
| fixed_ips             | ip_address='10.100.0.5', subnet_id='2061db16-c336-4243-be4f-e9435e521225' |
| id                    | 2abbba96-5239-4b26-810e-d949345472be                                      |
| mac_address           | fa:16:3e:cf:31:03                                                         |
| name                  |                                                                           |
| network_id            | 13047326-e080-4361-9c96-56fe197f0e75                                      |
| port_security_enabled | True                                                                      |
| project_id            | 6481555f58c9415d8d0967ec7b52de93                                          |
| qos_policy_id         | None                                                                      |
| revision_number       | 13                                                                        |
| security_group_ids    | 030cada4-78f1-4390-82a1-706dccca84cf                                      |
| status                | DOWN                                                                      |
| tags                  |                                                                           |
| trunk_details         | None                                                                      |
| updated_at            | 2019-02-19T11:15:32Z                                                      |
+-----------------------+---------------------------------------------------------------------------+

On one of the controllers I can see this error:

DEBUG neutron.plugins.ml2.plugin [req-2dffc726-8391-48fb-bffd-aee5e448b9f3 - - - - -] Port 2abbba96-5239-4b26-810e-d949345472be cannot update to ACTIVE because it is not bound. 

Version-Release number of selected component (if applicable):
OSP14; the issue started to appear with puddle 2019-02-07.1

How reproducible:
I can see this issue in about 60-75% of cases, in both CI and manual deployments.
In CI the issue appears in deployments with 3 controllers, 2 computes and a Ceph backend, using ipv4 (vlan, vxlan, gre).

Steps to Reproduce:
1. Deploy the mentioned topologies using InfraRed with the Tempest stage included
2. Wait for Tempest to finish the full test suite
3. Manually re-run one of the test_resize_server_revert.* tests multiple times (see the example below)
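
A minimal re-run, using the same regex as the failing tests (exact invocation and tempest working directory depend on the deployment):

$ tempest run --regex test_resize_server_revert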

Actual results:
These tests fail in more than 50% of cases

Expected results:
Tests should pass in 100% of cases

Additional info:
/var/log gathered from the compute-1 node during this test is attached (compute-1-failed-run.tar.gz)
tempest.log for the failed test is attached (tempest.resize_server_revert.fail.log)

Comment 1 Filip Hubík 2019-02-19 11:34:25 UTC
Created attachment 1536291 [details]
tempest log of failed testcase test_resize_server_revert

Comment 3 Artom Lifshitz 2019-02-20 19:41:38 UTC
Looking at this failed test [1], I tracked the revert_resize request in Nova (req-22971baa-a410-4a82-912e-08b53baea63f) and saw this:

2019-02-14 18:55:57.209 [./compute-0/var/log/containers/nova/nova-compute.log] 1 WARNING nova.virt.libvirt.driver [req-22971baa-a410-4a82-912e-08b53baea63f 4bffcaca53e74db68d1958ba64224bec 11e189d140ce44b0b8151116b1bef61e - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] Timeout waiting for [('network-vif-plugged', u'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee')] for instance with vm_state resized and task_state resize_reverting.: Timeout: 300 seconds

Here's where Nova queries Neutron to update the port binding:

2019-02-14 18:50:51.138 [./compute-0/var/log/containers/nova/nova-compute.log] 1 INFO nova.network.neutronv2.api [req-22971baa-a410-4a82-912e-08b53baea63f 4bffcaca53e74db68d1958ba64224bec 11e189d140ce44b0b8151116b1bef61e - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] Updating port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee with attributes {'device_owner': u'compute:nova', 'binding:host_id': u'compute-0.localdomain'}

This makes sense: it's reverting to compute-0, which is where the instance was originally booted.

And here's the request on the Neutron side:

./controller-0/var/log/containers/neutron/server.log.1:2019-02-14 18:50:52.966 31 INFO neutron.wsgi [req-4a7dcf73-0250-4e7b-98eb-87c2d1e5a22c a8b690b338164076b3ebdf77c73e1aea 2fb49df20cec443bb2cab70931c6f5f4 - default default] 172.17.1.21 "PUT /v2.0/ports/b3ada13d-46a6-40a3-a0a5-a4e33cde96ee HTTP/1.1" status: 200  len: 1144 time: 1.8267539

Request req-4a7dcf73-0250-4e7b-98eb-87c2d1e5a22c on the Neutron side completes successfully, but the external event back to Nova isn't sent using the original context, so we have to go by timestamp. Here's Neutron sending the event:

2019-02-14 18:50:54.225 33 DEBUG novaclient.v2.client [-] REQ: curl -g -i -X POST http://172.17.1.17:8774/v2.1/os-server-external-events -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA1}482f0ed6723df329b563cada93afa53b8e7e0f85" -H "X-OpenStack-Nova-API-Version: 2.1" -d '{"events": [{"status": "completed", "tag": "b3ada13d-46a6-40a3-a0a5-a4e33cde96ee", "name": "network-vif-plugged", "server_uuid": "787d531e-fad8-481d-aec6-b9b3ddc2d4f6"}]}' _http_log_request /usr/lib/python2.7/site-packages/keystoneauth1/session.py:448

And here's Nova receiving it:

2019-02-14 18:50:54.239 21 DEBUG nova.api.openstack.wsgi [req-db5eec36-5a9b-4aa7-a912-12675341e615 d5269ffd668f49a1bd69346fb444c9d2 2fb49df20cec443bb2cab70931c6f5f4 - default default] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7ff81a2c2210>>, body: {"events": [{"status": "completed", "tag": "b3ada13d-46a6-40a3-a0a5-a4e33cde96ee", "name": "network-vif-plugged", "server_uuid": "787d531e-fad8-481d-aec6-b9b3ddc2d4f6"}]} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:615

The event then appears to get dispatched correctly to the computes:

2019-02-14 18:50:54.280 [./controller-0/var/log/containers/nova/nova-api.log] 21 DEBUG nova.compute.api [req-db5eec36-5a9b-4aa7-a912-12675341e615 d5269ffd668f49a1bd69346fb444c9d2 2fb49df20cec443bb2cab70931c6f5f4
 - default default] Instance 787d531e-fad8-481d-aec6-b9b3ddc2d4f6 is migrating, copying events to all relevant hosts: set([u'compute-0.localdomain', u'compute-1.localdomain']) _get_relevant_hosts /usr/lib/python
2.7/site-packages/nova/compute/api.py:4765

The external event is then ignored by both computes, for reasons that aren't clear to me yet:

2019-02-14 18:50:54.299 [./compute-0/var/log/containers/nova/nova-compute.log] 1 DEBUG nova.compute.manager [req-db5eec36-5a9b-4aa7-a912-12675341e615 d5269ffd668f49a1bd69346fb444c9d2 2fb49df20cec443bb2cab70931c6f5f4 - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] No waiting events found dispatching network-vif-plugged-b3ada13d-46a6-40a3-a0a5-a4e33cde96ee pop_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:366

2019-02-14 18:50:54.300 [./compute-1/var/log/containers/nova/nova-compute.log] 1 DEBUG nova.compute.manager [req-db5eec36-5a9b-4aa7-a912-12675341e615 d5269ffd668f49a1bd69346fb444c9d2 2fb49df20cec443bb2cab70931c6f5f4 - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] No waiting events found dispatching network-vif-plugged-b3ada13d-46a6-40a3-a0a5-a4e33cde96ee pop_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:366

I'll continue looking into this.

[1] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/ReleaseDelivery/view/OSP14/job/phase2-14_director-rhel-7.6-virthost-3cont_2comp-ipv4-vxlan-ceph-external-containers/88/testReport/tempest.api.compute.servers.test_server_actions/ServerActionsTestJSON/test_resize_server_revert_id_c03aab19_adb1_44f5_917d_c419577e9e68_/

Comment 4 Artom Lifshitz 2019-02-20 21:25:19 UTC
And here's the root cause: Nova starts listening for the event too late - in this case, 2 seconds *after* we've received it and thrown it away:

2019-02-14 18:50:56.385 [./compute-0/var/log/containers/nova/nova-compute.log] 1 DEBUG nova.compute.manager [req-22971baa-a410-4a82-912e-08b53baea63f 4bffcaca53e74db68d1958ba64224bec 11e189d140ce44b0b8151116b1bef61e - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] Preparing to wait for external event network-vif-plugged-b3ada13d-46a6-40a3-a0a5-a4e33cde96ee prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:328

So it's a legit Nova bug. I'll have to fix this upstream and backport.
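
To make the race concrete, here is a minimal, self-contained sketch of the register-then-wait pattern described above (illustrative code only, not Nova's actual implementation):

import threading

class InstanceEventHandler:
    # Registers one waiter per (event name, tag) and pops it when the
    # external event arrives from Neutron via the API.
    def __init__(self):
        self._waiters = {}
        self._lock = threading.Lock()

    def prepare_for_instance_event(self, name, tag):
        # Must run *before* the operation that triggers the event
        # (here: before asking Neutron to update the port binding).
        with self._lock:
            return self._waiters.setdefault((name, tag), threading.Event())

    def pop_instance_event(self, name, tag):
        # Called when the external event reaches nova-compute.
        with self._lock:
            waiter = self._waiters.pop((name, tag), None)
        if waiter is None:
            # "No waiting events found dispatching network-vif-plugged-..."
            return False
        waiter.set()
        return True

# The ordering seen in the logs: the event arrives first and is dropped,
# then Nova registers the waiter and waits, and the wait times out.
handler = InstanceEventHandler()
handler.pop_instance_event('network-vif-plugged', 'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee')
waiter = handler.prepare_for_instance_event('network-vif-plugged', 'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee')
print(waiter.wait(timeout=1))  # False; in Nova the timeout is 300 seconds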

Comment 5 Artom Lifshitz 2019-02-26 18:10:37 UTC
I've begun work on a patch to fix this, but it's not going to be easy. We have to change wholesale how external events are handled, and a whole bunch of different places in the code are affected.

Comment 6 Artom Lifshitz 2019-03-06 17:33:53 UTC
Here's some further analysis, in addition to what I said in comment 3 and comment 4.

The reason that Nova gets the network-vif-plugged event so early is that the vif remains plugged on the source host.

If allow_resize_to_same_host is True and the resize is done to the same host, the instance never leaves the "source", so it's mostly obvious that the VIFs remain plugged.

If allow_resize_to_same_host is False and the resize involves 2 compute hosts (as is the case in this BZ), when the instance reaches the VERIFY_RESIZE state there are actually 2 libvirt instances underneath: one on the source, in `shut off` state, and one on the dest, in `paused` state. If the resize is reverted, the instance on the source is spawned again. Because the instance remains on the source (albeit shut off), I believe the VIFs remain plugged on the source in this case as well, but I'd like confirmation from Neutron folks.
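
One way to confirm this is to list the libvirt domains on both hypervisors while the instance sits in VERIFY_RESIZE (a hedged example; on containerized deployments run virsh inside the libvirt container):

$ virsh list --all   # on the source compute: the domain shows as "shut off"
$ virsh list --all   # on the destination compute: the domain shows as "paused"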

Comment 7 Slawek Kaplonski 2019-03-07 14:19:26 UTC
Hi Artom,

I was also looking into the logs from this job, investigating instance 787d531e-fad8-481d-aec6-b9b3ddc2d4f6 and port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee connected to it (the same as you in comment 3).

From the Neutron side, everything looks good here.
During the resize, Nova shuts down the VM, so the port is removed from compute-0. This is visible in the Neutron logs:

2019-02-14 18:50:43.627 35 DEBUG neutron.plugins.ml2.rpc [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Device b3ada13d-46a6-40a3-a0a5-a4e33cde96ee no longer exists at agent ovs-agent-compute-0.localdomain update_device_down /usr/lib/python2.7/site-packages/neutron/plugin
2019-02-14 18:50:43.710 35 DEBUG neutron.plugins.ml2.rpc [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Device b3ada13d-46a6-40a3-a0a5-a4e33cde96ee not bound to the agent host compute-0.localdomain update_device_down


A notification about that is also sent to Nova:

2019-02-14 18:50:40.303 30 DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee', 'name': 'network-vif-unplugged', 'server_uuid': u'787d531e-fad8-481d-aec6-b9b3ddc2d4f6'}] send_events


Later the port is bound on compute-1 and configured by the L2 agent, and notifications about that are also sent to Nova:

2019-02-14 18:50:46.219 47 DEBUG neutron.plugins.ml2.rpc [req-8d4fc78e-511d-4904-ac3b-dd8b69aafdc8 - - - - -] Device b3ada13d-46a6-40a3-a0a5-a4e33cde96ee up at agent ovs-agent-compute-1.localdomain update_device_up /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:255

2019-02-14 18:50:46.298 47 DEBUG neutron.db.provisioning_blocks [req-8d4fc78e-511d-4904-ac3b-dd8b69aafdc8 - - - - -] Provisioning complete for port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee triggered by entity L2. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/pro

Notification sent to Nova:

2019-02-14 18:50:46.482 47 DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee', 'name': 'network-vif-plugged', 'server_uuid': u'787d531e-fad8-481d-aec6-b9b3ddc2d4f6'}] send_events


The resize is then reverted, so the instance is created again (or just spawned; I don't know the exact process on Nova's side) on compute-0.

Nova then asks Neutron to bind the port on compute-0 again, and that happens:

2019-02-14 18:50:52.482 31 DEBUG neutron.plugins.ml2.managers [req-4a7dcf73-0250-4e7b-98eb-87c2d1e5a22c a8b690b338164076b3ebdf77c73e1aea 2fb49df20cec443bb2cab70931c6f5f4 - default default] Attempting to bind port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee on host compute-0.localdomain for vnic_type normal with profile  bind_port /usr/lib/python2.7/site-packages/neutro

After that, the L2 agent on compute-0 wires the port again:

2019-02-14 18:50:53.968 33 DEBUG neutron.plugins.ml2.rpc [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Device b3ada13d-46a6-40a3-a0a5-a4e33cde96ee up at agent ovs-agent-compute-0.localdomain update_device_up /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:255

A notification is then sent to Nova:

2019-02-14 18:50:54.044 33 DEBUG neutron.db.provisioning_blocks [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Provisioning for port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee completed by entity L2. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132
2019-02-14 18:50:54.050 33 DEBUG neutron.db.provisioning_blocks [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Provisioning complete for port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee triggered by entity L2. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138
...
2019-02-14 18:50:54.222 33 DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee', 'name': 'network-vif-plugged', 'server_uuid': u'787d531e-fad8-481d-aec6-b9b3ddc2d4f6'}] send_events /usr/lib/python2.7/site-packages/neutron/notifiers/nova.py:242


So from the Neutron perspective, this case looks exactly the same as spawning a new instance.
I don't know the Nova code, but maybe during the resize revert Nova is not going through this code path: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L5672 and that's why it's not waiting for this event from Neutron?
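
For reference, the linked code path is around where the libvirt driver decides which Neutron events to wait for before spawning the domain; a simplified sketch of that selection logic (an approximation, not a verbatim copy of the Nova source):

def get_neutron_events(network_info):
    # Only wait for network-vif-plugged on VIFs that Neutron does not
    # already report as active; if every VIF is 'active', nothing is awaited.
    return [('network-vif-plugged', vif['id'])
            for vif in network_info
            if vif.get('active', True) is False]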

Comment 8 Slawek Kaplonski 2019-03-07 14:44:51 UTC
From the OVS agent logs I found that the port stayed on compute-0 the whole time during the resize:

2019-02-14 18:50:15.627 47296 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["353899e7-1c41-4f9c-a56d-7792c42db0e4","insert","qvob3ada13d-46",["set",[]],["map",[["attached-mac","fa:16:3e:68:bf:14"],["iface-id","b3ada13d-46a6-40a3-a0a5-a4e33cde96ee"],["iface-status","active"],["vm-uuid","787d531e-fad8-481d-aec6-b9b3ddc2d4f6"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:262
2019-02-14 18:50:15.632 47296 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["353899e7-1c41-4f9c-a56d-7792c42db0e4","old",null,["set",[]],null],["","new","qvob3ada13d-46",35,["map",[["attached-mac","fa:16:3e:68:bf:14"],["iface-id","b3ada13d-46a6-40a3-a0a5-a4e33cde96ee"],["iface-status","active"],["vm-uuid","787d531e-fad8-481d-aec6-b9b3ddc2d4f6"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:262
2019-02-14 18:55:57.487 47296 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["353899e7-1c41-4f9c-a56d-7792c42db0e4","delete","qvob3ada13d-46",35,["map",[["attached-mac","fa:16:3e:68:bf:14"],["iface-id","b3ada13d-46a6-40a3-a0a5-a4e33cde96ee"],["iface-status","active"],["vm-uuid","787d531e-fad8-481d-aec6-b9b3ddc2d4f6"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:262

Comment 9 Artom Lifshitz 2019-03-20 14:48:58 UTC
Posted a WIP for the new and hopefully final approach to fixing this, updated the link in the external tracker.

Comment 12 Artom Lifshitz 2019-04-04 20:44:33 UTC
*** Bug 1694957 has been marked as a duplicate of this bug. ***

Comment 13 Artom Lifshitz 2019-04-04 20:51:00 UTC
*** Bug 1694629 has been marked as a duplicate of this bug. ***

Comment 14 Artom Lifshitz 2019-04-10 18:16:09 UTC
The reason for the delay is that I'm trying to track down the cause of an upstream test failure [1] on the proposed fix. Specifically, the patch isn't passing the test_resize_server_revert tests because the instance fails the revert resize with a timeout waiting for the network-vif-plugged event.

The patch modifies Nova to start waiting for external events before we call any of the Neutron API methods in Nova (setup_networks_on_host and migrate_instance_finish), which would cause the port binding to be updated and the external event to be sent.
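
Schematically, the ordering the patch aims for looks like this (a self-contained sketch; trigger_port_binding_update is a made-up stand-in for the setup_networks_on_host / migrate_instance_finish calls):

import threading

waiters = {}

def prepare_for_instance_event(name, tag):
    waiters[(name, tag)] = threading.Event()   # 1. register the waiter first
    return waiters[(name, tag)]

def trigger_port_binding_update(port_id):
    # Stand-in for the Neutron API calls; in reality Neutron sends
    # network-vif-plugged back to Nova, which sets the waiter.
    waiters[('network-vif-plugged', port_id)].set()

port_id = '5340d74a-184b-4c08-8bd8-d41ca9d6f21e'
waiter = prepare_for_instance_event('network-vif-plugged', port_id)
trigger_port_binding_update(port_id)           # 2. then call Neutron
assert waiter.wait(timeout=300)                # 3. then wait: no race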

From what I can tell, on the Nova side of things, we go straight from "Preparing to wait for external events" to "Timeout waiting for external events", with nothing in between (formatted for legibility):

Mar 27 20:32:45.175033
[./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-n-cpu.txt.gz]
ubuntu-bionic-rax-ord-0004428291 nova-compute[19301]: DEBUG nova.compute.manager
[None req-35e5f0c9-9e63-4f35-ad59-bbd6aa71cf0c
 tempest-ServerActionsTestJSON-1612454868
 tempest-ServerActionsTestJSON-1612454868]
[instance: 9ce66ab0-abdf-40ce-8ab9-c719e61b4366]
Preparing to wait for external event network-vif-plugged-5340d74a-184b-4c08-8bd8-d41ca9d6f21e
{{(pid=19301) prepare_for_instance_event /opt/stack/nova/nova/compute/manager.py:326}}

Mar 27 20:37:45.258678
[./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-n-cpu.txt.gz]
ubuntu-bionic-rax-ord-0004428291 nova-compute[19301]: ERROR nova.compute.manager
[None req-35e5f0c9-9e63-4f35-ad59-bbd6aa71cf0c
 tempest-ServerActionsTestJSON-1612454868
 tempest-ServerActionsTestJSON-1612454868]
Timeout waiting for Neutron events [('network-vif-plugged', '5340d74a-184b-4c08-8bd8-d41ca9d6f21e')]

On the Neutron side of things, I see the port being *removed*, for reasons I don't understand:

Mar 27 20:32:46.349332
[./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-q-agt.txt.gz]
ubuntu-bionic-rax-ord-0004428291 neutron-openvswitch-agent[11536]:
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
[None req-299c3b48-0da2-481d-a95b-585d1ae08e51 None None]
Ports {'5340d74a-184b-4c08-8bd8-d41ca9d6f21e'} removed

The external event is sent to Nova, leaving Nova understandably confused:

Mar 27 20:32:46.473200
[./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-q-svc.txt.gz]
ubuntu-bionic-rax-ord-0004428291 neutron-server[10259]: DEBUG neutron.notifiers.nova [-]
Sending events: [{'server_uuid': '9ce66ab0-abdf-40ce-8ab9-c719e61b4366',
                  'name': 'network-vif-unplugged',
                  'status': 'completed',
                  'tag': '5340d74a-184b-4c08-8bd8-d41ca9d6f21e'}]
{{(pid=10638) send_events /opt/stack/neutron/neutron/notifiers/nova.py:245}}

Mar 27 20:32:46.535510
[./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-n-cpu.txt.gz]
ubuntu-bionic-rax-ord-0004428291 nova-compute[19301]: WARNING nova.compute.manager
[req-c1e216aa-630d-4eb2-be77-5bfdeb70d9d2
 req-75bf2f97-15ba-4d8a-b610-82eb4fe51efd
 service nova]
[instance: 9ce66ab0-abdf-40ce-8ab9-c719e61b4366]
Received unexpected event network-vif-unplugged-5340d74a-184b-4c08-8bd8-d41ca9d6f21e
 for instance with vm_state resized and task_state resize_reverting.

Slaweq (or any other Neutron folks), could you take a look at the attached logs and see if you can shed light on why Neutron removes the port?

[1] http://logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/testr_results.html.gz

Comment 15 Artom Lifshitz 2019-04-10 18:17:14 UTC
Created attachment 1554341 [details]
Neutron and Nova logs from all services from the upstream gate, sorted and merged with os-log-merger

Comment 16 Artom Lifshitz 2019-04-11 20:01:00 UTC
After talking about this with Slawek, it appears the port being removed is a consequence of the initial resize. I've pushed a new DNM patch upstream with extra logs to try to identify why Nova isn't sending any port binding update request to Neutron during the revert.

Comment 20 Francois Palin 2019-05-23 17:46:22 UTC
*** Bug 1710043 has been marked as a duplicate of this bug. ***

Comment 21 Michele Baldessari 2019-06-04 14:18:13 UTC
*** Bug 1716564 has been marked as a duplicate of this bug. ***

Comment 25 Patrick Oberdorf 2019-06-21 17:03:31 UTC
Hello,

We are facing the exact same problem. Is there already a fix out there?

Comment 26 Brian 2019-06-21 19:26:57 UTC
The fix is in upstream review https://review.opendev.org/#/c/644881/

Comment 29 Joe H. Rahme 2019-08-05 17:39:02 UTC
(overcloud) [stack@undercloud-0 tempestdir]$ tempest run --regex test_resize_server_revert
{0} tempest.api.compute.admin.test_migrations.MigrationsAdminTest.test_resize_server_revert_deleted_flavor [83.179225s] ... ok
{1} tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert [47.407996s] ... ok
{1} tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert_with_volume_attached [71.966658s] ... ok

======
Totals
======
Ran: 3 tests in 159.9172 sec.
 - Passed: 3
 - Skipped: 0
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 202.5539 sec.

==============
Worker Balance
==============
 - Worker 0 (1 tests) => 0:01:23.179225
 - Worker 1 (2 tests) => 0:01:59.379387

Comment 32 errata-xmlrpc 2019-09-21 11:20:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

Comment 33 Red Hat Bugzilla 2023-09-18 00:15:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

