Bug 1678681

Summary: REVERT_RESIZE stuck for 300s: "VirtualInterfaceCreateException: Virtual Interface creation failed"

Product: Red Hat OpenStack
Component: openstack-nova
Version: 15.0 (Stein)
Target Release: 15.0 (Stein)
Target Milestone: rc
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Keywords: Automation, AutomationBlocker, Patch, Triaged, ZStream
Fixed In Version: openstack-nova-19.0.2-0.20190718164748.950e044.el8ost
Doc Type: If docs needed, set a value
Reporter: Filip Hubík <fhubik>
Assignee: Artom Lifshitz <alifshit>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: alifshit, amodi, bbowen, dasmith, jhakimra, jschluet, kchamart, lyarwood, mbooth, michele, mpryc, njohnston, p.oberdorf, rrasouli, sbauza, sgordon, skaplons, vromanso
Bug Blocks: 1730728 (view as bug list)
Type: Bug
Last Closed: 2019-09-21 11:20:27 UTC
Created attachment 1536291 [details]
tempest log of failed testcase test_resize_server_revert
Looking at this [1] failed test, I tracked the revert_resize request in Nova (req-22971baa-a410-4a82-912e-08b53baea63f) and saw this:

2019-02-14 18:55:57.209 [./compute-0/var/log/containers/nova/nova-compute.log] 1 WARNING nova.virt.libvirt.driver [req-22971baa-a410-4a82-912e-08b53baea63f 4bffcaca53e74db68d1958ba64224bec 11e189d140ce44b0b8151116b1bef61e - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] Timeout waiting for [('network-vif-plugged', u'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee')] for instance with vm_state resized and task_state resize_reverting.: Timeout: 300 seconds

Here's where Nova queries Neutron to update the port binding:

2019-02-14 18:50:51.138 [./compute-0/var/log/containers/nova/nova-compute.log] 1 INFO nova.network.neutronv2.api [req-22971baa-a410-4a82-912e-08b53baea63f 4bffcaca53e74db68d1958ba64224bec 11e189d140ce44b0b8151116b1bef61e - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] Updating port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee with attributes {'device_owner': u'compute:nova', 'binding:host_id': u'compute-0.localdomain'}

This makes sense: it's reverting to compute-0, which is where the instance was originally booted. And here's the request on the Neutron side:

./controller-0/var/log/containers/neutron/server.log.1:2019-02-14 18:50:52.966 31 INFO neutron.wsgi [req-4a7dcf73-0250-4e7b-98eb-87c2d1e5a22c a8b690b338164076b3ebdf77c73e1aea 2fb49df20cec443bb2cab70931c6f5f4 - default default] 172.17.1.21 "PUT /v2.0/ports/b3ada13d-46a6-40a3-a0a5-a4e33cde96ee HTTP/1.1" status: 200 len: 1144 time: 1.8267539

Request req-4a7dcf73-0250-4e7b-98eb-87c2d1e5a22c on the Neutron side completes successfully, but the external event back to Nova isn't sent using the original context, so we have to go by timestamp.
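As an aside, the port update logged above is an ordinary Neutron API call (PUT on the port with a new binding:host_id). A rough sketch of the equivalent request using python-requests follows; the endpoint and token are placeholders, not values from this environment, and updating binding:host_id requires an admin token.

```python
# Minimal sketch of the PUT /v2.0/ports/<id> call that updates the port
# binding during the revert resize. Endpoint and token are placeholders.
import json
import requests

NEUTRON_URL = "http://controller.example:9696"      # hypothetical endpoint
TOKEN = "gAAAA..."                                   # admin Keystone token (placeholder)
PORT_ID = "b3ada13d-46a6-40a3-a0a5-a4e33cde96ee"     # port from the logs above

body = {"port": {
    "device_owner": "compute:nova",
    "binding:host_id": "compute-0.localdomain",      # revert target host
}}

resp = requests.put(
    f"{NEUTRON_URL}/v2.0/ports/{PORT_ID}",
    headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
    data=json.dumps(body),
)
resp.raise_for_status()
print(resp.json()["port"]["binding:host_id"])
```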
Here's Neutron sending the event:

2019-02-14 18:50:54.225 33 DEBUG novaclient.v2.client [-] REQ: curl -g -i -X POST http://172.17.1.17:8774/v2.1/os-server-external-events -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA1}482f0ed6723df329b563cada93afa53b8e7e0f85" -H "X-OpenStack-Nova-API-Version: 2.1" -d '{"events": [{"status": "completed", "tag": "b3ada13d-46a6-40a3-a0a5-a4e33cde96ee", "name": "network-vif-plugged", "server_uuid": "787d531e-fad8-481d-aec6-b9b3ddc2d4f6"}]}' _http_log_request /usr/lib/python2.7/site-packages/keystoneauth1/session.py:448

And here's Nova receiving it:

2019-02-14 18:50:54.239 21 DEBUG nova.api.openstack.wsgi [req-db5eec36-5a9b-4aa7-a912-12675341e615 d5269ffd668f49a1bd69346fb444c9d2 2fb49df20cec443bb2cab70931c6f5f4 - default default] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7ff81a2c2210>>, body: {"events": [{"status": "completed", "tag": "b3ada13d-46a6-40a3-a0a5-a4e33cde96ee", "name": "network-vif-plugged", "server_uuid": "787d531e-fad8-481d-aec6-b9b3ddc2d4f6"}]} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:615

The event then appears to get dispatched correctly to the computes:

2019-02-14 18:50:54.280 [./controller-0/var/log/containers/nova/nova-api.log] 21 DEBUG nova.compute.api [req-db5eec36-5a9b-4aa7-a912-12675341e615 d5269ffd668f49a1bd69346fb444c9d2 2fb49df20cec443bb2cab70931c6f5f4 - default default] Instance 787d531e-fad8-481d-aec6-b9b3ddc2d4f6 is migrating, copying events to all relevant hosts: set([u'compute-0.localdomain', u'compute-1.localdomain']) _get_relevant_hosts /usr/lib/python2.7/site-packages/nova/compute/api.py:4765

The external event is then ignored by both computes, for reasons that aren't clear to me yet:

2019-02-14 18:50:54.299 [./compute-0/var/log/containers/nova/nova-compute.log] 1 DEBUG nova.compute.manager [req-db5eec36-5a9b-4aa7-a912-12675341e615 d5269ffd668f49a1bd69346fb444c9d2 2fb49df20cec443bb2cab70931c6f5f4 - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] No waiting events found dispatching network-vif-plugged-b3ada13d-46a6-40a3-a0a5-a4e33cde96ee pop_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:366

2019-02-14 18:50:54.300 [./compute-1/var/log/containers/nova/nova-compute.log] 1 DEBUG nova.compute.manager [req-db5eec36-5a9b-4aa7-a912-12675341e615 d5269ffd668f49a1bd69346fb444c9d2 2fb49df20cec443bb2cab70931c6f5f4 - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] No waiting events found dispatching network-vif-plugged-b3ada13d-46a6-40a3-a0a5-a4e33cde96ee pop_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:366

I'll continue looking into this.

[1] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/ReleaseDelivery/view/OSP14/job/phase2-14_director-rhel-7.6-virthost-3cont_2comp-ipv4-vxlan-ceph-external-containers/88/testReport/tempest.api.compute.servers.test_server_actions/ServerActionsTestJSON/test_resize_server_revert_id_c03aab19_adb1_44f5_917d_c419577e9e68_/
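The "No waiting events found ... pop_instance_event" lines are the key symptom: the event is popped with no registered waiter, so it is effectively discarded. A simplified model of that waiter/pop pattern (illustrative only, not Nova's actual implementation) shows why the ordering matters:

```python
# Simplified model of the external-event waiter pattern. If the event is
# delivered before a waiter has been registered, it is dropped and the
# later wait() times out. Not Nova's actual code.
import threading

class EventWaiter:
    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = {}  # event name -> threading.Event

    def prepare_for_event(self, name):
        # Must be called BEFORE the action that triggers the event.
        with self._lock:
            ev = threading.Event()
            self._waiters[name] = ev
            return ev

    def pop_event(self, name):
        # Called when the external event is received from Neutron.
        with self._lock:
            ev = self._waiters.pop(name, None)
        if ev is None:
            print(f"No waiting events found dispatching {name}")  # event is lost
        else:
            ev.set()

waiter = EventWaiter()

# Buggy ordering (what the logs show): event arrives first, waiter is
# registered only afterwards, so the wait can only time out.
waiter.pop_event("network-vif-plugged-b3ada13d")           # "No waiting events found"
ev = waiter.prepare_for_event("network-vif-plugged-b3ada13d")
print("received in time:", ev.wait(timeout=1))              # False -> timeout
```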
And here's the root cause: Nova starts listening for the event too late - in this case, 2 seconds *after* we've received it and thrown it away:

2019-02-14 18:50:56.385 [./compute-0/var/log/containers/nova/nova-compute.log] 1 DEBUG nova.compute.manager [req-22971baa-a410-4a82-912e-08b53baea63f 4bffcaca53e74db68d1958ba64224bec 11e189d140ce44b0b8151116b1bef61e - default default] [instance: 787d531e-fad8-481d-aec6-b9b3ddc2d4f6] Preparing to wait for external event network-vif-plugged-b3ada13d-46a6-40a3-a0a5-a4e33cde96ee prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:328

So it's a legitimate Nova bug. I'll have to fix this upstream and backport.

I've begun on a patch to fix this, but it's not going to be easy. We have to change wholesale how external events are handled, and a whole bunch of different places in the code are affected.

Here's some further analysis, in addition to what I said in comment 3 and comment 4. The reason that Nova gets the network-vif-plugged event so early is that the VIF remains plugged on the source host. If allow_resize_to_same_host is True and the resize is done to the same host, the instance never leaves the "source", so it's fairly obvious that the VIFs remain plugged. If allow_resize_to_same_host is False and the resize involves 2 compute hosts (as is the case in this BZ), when the instance reaches the VERIFY_RESIZE state there are actually 2 libvirt instances underneath: one on the source, in `shut off` state, and one on the dest, in `paused` state. If the resize is reverted, the instance on the source is spawned again. Because the instance remains on the source (albeit shut off), I believe the VIFs remain plugged on the source in this case as well, but I'd like confirmation from Neutron folks.

Hi Artom, I was also looking into the logs from this job, and I was investigating instance 787d531e-fad8-481d-aec6-b9b3ddc2d4f6 and port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee connected to it (the same as you in comment 3). From the Neutron side, everything looks good here. During the resize, Nova shuts down the VM, so the port is removed from compute-0.
It is visible in the Neutron logs:

2019-02-14 18:50:43.627 35 DEBUG neutron.plugins.ml2.rpc [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Device b3ada13d-46a6-40a3-a0a5-a4e33cde96ee no longer exists at agent ovs-agent-compute-0.localdomain update_device_down /usr/lib/python2.7/site-packages/neutron/plugin

2019-02-14 18:50:43.710 35 DEBUG neutron.plugins.ml2.rpc [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Device b3ada13d-46a6-40a3-a0a5-a4e33cde96ee not bound to the agent host compute-0.localdomain update_device_down

There is also a notification about that sent to Nova:

2019-02-14 18:50:40.303 30 DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee', 'name': 'network-vif-unplugged', 'server_uuid': u'787d531e-fad8-481d-aec6-b9b3ddc2d4f6'}] send_events

Later the port is even bound on compute-1, it is configured by the L2 agent, and notifications about that are also sent to Nova:

2019-02-14 18:50:46.219 47 DEBUG neutron.plugins.ml2.rpc [req-8d4fc78e-511d-4904-ac3b-dd8b69aafdc8 - - - - -] Device b3ada13d-46a6-40a3-a0a5-a4e33cde96ee up at agent ovs-agent-compute-1.localdomain update_device_up /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:255

2019-02-14 18:50:46.298 47 DEBUG neutron.db.provisioning_blocks [req-8d4fc78e-511d-4904-ac3b-dd8b69aafdc8 - - - - -] Provisioning complete for port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee triggered by entity L2. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/pro

Notification sent to Nova:

2019-02-14 18:50:46.482 47 DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee', 'name': 'network-vif-plugged', 'server_uuid': u'787d531e-fad8-481d-aec6-b9b3ddc2d4f6'}] send_events

And then the resize is reverted, so the instance is created again (or just spawned, I don't know the exact process on Nova's side) on compute-0. Nova then asks Neutron to bind the port on compute-0 again, and it happens:

2019-02-14 18:50:52.482 31 DEBUG neutron.plugins.ml2.managers [req-4a7dcf73-0250-4e7b-98eb-87c2d1e5a22c a8b690b338164076b3ebdf77c73e1aea 2fb49df20cec443bb2cab70931c6f5f4 - default default] Attempting to bind port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee on host compute-0.localdomain for vnic_type normal with profile bind_port /usr/lib/python2.7/site-packages/neutro

And after that, the L2 agent on compute-0 wires the port again:

2019-02-14 18:50:53.968 33 DEBUG neutron.plugins.ml2.rpc [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Device b3ada13d-46a6-40a3-a0a5-a4e33cde96ee up at agent ovs-agent-compute-0.localdomain update_device_up /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:255

And a notification is then sent to Nova:

2019-02-14 18:50:54.044 33 DEBUG neutron.db.provisioning_blocks [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Provisioning for port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee completed by entity L2. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132

2019-02-14 18:50:54.050 33 DEBUG neutron.db.provisioning_blocks [req-aed78ef5-2be4-4235-b631-163b64943623 - - - - -] Provisioning complete for port b3ada13d-46a6-40a3-a0a5-a4e33cde96ee triggered by entity L2. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138

...
2019-02-14 18:50:54.222 33 DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'b3ada13d-46a6-40a3-a0a5-a4e33cde96ee', 'name': 'network-vif-plugged', 'server_uuid': u'787d531e-fad8-481d-aec6-b9b3ddc2d4f6'}] send_events /usr/lib/python2.7/site-packages/neutron/notifiers/nova.py:242

So from the Neutron perspective, this case looks exactly the same as spawning a new instance. I don't know the Nova code, but maybe during the resize revert Nova is not going through this code path: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L5672 and that's why it's not waiting for this event from Neutron?

From the OVS agent logs I found that during the resize the port stayed on compute-0 the whole time:

2019-02-14 18:50:15.627 47296 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["353899e7-1c41-4f9c-a56d-7792c42db0e4","insert","qvob3ada13d-46",["set",[]],["map",[["attached-mac","fa:16:3e:68:bf:14"],["iface-id","b3ada13d-46a6-40a3-a0a5-a4e33cde96ee"],["iface-status","active"],["vm-uuid","787d531e-fad8-481d-aec6-b9b3ddc2d4f6"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:262

2019-02-14 18:50:15.632 47296 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["353899e7-1c41-4f9c-a56d-7792c42db0e4","old",null,["set",[]],null],["","new","qvob3ada13d-46",35,["map",[["attached-mac","fa:16:3e:68:bf:14"],["iface-id","b3ada13d-46a6-40a3-a0a5-a4e33cde96ee"],["iface-status","active"],["vm-uuid","787d531e-fad8-481d-aec6-b9b3ddc2d4f6"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:262

2019-02-14 18:55:57.487 47296 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["353899e7-1c41-4f9c-a56d-7792c42db0e4","delete","qvob3ada13d-46",35,["map",[["attached-mac","fa:16:3e:68:bf:14"],["iface-id","b3ada13d-46a6-40a3-a0a5-a4e33cde96ee"],["iface-status","active"],["vm-uuid","787d531e-fad8-481d-aec6-b9b3ddc2d4f6"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:262

Posted a WIP for the new and hopefully final approach to fixing this; updated the link in the external tracker.

*** Bug 1694957 has been marked as a duplicate of this bug. ***

*** Bug 1694629 has been marked as a duplicate of this bug. ***

The reason for the delay is that I'm trying to track down the cause of an upstream test failure [1] on the proposed fix. Specifically, the patch isn't passing the test_resize_server_revert tests because the instance fails the revert resize with a timeout waiting for the network-vif-plugged event. The patch modifies Nova to start waiting for external events before we call any of the Neutron API methods in Nova (setup_networks_on_host and migrate_instance_finish) which would cause the port binding to be updated and the external event to be sent.
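Expressed against the simplified waiter model from the earlier sketch, the approach just described amounts to a reordering: register the waiter before triggering the port binding update, then block. The helper names below (update_port_binding, EventWaiter) are hypothetical, not Nova's actual API; this is only an illustration of the intended ordering, not the proposed patch.

```python
# Illustrative only: the ordering the fix aims for, using the EventWaiter
# model from the earlier sketch. Names here are hypothetical.
def revert_resize_fixed(waiter, update_port_binding, vif_id, timeout=300):
    event_name = f"network-vif-plugged-{vif_id}"
    # 1. Register the waiter FIRST...
    ev = waiter.prepare_for_event(event_name)
    # 2. ...then trigger the Neutron port binding update that will eventually
    #    cause network-vif-plugged to be sent back to us...
    update_port_binding(vif_id)
    # 3. ...and only then block until the event arrives (or time out).
    if not ev.wait(timeout=timeout):
        raise TimeoutError(f"Timed out waiting for {event_name}")
```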
From what I can tell, on the Nova side of things, we go straight from "Preparing to wait for external events" to "Timeout waiting for external events", with nothing in between (formatted for legibility):

Mar 27 20:32:45.175033 [./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-n-cpu.txt.gz] ubuntu-bionic-rax-ord-0004428291 nova-compute[19301]: DEBUG nova.compute.manager [None req-35e5f0c9-9e63-4f35-ad59-bbd6aa71cf0c tempest-ServerActionsTestJSON-1612454868 tempest-ServerActionsTestJSON-1612454868] [instance: 9ce66ab0-abdf-40ce-8ab9-c719e61b4366] Preparing to wait for external event network-vif-plugged-5340d74a-184b-4c08-8bd8-d41ca9d6f21e {{(pid=19301) prepare_for_instance_event /opt/stack/nova/nova/compute/manager.py:326}}

Mar 27 20:37:45.258678 [./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-n-cpu.txt.gz] ubuntu-bionic-rax-ord-0004428291 nova-compute[19301]: ERROR nova.compute.manager [None req-35e5f0c9-9e63-4f35-ad59-bbd6aa71cf0c tempest-ServerActionsTestJSON-1612454868 tempest-ServerActionsTestJSON-1612454868] Timeout waiting for Neutron events [('network-vif-plugged', '5340d74a-184b-4c08-8bd8-d41ca9d6f21e')]

On the Neutron side of things, I see the port being *removed*, for reasons I don't understand:

Mar 27 20:32:46.349332 [./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-q-agt.txt.gz] ubuntu-bionic-rax-ord-0004428291 neutron-openvswitch-agent[11536]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-299c3b48-0da2-481d-a95b-585d1ae08e51 None None] Ports {'5340d74a-184b-4c08-8bd8-d41ca9d6f21e'} removed

The external event is sent to Nova, leaving Nova understandably confused:

Mar 27 20:32:46.473200 [./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-q-svc.txt.gz] ubuntu-bionic-rax-ord-0004428291 neutron-server[10259]: DEBUG neutron.notifiers.nova [-] Sending events: [{'server_uuid': '9ce66ab0-abdf-40ce-8ab9-c719e61b4366', 'name': 'network-vif-unplugged', 'status': 'completed', 'tag': '5340d74a-184b-4c08-8bd8-d41ca9d6f21e'}] {{(pid=10638) send_events /opt/stack/neutron/neutron/notifiers/nova.py:245}}

Mar 27 20:32:46.535510 [./logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/controller/logs/screen-n-cpu.txt.gz] ubuntu-bionic-rax-ord-0004428291 nova-compute[19301]: WARNING nova.compute.manager [req-c1e216aa-630d-4eb2-be77-5bfdeb70d9d2 req-75bf2f97-15ba-4d8a-b610-82eb4fe51efd service nova] [instance: 9ce66ab0-abdf-40ce-8ab9-c719e61b4366] Received unexpected event network-vif-unplugged-5340d74a-184b-4c08-8bd8-d41ca9d6f21e for instance with vm_state resized and task_state resize_reverting.

Slaweq (or any other Neutron folks), could you take a look at the attached logs and see if you can shed light on why Neutron removes the port?

[1] http://logs.openstack.org/81/644881/4/check/tempest-full-py3/49b167c/testr_results.html.gz

Created attachment 1554341 [details]
Neutron and Nova logs from all services from the upstream gate, sorted and merged with os-log-merger
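When digging through a merged log file like the one attached above, a tiny helper that pulls out every line mentioning the port or instance UUID in timestamp order makes the sequence from the earlier comments easier to follow. This is a rough sketch; the file name and UUIDs are examples only.

```python
# Rough log-forensics helper for a merged log file (e.g. os-log-merger output):
# print every line that mentions any of the given UUIDs. File name and UUIDs
# below are examples only.
import sys

UUIDS = (
    "b3ada13d-46a6-40a3-a0a5-a4e33cde96ee",   # port
    "787d531e-fad8-481d-aec6-b9b3ddc2d4f6",   # instance
)

def grep_uuids(path, uuids=UUIDS):
    with open(path, errors="replace") as fh:
        for line in fh:
            if any(u in line for u in uuids):
                yield line.rstrip()

if __name__ == "__main__":
    for line in grep_uuids(sys.argv[1] if len(sys.argv) > 1 else "merged.log"):
        print(line)
```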
After talking about this with Slawek, it appears the port being removed is a consequence of the initial resize. I've pushed a new DNM patch upstream with extra logs to try and identify why Nova isn't sending any update port binding request to Neutron during the revert.

*** Bug 1710043 has been marked as a duplicate of this bug. ***

*** Bug 1716564 has been marked as a duplicate of this bug. ***

Hello, we are facing the exact same problem. Is there already any fix out there?

The fix is in upstream review: https://review.opendev.org/#/c/644881/

(overcloud) [stack@undercloud-0 tempestdir]$ tempest run --regex test_resize_server_revert
{0} tempest.api.compute.admin.test_migrations.MigrationsAdminTest.test_resize_server_revert_deleted_flavor [83.179225s] ... ok
{1} tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert [47.407996s] ... ok
{1} tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert_with_volume_attached [71.966658s] ... ok

======
Totals
======
Ran: 3 tests in 159.9172 sec.
- Passed: 3
- Skipped: 0
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 0
Sum of execute time for each test: 202.5539 sec.

==============
Worker Balance
==============
- Worker 0 (1 tests) => 0:01:23.179225
- Worker 1 (2 tests) => 0:01:59.379387

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:2811

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
Created attachment 1536290 [details]
Changed logs from /var/log from compute-1 during failed test run

Description of problem:

Tempest test "tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert" (and also test_resize_server_revert_with_volume_attached) fails frequently during the REVERT_RESIZE operation. REVERT_RESIZE transitions to the ERROR state after a 300s timeout, with this error reported (nova show, or /var/log/containers/nova/nova-compute.log on the compute node):

...
ERROR nova.compute.manager [req-1eee0290-b924-407b-b901-c248173c1542 c4940ef2d8e346dd91153046585e5c61 8eed26b767844e4bb03f561250fdb69d - default default] [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0] Setting instance vm_state to ERROR: VirtualInterfaceCreateException: Virtual Interface creation failed
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0] Traceback (most recent call last):
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7954, in _error_out_instance_on_exception
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]     yield
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4151, in finish_revert_resize
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]     block_device_info, power_on)
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8505, in finish_revert_migration
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]     power_on=power_on)
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5648, in _create_domain_and_network
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0]     raise exception.VirtualInterfaceCreateException()
ERROR nova.compute.manager [instance: be01b60c-3af6-4ebc-9d70-5a6edb8b03f0] VirtualInterfaceCreateException: Virtual Interface creation failed
...
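For context, the 300 s figure matches the default of Nova's vif_plugging_timeout option, and the exception in the traceback is raised on timeout when vif_plugging_is_fatal is true (its default). The following is a simplified model of that failure mode, not the actual nova.virt.libvirt.driver code:

```python
# Simplified model of the failure mode shown in the traceback above: wait for
# network-vif-plugged; on timeout, raise if vif_plugging_is_fatal is set.
# The two config names are real Nova options; the defaults are shown.
# This is an illustration, not the driver's actual implementation.
class VirtualInterfaceCreateException(Exception):
    pass

def create_domain_and_network(wait_for_vif_plugged,
                              vif_plugging_timeout=300,
                              vif_plugging_is_fatal=True):
    ok = wait_for_vif_plugged(timeout=vif_plugging_timeout)
    if not ok and vif_plugging_is_fatal:
        # This is the point the traceback above reaches after 300 seconds.
        raise VirtualInterfaceCreateException("Virtual Interface creation failed")
    return ok
```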
Neutron reports the port as ACTIVE until the timeout is reached; then the port is reported as DOWN as the VM reaches the ERROR state (a small polling sketch to observe this transition follows at the end of this report):

| 2abbba96-5239-4b26-810e-d949345472be | | fa:16:3e:cf:31:03 | ip_address='10.100.0.5', subnet_id='2061db16-c336-4243-be4f-e9435e521225' | DOWN |

$ openstack port show 2abbba96-5239-4b26-810e-d949345472be
| Field                 | Value                                                                      |
| admin_state_up        | UP                                                                         |
| allowed_address_pairs |                                                                            |
| binding_host_id       | compute-1.localdomain                                                      |
| binding_profile       |                                                                            |
| binding_vif_details   | datapath_type='system', ovs_hybrid_plug='True', port_filter='True'        |
| binding_vif_type      | ovs                                                                        |
| binding_vnic_type     | normal                                                                     |
| created_at            | 2019-02-19T11:09:56Z                                                       |
| data_plane_status     | None                                                                       |
| description           |                                                                            |
| device_id             | 2ab7a9d0-e99e-40da-a21e-4dc43928b842                                       |
| device_owner          | compute:nova                                                               |
| dns_assignment        | None                                                                       |
| dns_domain            | None                                                                       |
| dns_name              | None                                                                       |
| extra_dhcp_opts       |                                                                            |
| fixed_ips             | ip_address='10.100.0.5', subnet_id='2061db16-c336-4243-be4f-e9435e521225' |
| id                    | 2abbba96-5239-4b26-810e-d949345472be                                       |
| mac_address           | fa:16:3e:cf:31:03                                                          |
| name                  |                                                                            |
| network_id            | 13047326-e080-4361-9c96-56fe197f0e75                                       |
| port_security_enabled | True                                                                       |
| project_id            | 6481555f58c9415d8d0967ec7b52de93                                           |
| qos_policy_id         | None                                                                       |
| revision_number       | 13                                                                         |
| security_group_ids    | 030cada4-78f1-4390-82a1-706dccca84cf                                       |
| status                | DOWN                                                                       |
| tags                  |                                                                            |
| trunk_details         | None                                                                       |
| updated_at            | 2019-02-19T11:15:32Z                                                       |

On one of the controllers I can see this error:

DEBUG neutron.plugins.ml2.plugin [req-2dffc726-8391-48fb-bffd-aee5e448b9f3 - - - - -] Port 2abbba96-5239-4b26-810e-d949345472be cannot update to ACTIVE because it is not bound.

Version-Release number of selected component (if applicable):
OSP14; the issue started to appear since puddle 2019-02-07.1.

How reproducible:
I can see this issue in about 60-75% of cases, in both CI and manual deployments. In CI the issue appears in deployments with 3 controllers, 2 computes + Ceph backend, IPv4 (VLAN, VXLAN, GRE).

Steps to Reproduce:
1. Deploy the mentioned topologies using InfraRed with the Tempest stage included
2. Wait for Tempest to finish the full test suite
3. Manually re-run one of the test_resize_server_revert.* tests multiple times

Actual results:
In more than 50% of cases these tests fail.

Expected results:
Tests should pass in 100% of cases.

Additional info:
/var/log gathered during this test on the compute-1 node is attached (compute-1-failed-run.tar.gz); tempest.log for the failed test is attached (tempest.resize_server_revert.fail.log).
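As referenced above, a small polling loop against the Neutron ports API makes the ACTIVE-to-DOWN transition easy to capture while re-running the test. This is a placeholder sketch: the endpoint and token are not from this environment, and an admin token is assumed so that binding:host_id is visible.

```python
# Placeholder sketch: poll a port's status and binding host while the revert
# resize is in progress, to capture the ACTIVE -> DOWN transition described
# above. Endpoint and token are placeholders; an admin token is assumed.
import time
import requests

NEUTRON_URL = "http://controller.example:9696"
TOKEN = "gAAAA..."
PORT_ID = "2abbba96-5239-4b26-810e-d949345472be"   # port from the output above

def poll_port(interval=5, duration=360):
    deadline = time.time() + duration
    while time.time() < deadline:
        resp = requests.get(
            f"{NEUTRON_URL}/v2.0/ports/{PORT_ID}",
            headers={"X-Auth-Token": TOKEN},
        )
        port = resp.json()["port"]
        print(time.strftime("%H:%M:%S"),
              port["status"], port["binding:host_id"])
        time.sleep(interval)

if __name__ == "__main__":
    poll_port()
```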