Bug 1727856 - Instance evacuation is failing in OVN environment
Summary: Instance evacuation is failing in OVN environment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Terry Wilson
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks: 1731968
 
Reported: 2019-07-08 11:27 UTC by Sandeep Yadav
Modified: 2020-11-17 12:04 UTC
CC List: 29 users

Fixed In Version: python-networking-ovn-4.0.3-14.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, when you performed a hard reset on a node, `ovn-controller` was disabled without cleaning the entry in the chassis column, and the corresponding port status was not set to `down`. `networking-ovn` monitors only for status changes between `down` and `up`, and so the port binding status was not updated and the port did not become `ACTIVE`. As a result, instance evacuation failed because the port was not `ACTIVE`. With this update, `networking-ovn` monitors the `Port_Binding` table for chassis changes and triggers a port status change to `ACTIVE` when a port moves to another chassis. As a result, instance evacuation functions normally.
Clone Of:
Environment:
Last Closed: 2019-11-07 14:00:05 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID                               Status   Summary                                   Last Updated
OpenStack gerrit 677603                 MERGED   Fix evacuation when host dies uncleanly   2020-11-19 12:09:14 UTC
OpenStack gerrit 678241                 MERGED   Fix evacuation when host dies uncleanly   2020-11-19 12:09:14 UTC
Red Hat Product Errata RHBA-2019:3803   None     None                                      2019-11-07 14:00:34 UTC
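
The Doc Text above (and the gerrit changes linked in this table) describe the fix: instead of reacting only to up/down transitions of the port binding, networking-ovn also watches the OVN southbound Port_Binding table for chassis changes and marks the port ACTIVE when it lands on a new chassis. A minimal sketch of that idea follows; the class and method names are hypothetical and are not the actual networking-ovn code:

~~~
# Illustrative sketch only: ChassisChangeWatcher and set_port_status_up are
# hypothetical names, not the real networking-ovn classes or methods.

class ChassisChangeWatcher(object):
    """Watch OVN southbound Port_Binding rows for chassis changes."""

    def __init__(self, plugin_driver):
        self.driver = plugin_driver

    def on_port_binding_update(self, row, old_row):
        # Before the fix, only up/down status transitions were watched, so a
        # port whose chassis died uncleanly never transitioned to ACTIVE.
        old_chassis = getattr(old_row, 'chassis', None)
        new_chassis = getattr(row, 'chassis', None)
        if new_chassis and new_chassis != old_chassis:
            # The binding moved to a new chassis: mark the Neutron port
            # ACTIVE so Nova receives network-vif-plugged and the
            # evacuation's rebuild can finish.
            self.driver.set_port_status_up(row.logical_port)
~~~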

Description Sandeep Yadav 2019-07-08 11:27:48 UTC
Description of problem:

Instance evacuation is failing in an environment deployed with OVN; the operation times out with the errors below.

~~~
2019-07-08 05:55:34.522 1 WARNING nova.virt.libvirt.driver [req-665011e5-be4c-43ac-a002-7ba661caeb46 7b5516ddd8a9477388a1f4e8e0764fa2 3d769c76682347f597643ad3509b5354 - default default] [instance: XXXXXX-269e-XXXXXX-221e3ae0739] Timeout waiting for [('network-vif-plugged', u'065645c2-c485-44a0-84ba-4336b9fcd41d')] for instance with vm_state active and task_state rebuild_spawning.: Timeout: 300 seconds
~~~

~~~
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [req-665011e5-be4c-43ac-a002-7ba661caeb46 7b5516ddd8a9477388a1f4e8e0764fa2 3d769c76682347f597643ad3509b5354 - default default] [instance: 
XXXXXX-269e-XXXXXX-221e3ae0739] Setting instance vm_state to ERROR: VirtualInterfaceCreateException: Virtual Interface creation failed
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739] Traceback (most recent call last):
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7559, in _error_out_instance_on_exception
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     yield
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2904, in rebuild_instance
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     migration, request_spec)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2966, in _do_rebuild_instance_with_claim
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     self._do_rebuild_instance(*args, **kwargs)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3123, in _do_rebuild_instance
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     self._rebuild_default_impl(**kwargs)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2810, in _rebuild_default_impl
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     block_device_info=new_block_device_info)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3114, in spawn
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     destroy_disks_on_failure=True)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5597, in _create_domain_and_network
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     raise exception.VirtualInterfaceCreateException()
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739] VirtualInterfaceCreateException: Virtual Interface creation failed
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739] 
~~~


Version-Release number of selected component (if applicable):

# cat /etc/rhosp-release 
Red Hat OpenStack Platform release 13.0.5 (Queens)

$ rpm -qa | grep -i openstack-nova
openstack-nova-api-17.0.9-2.el7ost.noarch                   Mon May 20 05:39:18 2019
openstack-nova-common-17.0.9-2.el7ost.noarch                Mon May 20 05:39:03 2019
openstack-nova-compute-17.0.9-2.el7ost.noarch               Mon May 20 05:39:07 2019
openstack-nova-conductor-17.0.9-2.el7ost.noarch             Mon May 20 05:39:18 2019
openstack-nova-console-17.0.9-2.el7ost.noarch               Mon May 20 05:39:18 2019
openstack-nova-migration-17.0.9-2.el7ost.noarch             Mon May 20 05:39:17 2019
openstack-nova-novncproxy-17.0.9-2.el7ost.noarch            Mon May 20 05:39:18 2019
openstack-nova-placement-api-17.0.9-2.el7ost.noarch         Mon May 20 05:39:18 2019
openstack-nova-scheduler-17.0.9-2.el7ost.noarch             Mon May 20 05:39:18 2019


 $ rpm -qa | grep -i networking-ovn
python-networking-ovn-4.0.3-3.el7ost.noarch                 Mon May 20 05:39:16 2019
python-networking-ovn-metadata-agent-4.0.3-3.el7ost.noarch  Mon May 20 05:39:17 2019


How reproducible: every time  


Steps to Reproduce:
1. Launch instance
2. Power off compute node  
3. Evacuate the instance with Nova - it fails with a "Virtual Interface creation failed" error (example commands below)
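
For reference, a possible way to reproduce this with the standard CLI (the instance name, image, flavor, and network below are placeholders):

~~~
# 1. Launch an instance (names are examples)
openstack server create --image cirros --flavor m1.small --network private test-vm

# 2. Power off the compute node hosting the instance, e.g. via IPMI/BMC,
#    so the host dies uncleanly.

# 3. Evacuate the instance; the rebuild times out waiting for the
#    network-vif-plugged event and the instance goes to ERROR with
#    "Virtual Interface creation failed".
nova evacuate test-vm
~~~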


Actual results:

Nova evacuation is failing

Expected results:

Nova evacuation should work fine


Additional info:

Evacuation fails only when the source compute node itself is down; it does not fail when only the nova_compute container is down.

Comment 5 Stephen Finucane 2019-07-12 12:40:14 UTC
Not sure if this is an issue with nova or networking-ovn. Can you please provide sosreports?

Comment 6 Matthew Booth 2019-07-12 14:59:32 UTC
Sosreports: http://collab-shell.usersys.redhat.com/02418390/

Comment 10 Sandeep Yadav 2019-07-19 02:45:53 UTC
Created attachment 1591901 [details]
Workaround testing from lab

Hello Engineering Team,

We have found the following workaround, but we are not sure of its implications:

~~~
1. On the destination node, set "vif_plugging_is_fatal = False" in nova.conf
2. Restart the docker container on the destination compute node
~~~


Setting "vif_plugging_is_fatal = False" helps in successfull evacuation.

I have tested these steps in my lab; they work for me and for the customer as well.

To verify that the interface is up and working after the evacuation, I attached a floating IP and was able to ping and SSH into the instance. The test results are attached in a file.


Could you please confirm the implications of this workaround? Is it safe to use in a production environment until an official fix is available? If not, could you please suggest an alternative workaround?

Regards
Sandeep

Comment 11 Stephen Finucane 2019-07-19 14:41:33 UTC
(In reply to Sandeep Yadav from comment #10)
> Created attachment 1591901 [details]
> Workaround testing from lab
> 
> Hello Engineering Team,
> 
> We have found the following workaround, but we are not sure of its
> implications:
> 
> ~~~
> 1. On the destination node, set "vif_plugging_is_fatal = False" in
> nova.conf
> 2. Restart the docker container on the destination compute node
> ~~~
> 
> 
> Setting "vif_plugging_is_fatal = False" allows the evacuation to succeed.
> 
> I have tested these steps in my lab; they work for me and for the customer
> as well.
> 
> To verify that the interface is up and working after the evacuation, I
> attached a floating IP and was able to ping and SSH into the instance. The
> test results are attached in a file.
> 
> 
> Could you please confirm the implications of this workaround? Is it safe to
> use in a production environment until an official fix is available? If not,
> could you please suggest an alternative workaround?

This shouldn't be necessary for a spawn operation. I imagine that by doing this, you will end up with a migrated server that has no network connectivity.

Given that things fail only when the entire host is down, and not when just the nova container is down, I suspect the issue lies with neutron/networking-ovn rather than nova. For this reason, I'm reassigning this to the relevant component.

Comment 25 Roman Safronov 2019-10-27 13:55:46 UTC
Verified on:
13.0-RHEL-7/2019-10-23.1 with openvswitch-2.9.0-117.bz1733374.1.el7ost.x86_64 and python-networking-ovn-4.0.3-14.el7ost.noarch

Verified that instance evacuation succeeds.

Verified according to the following verification scenario: https://bugzilla.redhat.com/show_bug.cgi?id=1731968#c29

Comment 26 Alex McLeod 2019-10-31 11:32:51 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Comment 28 errata-xmlrpc 2019-11-07 14:00:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3803

