Description of problem:

Instance evacuation fails in an environment deployed with OVN, timing out with the errors below.

~~~
2019-07-08 05:55:34.522 1 WARNING nova.virt.libvirt.driver [req-665011e5-be4c-43ac-a002-7ba661caeb46 7b5516ddd8a9477388a1f4e8e0764fa2 3d769c76682347f597643ad3509b5354 - default default] [instance: XXXXXX-269e-XXXXXX-221e3ae0739] Timeout waiting for [('network-vif-plugged', u'065645c2-c485-44a0-84ba-4336b9fcd41d')] for instance with vm_state active and task_state rebuild_spawning.: Timeout: 300 seconds
~~~

~~~
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [req-665011e5-be4c-43ac-a002-7ba661caeb46 7b5516ddd8a9477388a1f4e8e0764fa2 3d769c76682347f597643ad3509b5354 - default default] [instance: XXXXXX-269e-XXXXXX-221e3ae0739] Setting instance vm_state to ERROR: VirtualInterfaceCreateException: Virtual Interface creation failed
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739] Traceback (most recent call last):
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7559, in _error_out_instance_on_exception
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     yield
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2904, in rebuild_instance
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     migration, request_spec)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2966, in _do_rebuild_instance_with_claim
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     self._do_rebuild_instance(*args, **kwargs)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3123, in _do_rebuild_instance
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     self._rebuild_default_impl(**kwargs)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2810, in _rebuild_default_impl
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     block_device_info=new_block_device_info)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3114, in spawn
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     destroy_disks_on_failure=True)
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5597, in _create_domain_and_network
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]     raise exception.VirtualInterfaceCreateException()
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739] VirtualInterfaceCreateException: Virtual Interface creation failed
2019-07-08 05:55:36.768 1 ERROR nova.compute.manager [instance: XXXXXX-269e-XXXXXX-221e3ae0739]
~~~
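The 300-second timeout in the WARNING above corresponds to nova's vif_plugging_timeout option; together with vif_plugging_is_fatal it controls whether a missing "network-vif-plugged" event from Neutron aborts the spawn. For reference, a sketch of the relevant nova.conf settings (these are the upstream defaults, shown for illustration only):

~~~
# /etc/nova/nova.conf on the compute node
[DEFAULT]
# Seconds to wait for Neutron's network-vif-plugged event before giving up.
vif_plugging_timeout = 300
# If True (the default), hitting the timeout above fails the spawn with
# VirtualInterfaceCreateException, as in the traceback above.
vif_plugging_is_fatal = True
~~~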
Version-Release number of selected component (if applicable):

# cat /etc/rhosp-release
Red Hat OpenStack Platform release 13.0.5 (Queens)

$ rpm -qa | grep -i openstack-nova
openstack-nova-api-17.0.9-2.el7ost.noarch            Mon May 20 05:39:18 2019
openstack-nova-common-17.0.9-2.el7ost.noarch         Mon May 20 05:39:03 2019
openstack-nova-compute-17.0.9-2.el7ost.noarch        Mon May 20 05:39:07 2019
openstack-nova-conductor-17.0.9-2.el7ost.noarch      Mon May 20 05:39:18 2019
openstack-nova-console-17.0.9-2.el7ost.noarch        Mon May 20 05:39:18 2019
openstack-nova-migration-17.0.9-2.el7ost.noarch      Mon May 20 05:39:17 2019
openstack-nova-novncproxy-17.0.9-2.el7ost.noarch     Mon May 20 05:39:18 2019
openstack-nova-placement-api-17.0.9-2.el7ost.noarch  Mon May 20 05:39:18 2019
openstack-nova-scheduler-17.0.9-2.el7ost.noarch      Mon May 20 05:39:18 2019

$ rpm -qa | grep -i networking-ovn
python-networking-ovn-4.0.3-3.el7ost.noarch                 Mon May 20 05:39:16 2019
python-networking-ovn-metadata-agent-4.0.3-3.el7ost.noarch  Mon May 20 05:39:17 2019

How reproducible:
Every time.

Steps to Reproduce:
1. Launch an instance.
2. Power off the compute node hosting it.
3. Evacuate the instance with nova; it fails with the "Virtual Interface creation failed" error (see the CLI sketch at the end of this comment).

Actual results:
Nova evacuation fails.

Expected results:
Nova evacuation should succeed.

Additional info:
Evacuation fails only when the source compute host itself is down; it does not fail when only the nova_compute container is down.
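For reference, a minimal sketch of the reproduction above using the standard CLI clients; the server, flavor, image, network, and host names are placeholders, and the exact invocations may vary with client version:

~~~
# 1. Launch an instance (names here are hypothetical).
$ openstack server create --flavor m1.small --image rhel7 --network private test-vm

# 2. Power off the compute node hosting it (e.g. via its BMC), then mark
#    the compute service as forced-down so nova allows the evacuation.
$ nova service-force-down <source-compute> nova-compute

# 3. Evacuate; in the affected environment this times out after
#    vif_plugging_timeout and fails with "Virtual Interface creation failed".
$ nova evacuate test-vm
~~~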
Not sure if this is an issue with nova or networking-ovn. Can you please provide sosreports?
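For completeness, sosreports can be collected on each affected controller and compute node with something like the following (options per the sos version shipped with RHEL 7; adjust as needed):

~~~
# Run on each affected node; --batch skips the interactive prompts.
$ sosreport --batch --case-id 02418390
~~~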
Sosreports: http://collab-shell.usersys.redhat.com/02418390/
Created attachment 1591901 [details]
Workaround testing from lab

Hello Engineering Team,

We have found a workaround, described below, but we are not sure of its implications:

~~~
1. On the destination compute node, set "vif_plugging_is_fatal = False" in nova.conf.
2. Restart the docker container on the destination compute node.
~~~

Setting "vif_plugging_is_fatal = False" allows the evacuation to succeed.

I have tested these steps in my lab and they work for me and for the customer.

To check whether the interface is actually up (working after evacuation), I attached a floating IP and was able to ping and ssh into the instance. The test results are attached in a file.

Could you please confirm the implications of this workaround? Is it safe to use in a production environment until you come up with an official fix? If not, could you please suggest an alternative workaround?

Regards,
Sandeep
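For anyone else hitting this, a sketch of how we applied the workaround on an OSP 13 containerized compute node; the puppet-generated config path and the nova_compute container name are the usual OSP 13 defaults, so adjust them to your deployment:

~~~
# On the destination compute node:
$ crudini --set \
    /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf \
    DEFAULT vif_plugging_is_fatal False
$ docker restart nova_compute
~~~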
(In reply to Sandeep Yadav from comment #10)
> Created attachment 1591901 [details]
> Workaround testing from lab
> 
> Hello Engineering Team,
> 
> We have found a workaround, described below, but we are not sure of its
> implications:
> 
> ~~~
> 1. On the destination compute node, set "vif_plugging_is_fatal = False" in
> nova.conf.
> 2. Restart the docker container on the destination compute node.
> ~~~
> 
> Setting "vif_plugging_is_fatal = False" allows the evacuation to succeed.
> 
> I have tested these steps in my lab and they work for me and for the
> customer.
> 
> To check whether the interface is actually up (working after evacuation), I
> attached a floating IP and was able to ping and ssh into the instance. The
> test results are attached in a file.
> 
> Could you please confirm the implications of this workaround? Is it safe to
> use in a production environment until you come up with an official fix? If
> not, could you please suggest an alternative workaround?

This shouldn't be necessary for a spawn operation. I imagine that by doing this, you could end up with a migrated server that has no network connectivity.

Given that things fail only when the entire host is down, and not when just the nova_compute container is down, I suspect the issue lies with neutron/networking-ovn rather than nova. For this reason, I'm reassigning this to the relevant component.
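If you want to check for that after an evacuation with the workaround in place, the Neutron port binding is a reasonable thing to inspect (the port ID below is the one from the logs; column names per the Queens-era client):

~~~
# A DOWN status, or a binding_host_id still pointing at the failed source
# host, would indicate the no-connectivity case described above.
$ openstack port show 065645c2-c485-44a0-84ba-4336b9fcd41d -c status -c binding_host_id
~~~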
Verified on: 13.0-RHEL-7/2019-10-23.1
with openvswitch-2.9.0-117.bz1733374.1.el7ost.x86_64
and python-networking-ovn-4.0.3-14.el7ost.noarch

Verified that instance evacuation succeeds.

Verified according to the following verification scenario:
https://bugzilla.redhat.com/show_bug.cgi?id=1731968#c29
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to -.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3803