Bug 1731968 - Instance evacuation failed: Virtual Interface creation failed [NEEDINFO]
Summary: Instance evacuation failed: Virtual Interface creation failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Severity: high
Priority: high
Target Milestone: z1
Target Release: 15.0 (Stein)
Assignee: Terry Wilson
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1727856
Blocks:
 
Reported: 2019-07-22 12:57 UTC by Ido Ovadia
Modified: 2019-10-03 08:42 UTC

Fixed In Version: python-networking-ovn-6.0.1-0.20190924050427.1242c73.el8ost
Doc Type: Bug Fix
Doc Text:
If a Compute host crashed and ovn-controller did not clean up the Port_Binding chassis column, the Logical_Switch_Port was never set to DOWN. Because the UP-to-DOWN transition was never detected, the port status was never updated, which blocked workload evacuation after a Compute failure. This patch monitors the Port_Binding chassis column for changes, which restores transition detection and allows successful evacuation when a Compute node fails.
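A minimal sketch of the transition-detection idea described in the Doc Text (an illustration, not the actual networking-ovn code — the class and method names here are hypothetical): watch the Port_Binding "chassis" column and treat a cleared chassis as an UP-to-DOWN transition, so the port status gets updated even though ovn-controller on the dead host never reported it.

```python
class PortBinding:
    """Minimal stand-in for an OVN southbound Port_Binding row."""
    def __init__(self, logical_port, chassis):
        self.logical_port = logical_port
        self.chassis = chassis  # name of the hosting chassis, or None


class PortStatusWatcher:
    """Tracks chassis assignments and records UP/DOWN transitions."""
    def __init__(self):
        self._chassis_by_port = {}
        self.events = []  # (logical_port, new_status) tuples

    def on_port_binding_update(self, row):
        old = self._chassis_by_port.get(row.logical_port)
        new = row.chassis
        self._chassis_by_port[row.logical_port] = new
        # The key case from this bug: the chassis column is cleared
        # while the port was still considered UP.
        if old is not None and new is None:
            self.events.append((row.logical_port, "DOWN"))
        elif old is None and new is not None:
            self.events.append((row.logical_port, "UP"))


watcher = PortStatusWatcher()
watcher.on_port_binding_update(PortBinding("vm-port-1", "compute-0"))
# Compute host dies; the chassis column is cleared.
watcher.on_port_binding_update(PortBinding("vm-port-1", None))
print(watcher.events)  # [('vm-port-1', 'UP'), ('vm-port-1', 'DOWN')]
```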
Clone Of:
Environment:
Last Closed: 2019-10-03 08:42:25 UTC
Target Upstream Version:
morazi: needinfo? (mbracho)


Attachments (Terms of Use)
nova-compute.log (34.60 KB, text/plain)
2019-07-22 12:57 UTC, Ido Ovadia


Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 678239 None MERGED Fix evacuation when host dies uncleanly 2019-11-25 14:46:46 UTC
Red Hat Product Errata RHBA-2019:2957 None None None 2019-10-03 08:42:29 UTC

Description Ido Ovadia 2019-07-22 12:57:37 UTC
Created attachment 1592586 [details]
nova-compute.log

Description of problem:
=======================
nova evacuate of a single instance failed

ERROR nova.compute.manager [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Setting instance vm_state to ERROR: nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed

Version-Release number of selected component:
=============================================
RHOS_TRUNK-15.0-RHEL-8-20190701.n.0

How reproducible:
=================
100%

Steps to Reproduce:
===================
1. Deploy OSPD 15 HA (undercloud, 3*controller, 2*compute, 3*ceph) 
2. Create an instance
3. Shut down the source compute host
4. Evacuate the instance: nova evacuate instance-test-evc compute-0.localdomain

Actual results:
===============
Evacuation failed 


2019-07-22 12:10:55.132 6 WARNING nova.virt.libvirt.driver [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Timeout waiting for [('network-vif-plugged', '2ebb1cde-45a7-498c-9a56-2a276839b710')] for instance with vm_state active and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds
2019-07-22 12:10:55.991 6 INFO os_vif [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] Successfully unplugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:8b:c0:bc,bridge_name='br-int',has_traffic_filtering=True,id=2ebb1cde-45a7-498c-9a56-2a276839b710,network=Network(3626ea0a-ed25-4428-9e86-836287b2bf1f),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap2ebb1cde-45')
2019-07-22 12:10:56.096 6 INFO nova.virt.libvirt.driver [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Deleting instance files /var/lib/nova/instances/a3b0ca35-1342-444a-bae2-0731d4930bcc_del
2019-07-22 12:10:56.097 6 INFO nova.virt.libvirt.driver [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Deletion of /var/lib/nova/instances/a3b0ca35-1342-444a-bae2-0731d4930bcc_del complete
2019-07-22 12:10:56.616 6 ERROR nova.compute.manager [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Setting instance vm_state to ERROR: nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
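An illustrative sketch of the failure mode in the log above (assumptions, not nova source; the function name is hypothetical): nova-compute blocks during spawn until Neutron sends a "network-vif-plugged" event, giving up after a configurable timeout. Because the port was never marked DOWN and replugged, the event never arrives, the 300-second wait expires, and spawn fails with VirtualInterfaceCreateException.

```python
import threading


class VirtualInterfaceCreateException(Exception):
    pass


def wait_for_vif_plugged(event: threading.Event, timeout: float):
    """Block until the vif-plugged notification arrives, or raise."""
    if not event.wait(timeout):
        raise VirtualInterfaceCreateException(
            "Virtual Interface creation failed")


plugged = threading.Event()
try:
    # Short timeout for illustration; nova's default is 300 seconds.
    wait_for_vif_plugged(plugged, timeout=0.05)
except VirtualInterfaceCreateException as exc:
    print(exc)  # Virtual Interface creation failed
```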

Expected results:
=================
Evacuation completes successfully

Additional info:
================

nova-compute.log enclosed

Comment 1 Stephen Finucane 2019-07-23 12:56:16 UTC
(In reply to Ido Ovadia from comment #0)
> Steps to Reproduce:
> ===================
> 1. Deploy OSPD 15 HA (undercloud, 3*controller, 2*compute, 3*ceph) 
> 2. Create an instance
> 3. Shut down the source compute host
> 4. Evacuate the instance: nova evacuate instance-test-evc
> compute-0.localdomain

When you say shut down the source compute host, what do you mean? The whole compute node or the nova container?

Could we get full sosreports so we can see what's happening from the neutron side also?

Comment 2 Ido Ovadia 2019-07-23 14:43:23 UTC
(In reply to Stephen Finucane from comment #1)
> (In reply to Ido Ovadia from comment #0)
> > Steps to Reproduce:
> > ===================
> > 1. Deploy OSPD 15 HA (undercloud, 3*controller, 2*compute, 3*ceph) 
> > 2. Create an instance
> > 3. Shut down the source compute host
> > 4. Evacuate the instance: nova evacuate instance-test-evc
> > compute-0.localdomain
> 
> When you say shut down the source compute host, what do you mean? The whole
> compute node or the nova container?

Whole compute node https://docs.openstack.org/nova/rocky/admin/evacuate.html

> 
> Could we get full sosreports so we can see what's happening from the neutron
> side also.

Comment 3 Stephen Finucane 2019-07-23 14:48:01 UTC
Okay, thanks. Could we get sosreports, please.

Comment 6 Maciej Józefczyk 2019-07-30 14:11:17 UTC
Please check this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1720675 and https://review.opendev.org/#/c/665581
We decided to set live_migration_wait_for_vif_plug=False for OSP 15. The related changes are already merged, so this should solve the problem. Could you please verify with live_migration_wait_for_vif_plug set to False?
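For reference, the workaround suggested above would be set in nova.conf; placing the option in the [compute] group is an assumption based on upstream nova defaults:

```ini
[compute]
# Workaround from comment 6: do not wait for the vif-plugged event
# before proceeding with the migration.
live_migration_wait_for_vif_plug = False
```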

Comment 7 Ido Ovadia 2019-07-31 14:10:33 UTC
(In reply to Maciej Józefczyk from comment #6)
> Please check this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1720675
> and https://review.opendev.org/#/c/665581
> We decided to set live_migration_wait_for_vif_plug=False for osp15. Related
> changes are already merged so it should solve this problem. Could you please
> verify this? The flag live_migration_wait_for_vif_plug should be set to
> False.

live_migration_wait_for_vif_plug set to False

and live migration works successfully

Comment 14 Terry Wilson 2019-08-26 13:42:56 UTC
The upstream test has been passing in the check phase but not the gate phase (it is the same test). I don't think this is related to the patch; I'm verifying that. But after talking with Daniel, I'm going to remove the blocker flag from this bug. This is something that has been broken in networking-ovn for a while (it reproduces on OSP 13). Although the default is changing to OVN in OSP 15, people who upgrade won't be switched from ml2/ovs to OVN. We're going to treat this as a regular bug.

Comment 29 Roman Safronov 2019-09-26 21:09:25 UTC
Verified on puddle 15.0-RHEL-8/RHOS_TRUNK-15.0-RHEL-8-20190924.n.2 which uses python3-networking-ovn-6.0.1-0.20190924050427.1242c73.el8ost.noarch

Verified that instance evacuation works as expected.

Some details regarding setup and scenario:
OSP15 with OVN HA (3 controllers, 2 computes, 3 ceph nodes). Link to the build: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OSPD-Customized-Deployment-virt/12643/

1. Created external and internal networks, a router, a keypair, and a security group with rules allowing ping and SSH login. Connected the internal and external networks to the router.
2. Launched an instance connected to the internal network. Created a floating IP for the instance on the external network. Verified that the instance is accessible via the floating IP.
3. Ungracefully turned off the compute node where the instance was running ("Force Off" or "virsh destroy"; also tried a kernel panic using 'echo c > /proc/sysrq-trigger').
4. Waited until "openstack compute service list" showed the turned-off compute node as "down".
5. Initiated instance evacuation, i.e. executed the command "nova evacuate vm1-net1 compute-1.redhat.local".
6. Checked that the instance was rebuilt on the target compute node.
7. Verified that the instance is actually running on the target compute node and has network connectivity. The instance received its hostname and SSH key from the metadata service.
8. Powered on the previously turned-off compute host and repeated steps 3-7, this time evacuating to compute-0.redhat.local.

Comment 31 errata-xmlrpc 2019-10-03 08:42:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2957

