Description of problem:

Nova evacuate fails with a timeout while waiting for a network-vif-plugged event for the instance.

Version-Release number of selected component (if applicable):

How reproducible:
100% (evacuation after rebooting the instance)

Steps to Reproduce:
1. Stop a compute node:
   $ openstack compute service set --disable --down compute-0 nova-compute
   $ openstack baremetal node power off compute-0
2. Evacuate an instance running on that node:
   $ nova evacuate <instance-id>
3. The instance status eventually changes from ACTIVE to ERROR.

Actual results:
The instance status changes from ACTIVE to REBUILD, then ERROR.

Expected results:
The instance status changes from ACTIVE to REBUILD, then ACTIVE.

Additional info:
The network backend is OVS. The unexpected event network-vif-plugged-<PORT-ID> is received before Nova prepares to wait for the external event network-vif-plugged-<PORT-ID>, as in the following log:

~~~
...
2021-01-18 18:07:51.043 7 WARNING nova.compute.manager Received unexpected event network-vif-plugged-<PORT-ID> for instance with vm_state active and task_state rebuilding.
...
2021-01-18 18:07:53.841 7 DEBUG nova.compute.manager Preparing to wait for external event network-vif-plugged-<PORT-ID>
...
2021-01-18 18:12:56.131 7 ERROR nova.compute.manager Setting instance vm_state to ERROR
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6442, in _create_domain_and_network
    network_info)
  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event
    actual_event = event.wait()
  File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
    result = hub.switch()
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
eventlet.timeout.Timeout: 300 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 8974, in _error_out_instance_on_exception
    yield
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 3388, in rebuild_instance
    migration, request_spec, allocs)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 3450, in _do_rebuild_instance_with_claim
    self._do_rebuild_instance(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 3610, in _do_rebuild_instance
    self._rebuild_default_impl(**kwargs)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 3275, in _rebuild_default_impl
    block_device_info=new_block_device_info)
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 3647, in spawn
    cleanup_instance_disks=created_disks)
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6465, in _create_domain_and_network
    raise exception.VirtualInterfaceCreateException()
nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
...
~~~
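For context, the 300-second wait in the traceback is governed by two nova.conf options in [DEFAULT]: vif_plugging_timeout (default 300 seconds) and vif_plugging_is_fatal (default True, which turns the timeout into the VirtualInterfaceCreateException seen above). Below is a minimal diagnostic sketch, not part of Nova, that simply reads those two values from a standard nova.conf; the file path is an assumption and will differ on containerized deployments:

~~~
# Illustrative diagnostic only: print the nova.conf options that govern the
# timeout seen in the traceback above.
import configparser

# Assumed stock location; containerized (TripleO-based) deployments keep the
# file under /var/lib/config-data/ instead.
NOVA_CONF = "/etc/nova/nova.conf"

# interpolation=None avoids errors on '%' characters common in nova.conf values.
cfg = configparser.ConfigParser(interpolation=None)
cfg.read(NOVA_CONF)

# Both options live in [DEFAULT]; fall back to the upstream defaults when unset.
timeout = cfg.getint("DEFAULT", "vif_plugging_timeout", fallback=300)
fatal = cfg.getboolean("DEFAULT", "vif_plugging_is_fatal", fallback=True)

print(f"vif_plugging_timeout = {timeout}s")
print(f"vif_plugging_is_fatal = {fatal}")
# With the defaults (300 s, fatal), a missing network-vif-plugged event makes
# the libvirt driver raise VirtualInterfaceCreateException, which is what the
# evacuation rebuild hits here.
~~~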
Recapping the initial investigation and team discussion: network-vif-plugged events are not received at the expected moments of the migration (DPDK is in use here, but the same can happen in standard ML2/OVS environments). Since the host evacuation flow is very close to the live migration one, this is most probably the same issue as the race condition observed in live migrations [1], which was split off from the initial live-migration disruption issue [2][3].

Nova expects a single event announcing the destination port binding; once it arrives, Nova considers the port bound and ready to transmit data. But on the Neutron side, several triggers were firing this event unexpectedly:

- When the port binding is updated, the port is set to down and then up again, forcing this event.
- When the port binding is updated, the binding is first deleted and then updated with the new information. That makes the source host set the port down and up again, sending the event.

Rodolfo and Sean worked on cleaning up that workflow, and the upstream master patches are now merged on both the Nova [4] and Neutron [5][6][7] sides. While this still requires additional testing and confirmation, these patches should also fix the issue observed here. A minimal sketch of the ordering race follows the reference list below.

[1] https://bugs.launchpad.net/neutron/+bug/1901707
[2] https://bugs.launchpad.net/neutron/+bug/1815989
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1860395
[4] https://review.opendev.org/c/openstack/nova/+/767368
[5] https://review.opendev.org/c/openstack/neutron/+/640258
[6] https://review.opendev.org/c/openstack/neutron/+/753314
[7] https://review.opendev.org/c/openstack/neutron/+/766277
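To make the race concrete, here is an illustrative sketch (not Nova's actual external-event machinery; all class and function names are made up) of what the log captures: an event delivered before a waiter is registered is discarded as "unexpected", and the later wait then times out:

~~~
# Minimal simulation of the ordering race; names are hypothetical, not Nova code.
import threading

class EventWaiter:
    """Tiny stand-in for a per-instance external-event registry."""

    def __init__(self):
        self._pending = {}          # event name -> threading.Event
        self._lock = threading.Lock()

    def prepare(self, name):
        """Analogue of 'Preparing to wait for external event <name>'."""
        with self._lock:
            self._pending[name] = threading.Event()

    def deliver(self, name):
        """Event pushed from the network backend (Neutron in the real system)."""
        with self._lock:
            waiter = self._pending.get(name)
        if waiter is None:
            # Nobody is waiting yet -> 'Received unexpected event <name>'; dropped.
            print(f"unexpected event {name}, dropping")
            return
        waiter.set()

    def wait(self, name, timeout):
        if not self._pending[name].wait(timeout):
            # Corresponds to eventlet.timeout.Timeout -> VirtualInterfaceCreateException.
            raise TimeoutError(f"timed out waiting for {name}")

waiter = EventWaiter()
event = "network-vif-plugged-<PORT-ID>"

waiter.deliver(event)              # arrives too early -> dropped as unexpected
waiter.prepare(event)              # Nova only now starts waiting
try:
    waiter.wait(event, timeout=1)  # 300 s in the real deployment
except TimeoutError as exc:
    print(exc)                     # the rebuild fails and the instance goes to ERROR
~~~

Roughly speaking, the upstream patches above clean up when the event is emitted and how it is waited for, so that delivery no longer lands in the unregistered window simulated here.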
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483