Description of problem:

Nova evacuate fails with a timeout while waiting for a network-vif-plugged event for the instance.

Version-Release number of selected component (if applicable):

How reproducible:
100% (evacuation after rebooting the instance)

Steps to Reproduce:
1. Stop a compute node:
   $ openstack compute service set --disable --down compute-0 nova-compute
   $ openstack baremetal node power off compute-0
2. Evacuate an instance running on that node:
   $ nova evacuate <instance-id>
3. The instance status eventually changes from ACTIVE to ERROR.

Actual results:
The instance status changes from ACTIVE to REBUILD, then ERROR.

Expected results:
The instance status changes from ACTIVE to REBUILD, then ACTIVE.

Additional info:
The network backend is OVS. The unexpected event network-vif-plugged-<PORT-ID> is received before Nova prepares to wait for the external event network-vif-plugged-<PORT-ID>, as in the following log:

~~~
...
2021-01-18 18:07:51.043 7 WARNING nova.compute.manager Received unexpected event network-vif-plugged-<PORT-ID> for instance with vm_state active and task_state rebuilding.
...
2021-01-18 18:07:53.841 7 DEBUG nova.compute.manager Preparing to wait for external event network-vif-plugged-<PORT-ID>
...
2021-01-18 18:12:56.131 7 ERROR nova.compute.manager Setting instance vm_state to ERROR
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6442, in _create_domain_and_network
    network_info)
  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 478, in wait_for_instance_event
    actual_event = event.wait()
  File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
    result = hub.switch()
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
eventlet.timeout.Timeout: 300 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 8974, in _error_out_instance_on_exception
    yield
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 3388, in rebuild_instance
    migration, request_spec, allocs)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 3450, in _do_rebuild_instance_with_claim
    self._do_rebuild_instance(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 3610, in _do_rebuild_instance
    self._rebuild_default_impl(**kwargs)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 3275, in _rebuild_default_impl
    block_device_info=new_block_device_info)
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 3647, in spawn
    cleanup_instance_disks=created_disks)
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 6465, in _create_domain_and_network
    raise exception.VirtualInterfaceCreateException()
nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
...
~~~
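For context, the 300-second wait in the traceback is governed by two nova.conf options in [DEFAULT]: vif_plugging_timeout (default 300 seconds) and vif_plugging_is_fatal (default True, which turns the timeout into the VirtualInterfaceCreateException seen above). Below is a minimal diagnostic sketch, not part of Nova, that simply reads those two values from a standard nova.conf; the file path is an assumption and will differ on containerized deployments:

~~~
# Illustrative diagnostic only: print the nova.conf options that govern the
# timeout seen in the traceback above.
import configparser

# Assumed stock location; containerized (TripleO-based) deployments keep the
# file under /var/lib/config-data/ instead.
NOVA_CONF = "/etc/nova/nova.conf"

# interpolation=None avoids errors on '%' characters common in nova.conf values.
cfg = configparser.ConfigParser(interpolation=None)
cfg.read(NOVA_CONF)

# Both options live in [DEFAULT]; fall back to the upstream defaults when unset.
timeout = cfg.getint("DEFAULT", "vif_plugging_timeout", fallback=300)
fatal = cfg.getboolean("DEFAULT", "vif_plugging_is_fatal", fallback=True)

print(f"vif_plugging_timeout = {timeout}s")
print(f"vif_plugging_is_fatal = {fatal}")
# With the defaults (300 s, fatal), a missing network-vif-plugged event makes
# the libvirt driver raise VirtualInterfaceCreateException, which is what the
# evacuation rebuild hits here.
~~~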
Recapping the initial investigation and team discussion: network-vif-plugged events are not received at the expected moments of the migration (DPDK is in use here, but the same can happen in standard ML2/OVS environments). Since the host evacuation flow is very close to the live migration one, this is most probably the same issue as the race condition observed in live migrations [1], which was split off from the initial live-migration disruption issue [2][3].

Nova expects a single event announcing the destination port binding; once it arrives, Nova considers the port bound and ready to transmit data. But on the Neutron side, several triggers were firing this event unexpectedly:

- When the port binding is updated, the port is set to down and then up again, forcing this event.
- When the port binding is updated, the binding is first deleted and then updated with the new information. That makes the source host set the port down and up again, sending the event.

Rodolfo and Sean worked on cleaning up that workflow, and the upstream master patches are now merged on both the Nova [4] and Neutron [5][6][7] sides. While this still requires additional testing and confirmation, these patches should also fix the issue observed here. A minimal sketch of the ordering race follows the reference list below.

[1] https://bugs.launchpad.net/neutron/+bug/1901707
[2] https://bugs.launchpad.net/neutron/+bug/1815989
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1860395
[4] https://review.opendev.org/c/openstack/nova/+/767368
[5] https://review.opendev.org/c/openstack/neutron/+/640258
[6] https://review.opendev.org/c/openstack/neutron/+/753314
[7] https://review.opendev.org/c/openstack/neutron/+/766277
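To make the race concrete, here is an illustrative sketch (not Nova's actual external-event machinery; all class and function names are made up) of what the log captures: an event delivered before a waiter is registered is discarded as "unexpected", and the later wait then times out:

~~~
# Minimal simulation of the ordering race; names are hypothetical, not Nova code.
import threading

class EventWaiter:
    """Tiny stand-in for a per-instance external-event registry."""

    def __init__(self):
        self._pending = {}          # event name -> threading.Event
        self._lock = threading.Lock()

    def prepare(self, name):
        """Analogue of 'Preparing to wait for external event <name>'."""
        with self._lock:
            self._pending[name] = threading.Event()

    def deliver(self, name):
        """Event pushed from the network backend (Neutron in the real system)."""
        with self._lock:
            waiter = self._pending.get(name)
        if waiter is None:
            # Nobody is waiting yet -> 'Received unexpected event <name>'; dropped.
            print(f"unexpected event {name}, dropping")
            return
        waiter.set()

    def wait(self, name, timeout):
        if not self._pending[name].wait(timeout):
            # Corresponds to eventlet.timeout.Timeout -> VirtualInterfaceCreateException.
            raise TimeoutError(f"timed out waiting for {name}")

waiter = EventWaiter()
event = "network-vif-plugged-<PORT-ID>"

waiter.deliver(event)              # arrives too early -> dropped as unexpected
waiter.prepare(event)              # Nova only now starts waiting
try:
    waiter.wait(event, timeout=1)  # 300 s in the real deployment
except TimeoutError as exc:
    print(exc)                     # the rebuild fails and the instance goes to ERROR
~~~

Roughly speaking, the upstream patches above clean up when the event is emitted and how it is waited for, so that delivery no longer lands in the unregistered window simulated here.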
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483