The goal of this bug is to at least describe the problem we have in Neutron+OVN and the possible solution.

Currently, live migration with the OVN backend does not perform correctly. When the VM TAP port is created in the destination host (this port has the Neutron port ID), OVS detects it and creates the needed OF rules. Once the VM is unpaused and the port created, Nova sends the port binding update to set the new host in the port; Neutron detects this change and updates the LSP binding to the new chassis. As you can imagine, by the time the OF rules are created and the LSP is bound to the new chassis, some VM packets have been lost. Some customers are claiming that the network disconnection is around 50 seconds (to be confirmed).

The proposal we present here is to have something similar to Neutron's multiple port binding [1]. This feature allows Neutron to have multiple port instances in several hosts at the same time. That is useful during the live migration because Neutron can track the status in both hosts, origin and destination. When the destination port is activated, the source port binding is deactivated and deleted.

In OVN we have the "MAC_Binding" table, in the SB. It tracks the LSP association to a chassis. The LSP:MAC_Binding association is 1:0..1; that means that if there is an LSP, there could be no binding (the port is not present in any chassis) or just one. What is proposed is to have a 1:0..2 or 1:0..* association (for the case we are presenting, 1:0..2 is probably enough). Of course, only one "MAC_Binding" will be active at any time; that introduces the concept of "MAC_Binding.activate", which could be "True" for just one of the "MAC_Binding" records associated with an LSP.

The live migration states could be the following ones:

When a VM is created, a Neutron port is created along with an LSP. When the VM creates the TAP port, OVS detects this new port, informs OVN and the "MAC_Binding" record is created.

Now Nova commands a live migration.
This will update, in Neutron, the port binding. This Neutron port will have two port bindings: the origin host (active) and the destination host (inactive).

During the migration, and thanks to [2], Nova creates an intermediate OVS bridge. The VM TAP port will be created on this new bridge. The patch port connecting this new bridge and "br-int" will have the Neutron port ID. That will trigger the OF rule generation on this chassis for this port. At this point, the VM TAP port is still not created.

The last state is the destination binding activation. During the post live migration phase, Nova creates the VM TAP port and unpauses the VM in the destination host. QEMU will send several RARPs to update the ARP tables with the new port location. The proposal is that an OVS controller rule, installed on the destination host, monitors those RARPs (sent from known MAC/IP addresses). When they are detected, OVN will update the LSP binding by (1) deleting the origin chassis MAC_Binding record and (2) setting the destination chassis MAC_Binding.activate flag.

If you have any questions, please don't hesitate to ask me (ralonsoh) or Sean Mooney (sean-k-mooney).

Regards.

[1] https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neutron-new-port-binding-api.html
[2] https://review.opendev.org/c/openstack/neutron-specs/+/799198
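
To make the proposed states easier to follow, here is a minimal Python sketch of the binding model described above: an LSP with at most two chassis bindings, of which only one is active, and an activation step that stands in for the moment the RARP is detected on the destination host. This is only an illustration, not OVN or Neutron code; the class names (ChassisBinding, LogicalSwitchPort), the method names and the host names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChassisBinding:
    """Hypothetical stand-in for the per-chassis binding record."""
    chassis: str
    active: bool = False


@dataclass
class LogicalSwitchPort:
    """Hypothetical LSP model: 0..2 bindings, at most one of them active."""
    name: str
    bindings: list = field(default_factory=list)

    MAX_BINDINGS = 2  # the 1:0..2 association proposed in the description

    def bind(self, chassis):
        """Add a binding; the first one is active, a second one is inactive."""
        if any(b.chassis == chassis for b in self.bindings):
            raise ValueError(f"{self.name} is already bound to {chassis}")
        if len(self.bindings) >= self.MAX_BINDINGS:
            raise ValueError(f"{self.name} already has {self.MAX_BINDINGS} bindings")
        binding = ChassisBinding(chassis, active=not self.bindings)
        self.bindings.append(binding)
        return binding

    def activate(self, chassis):
        """Assumed reaction to a RARP seen on `chassis`:
        (1) delete every other binding, (2) mark this one active."""
        target = next((b for b in self.bindings if b.chassis == chassis), None)
        if target is None:
            raise KeyError(chassis)
        self.bindings = [target]
        target.active = True

    def active_chassis(self):
        """Return the chassis of the active binding, or None if unbound."""
        for b in self.bindings:
            if b.active:
                return b.chassis
        return None


# Walk through the states: VM created, migration requested, RARP detected.
lsp = LogicalSwitchPort("port-1")
lsp.bind("source-host")      # VM boots: single, active binding
lsp.bind("dest-host")        # live migration starts: second, inactive binding
lsp.activate("dest-host")    # RARP detected on the destination host
```

In this sketch the source host stays the active chassis until activate() is called, which corresponds to the window in which the OF rules already exist on the destination but traffic is still delivered to the origin.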
To confirm, the bug is talking about "MAC_Binding" reference being 0..2..., but probably should talk about Port_Binding instead. Correct? The latter is used to track chassis; MAC_Binding is for external ip-mac mapping learning.
Hi Ihar: Right, I made this mistake both in the description and the title (at least I was consistent!). I'll change the title. Thanks!
Initial patch version: https://patchwork.ozlabs.org/project/ovn/patch/20220126031825.405154-1-ihrachys@redhat.com/ (still requires some test failure cleanup and ddlog northd implementation). Need to actually test in OpenStack environment...
(In reply to Ihar Hrachyshka from comment #5)
> Initial patch version:
> https://patchwork.ozlabs.org/project/ovn/patch/20220126031825.405154-1-
> ihrachys/ (still requires some test failure cleanup and ddlog
> northd implementation). Need to actually test in OpenStack environment...

Any assistance needed from the OSP team - please ping them.
Upstream feature implementation: https://patchwork.ozlabs.org/project/ovn/list/?series=292585
Updated "Fixed in version" to capture the build that contains a fix for localnet-attached switches, which reduces network downtime for VLAN-backed networks.