Bug 1903653
Summary: | Instance live-migration observes ping lost OVN | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Ivan Richart <irichart> |
Component: | openstack-neutron | Assignee: | Terry Wilson <twilson> |
Status: | CLOSED CANTFIX | QA Contact: | Eran Kuris <ekuris> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 16.1 (Train) | CC: | alifshit, atragler, bcafarel, bdobreli, bjarolim, dalvarez, dhill, enothen, fdeutsch, ffernand, fgadkano, jlibosva, kizawa-h, madgupta, mdulko, mschindl, oelswah, ralonsoh, scohen, sean.k.mooney, smooney, stephenfin, tmicheli, twilson, ykaul |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | --- | Flags: | twilson: needinfo-, twilson: needinfo-, twilson: needinfo- |
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-01-18 22:18:22 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 2012179, 2076356, 2104522, 2111956 | ||
Bug Blocks: | 1823988 |
Description
Ivan Richart
2020-12-02 15:10:51 UTC
How long is the downtime? As per the OSP documentation, there is no guarantee of zero packet loss, and some downtime is expected: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/instances_and_images_guide/migrating-virtual-machines-between-compute-nodes#migration-types

You can try tuning the following nova parameters:
live_migration_permit_post_copy=true
live_migration_timeout_action=force_complete
live_migration_permit_auto_converge=true

You will have to set them on all controllers in /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf and restart the nova services. Please let me know whether these steps helped in any way, what the customer's expectations are, and what downtime they observe.

To summarise where we are at the moment: the change in behavior we're seeing here is due to the change in networking backend and default plugging behavior between OSP 13 and OSP 16.x. In OSP 13, we used ML2/OVS with hybrid plug, meaning iptables was used to enforce firewalling and security groups (a VM would be connected to the vSwitch through a Linux bridge). OSP 16 uses ML2/OVN, which does not use hybrid plug (the VM is connected directly to the vSwitch). With hybrid plug, libvirt did not create the ports that are actually attached to the OVS dataplane. Instead, this was done by os-vif in pre-live migration. In 16.x, the port that libvirt creates is directly attached to the OVS bridge; OVN observes this and creates the necessary flow rules. However, this port creation only happens when the instance is created on the destination host, right before it is resumed. This leaves a much shorter interval in which to install the OpenFlow rules. This matters because on resume, QEMU sends MAC learning frames (reverse ARP packets). Because OVN has not yet installed the flow rules, these are dropped.
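To make the dropped frames concrete, here is a minimal sketch of what a RARP self-announce frame looks like on the wire. It is an illustration only: the field layout follows the generic RARP request format from RFC 903, the function names are hypothetical, and the exact bytes QEMU emits on resume are an assumption that this bug does not confirm.

```python
import struct

ETHERTYPE_RARP = 0x8035  # EtherType assigned to Reverse ARP (RFC 903)
RARP_OP_REQUEST = 3      # "request reverse" opcode
MIN_FRAME_LEN = 60       # minimum Ethernet frame length without the FCS


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return bytes(int(part, 16) for part in parts)


def build_rarp_announce(mac: str) -> bytes:
    """Build a broadcast RARP request that announces `mac` (hypothetical helper)."""
    hw = mac_to_bytes(mac)
    # Ethernet header: broadcast destination, the guest MAC as source.
    eth_header = b"\xff" * 6 + hw + struct.pack("!H", ETHERTYPE_RARP)
    # ARP-style body: Ethernet hardware type, IPv4 protocol type,
    # address lengths 6 and 4, then the opcode.
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, RARP_OP_REQUEST)
    # Sender and target hardware addresses are both the guest MAC;
    # the IP fields are left as zeros (assumption of this sketch).
    body += hw + b"\x00" * 4 + hw + b"\x00" * 4
    # Pad with zeros up to the minimum Ethernet frame length.
    return (eth_header + body).ljust(MIN_FRAME_LEN, b"\x00")
```

A switch learns the guest's new location from the source MAC of such a frame. If the frame arrives on the destination host before OVN has installed flows for the new port, it is dropped, which is the window described above.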
With the patch that has been referenced previously, we're changing the libvirt interface type from bridge (a TAP device wired up to an actual OVS or Linux bridge) to ethernet (a generic, free-floating TAP device type). os-vif creates the port in pre-live migrate (rather than during the migration), which means in theory that OVN can create the OpenFlow rules sooner. However, the open question is whether creating the OVS port earlier is sufficient, or whether we need an actual port on the dataplane. From talking to the engineers on the networking team, it would appear that it is not sufficient, because the flow rules depend on having the real port. This patch will help slightly, but it won't resolve the issue fully. We see four options:
1. Provide a way to revert to hybrid plugging with ML2/OVN. This will have performance implications.
2. Revert to using ML2/OVS. We're not sure if this will be maintainable long term.
3. See if os-vif can precreate the TAP device. We're not sure this will be backportable.
4. Modify OVN to not request the ofport in its ingress pipeline.
We will discuss these options as a team and continue working on the backport in parallel.

This BZ is tracking Nova's qmp workaround. For QE, a simple smoke test is to continuously ping the instance being live-migrated and compare ping loss before and after the fix.

Filed https://bugzilla.redhat.com/show_bug.cgi?id=2042162 for the Nova tracker.

Closing this as CANTFIX, as neutron can't fix the lack of live migration support in OVN in 16.1. I have created 3 new bugs in nova for the workaround backport: https://bugzilla.redhat.com/show_bug.cgi?id=2042165 is for OSP 17, https://bugzilla.redhat.com/show_bug.cgi?id=2042163 is for OSP 16.2, and https://bugzilla.redhat.com/show_bug.cgi?id=2042162 is for OSP 16.1. We need to backport to all 3 releases with separate bugs, in that order. I have already submitted the upstream backports and started preemptive backports downstream for 17. I will do the 16.2 and 16.1 preemptive backports shortly.
To avoid regressions, we cannot release the fix on 16.1 until after it is merged in the 16.2 branch. This is a latent issue, so it does not qualify as a blocker for the next two OSP 16 z-stream releases. We can, however, potentially do a hotfix once the patches are merged across all 3 downstream branches.

*** Bug 1872937 has been marked as a duplicate of this bug. ***
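The QE smoke test mentioned in the comments above (ping continuously during the live migration and compare loss before and after the fix) could be sketched as follows. This is a rough illustration, not the QE team's actual test: the helper names, the default count and interval, and the reliance on the summary line format printed by the iputils ping command are all assumptions.

```python
import re
import subprocess
from typing import Tuple

# Matches the ping summary line, for example:
# "100 packets transmitted, 97 received, 3% packet loss, time 19912ms"
_SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_ping_summary(output: str) -> Tuple[int, int]:
    """Return (transmitted, received) parsed from ping's summary line."""
    match = _SUMMARY.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    return int(match.group(1)), int(match.group(2))


def lost_packets(output: str) -> int:
    """Number of echo requests that got no reply."""
    sent, received = parse_ping_summary(output)
    return sent - received


def ping_during_migration(address: str, count: int = 600,
                          interval: float = 0.2) -> int:
    """Ping `address` while the migration runs and return the lost count.

    Hypothetical helper: start it just before triggering the live
    migration and size `count` so the run outlasts the migration.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-i", str(interval), address],
        capture_output=True, text=True, check=False,
    )
    return lost_packets(result.stdout)
```

Running the same measurement once without and once with the workaround gives two loss counts to compare. With a fixed interval, the lost count multiplied by the interval gives a rough estimate of the observed downtime.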