Bug 2144492

Summary: Restarting OVS with DVR creates a network loop
Product: Red Hat OpenStack
Reporter: Roman Safronov <rsafrono>
Component: openstack-neutron
Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED ERRATA
QA Contact: Roman Safronov <rsafrono>
Severity: high
Docs Contact:
Priority: high
Version: 17.1 (Wallaby)
CC: apevec, bcafarel, chrisw, dalvarez, egarciar, gregraka, gurpsing, jamsmith, jelynch, jlibosva, lhh, lsvaty, majopela, mariel, mtomaska, pgrist, scohen, vkhitrin
Target Milestone: z1
Keywords: Performance, Reopened, Triaged
Target Release: 17.1
Flags: gurpsing: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-neutron-18.6.1-1.20230518200972.el9ost
Doc Type: Known Issue
Doc Text:
If you migrate a RHOSP 17.1.0 ML2/OVS deployment with distributed virtual routing (DVR) to ML2/OVN, the floating IP (FIP) downtime that occurs during ML2/OVN migration can exceed 60 seconds.
Story Points: ---
Clone Of:
: 2225666
Environment:
Last Closed: 2023-09-20 00:29:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1823324

Comment 27 Jakub Libosvar 2023-07-25 15:50:21 UTC
I understand it now.
  - The issue is reproducible with live traffic (an ICMP packet every 0.1 seconds is enough) by restarting the OVS agent on the compute node hosting a FIP together with the OVS agent on the network node hosting the gateway for the SNAT traffic.
  - The OVS agent with DVR creates a local loop between the tunneling and external networks. When the two agents are restarted at the same time, there is a very small window of about 0.5 seconds in which both agents have this loop, creating a full network loop. With live traffic, the reply traffic is flooded to the external network, reaches the network node, and gets into the tunnel through the loop. The tunneled traffic arrives back at the compute node, and the NORMAL action on br-int learns the source MAC address, which in this case is the GW port MAC address (fa:16:3e:3c:e6:41 from comment 20).
  - OVS records in its FDB that the GW port MAC belongs to the patch port to br-tun, since the frame was observed arriving from the tunnel.
  - All reply traffic goes to the GW port first. The OVS NORMAL action no longer floods that traffic, because it now knows the MAC, and instead sends it to the patch port to the br-tun bridge, where it is dropped because it is not expected there.
  - Since there is no traffic with the source MAC of the GW port, the MAC entry expires.
  - After the expiration, the traffic resumes.
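
To make the sequence above easier to follow, here is a minimal toy model of a MAC-learning bridge in Python. It is only a sketch of the learning and aging behaviour described in the bullets, not OVS code: the `LearningSwitch` class, the `simulate` helper, the port name `patch-tun`, and the 300-second aging time are all assumptions chosen for illustration. Only the GW port MAC is taken from this bug.

```python
from dataclasses import dataclass, field

GW_MAC = "fa:16:3e:3c:e6:41"  # GW port MAC quoted in this bug
TUN_PATCH = "patch-tun"       # illustrative name for the br-int patch port to br-tun


@dataclass
class LearningSwitch:
    """Toy MAC-learning bridge; a stand-in for the NORMAL action on br-int."""
    aging_time: float = 300.0                 # seconds; assumed, adjust to your setup
    fdb: dict = field(default_factory=dict)   # mac -> (port, last_seen)

    def learn(self, src_mac, in_port, now):
        # A frame's source MAC is (re)learned on the port it arrived on.
        self.fdb[src_mac] = (in_port, now)

    def lookup(self, dst_mac, now):
        # Return the output port for dst_mac, or "flood" if unknown or expired.
        entry = self.fdb.get(dst_mac)
        if entry is None:
            return "flood"
        port, last_seen = entry
        if now - last_seen > self.aging_time:
            del self.fdb[dst_mac]             # entry aged out
            return "flood"
        return port


def simulate():
    sw = LearningSwitch()
    # Loop window: a frame with the GW MAC as source arrives from the tunnel.
    sw.learn(GW_MAC, TUN_PATCH, now=0.0)
    # Nothing refreshes the entry afterwards, so replies addressed to the
    # GW MAC follow the stale entry until it expires.
    return [sw.lookup(GW_MAC, now=t) for t in (1.0, 299.0, 301.0)]


if __name__ == "__main__":
    print(simulate())
```

In this model, replies sent to the GW MAC are steered to the tunnel patch port for as long as the stale entry lives and are flooded again only once it ages out, which matches the outage and recovery pattern described above. The real outage length depends on the bridge's actual aging setting.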

This is a bug in the OVS DVR code, and the question is whether it's worth fixing the loop itself or just the use of OVS restarts in the migration procedure. I'll treat this BZ as the latter and I'm going to open a new BZ on the OVS agent, just to kick off the discussion, but I'd be in favor of not fixing it, given that it's a deprecated driver and the bug has likely been present since DVR was introduced.

Comment 37 Jakub Libosvar 2023-08-03 15:58:38 UTC
*** Bug 2225666 has been marked as a duplicate of this bug. ***

Comment 53 errata-xmlrpc 2023-09-20 00:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:5138

Comment 54 Red Hat Bugzilla 2024-01-19 04:25:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.