I understand it now.

- This is reproducible by having live traffic (an ICMP ping every 0.1 seconds is enough) and restarting the OVS agent on the compute node hosting a FIP and the OVS agent on the network node hosting the gateway for the SNAT traffic.
- The OVS agent with DVR creates a local loop between the tunneling and external networks. When the 2 agents are restarted at the same time, there is a very small window of about 0.5 seconds where both agents have this loop, creating a full network loop. When there is live traffic, the reply traffic gets flooded to the external network, reaches the network node, and through the loop gets into the tunnel. The tunnel leads back to the compute node, where the NORMAL action on br-int learns the source MAC address, which in this case is the GW port MAC address (fa:16:3e:3c:e6:41 from comment 20).
- OVS learns in its FDB that the GW port MAC belongs to the patch port to br-tun, since the frame was observed arriving from the tunnel.
- All reply traffic goes to the GW port first, and the OVS NORMAL action no longer floods the traffic: it now knows the MAC and sends the traffic to the patch port to the br-tun bridge, where it is dropped because it's not expected there.
- Since there is no traffic with the source MAC of the GW port, the MAC entry expires.
- After the expiration, the traffic resumes.

This is a bug in the OVS DVR code, and the question is whether it's worth fixing the loop itself or just the use of OVS restarts in the migration procedure. I'll treat this BZ as the latter and I'm going to open a new BZ on the OVS agent, just to kick off the discussion, but I'd be in favor of not fixing it given that it's a deprecated driver and the bug has likely been present since DVR was introduced.
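For anyone following along, the learning sequence above can be illustrated with a toy model of a MAC-learning bridge. This is only a sketch, not OVS or Neutron code: the class, the port names (vm-port, router-port, patch-tun), the VM and external MAC addresses, and the timestamps are all made up for illustration, and the 300-second aging time is assumed to match the usual OVS default. Only the GW port MAC comes from this bug.

```python
class LearningBridge:
    """Toy MAC-learning bridge, loosely modelling the NORMAL action (illustrative only)."""

    def __init__(self, ports, aging_time=300):
        self.ports = set(ports)
        self.aging_time = aging_time  # seconds; assumed value
        self.fdb = {}                 # mac -> (port, last_seen)

    def receive(self, in_port, src_mac, dst_mac, now):
        """Learn src_mac on in_port and return the list of output ports."""
        # Drop FDB entries that have not been refreshed within aging_time.
        self.fdb = {mac: (port, seen) for mac, (port, seen) in self.fdb.items()
                    if now - seen < self.aging_time}
        # Learn (or refresh) the source MAC on the ingress port.
        self.fdb[src_mac] = (in_port, now)
        entry = self.fdb.get(dst_mac)
        if entry is None:
            # Unknown destination: flood to every port except the ingress port.
            return sorted(self.ports - {in_port})
        port, _ = entry
        return [] if port == in_port else [port]


GW_MAC = "fa:16:3e:3c:e6:41"   # GW port MAC from this bug
VM_MAC = "fa:16:3e:00:00:01"   # hypothetical VM MAC
EXT_MAC = "52:54:00:00:00:01"  # hypothetical external host MAC

br_int = LearningBridge(["vm-port", "router-port", "patch-tun"])

# 1. Before the loop: the GW MAC is unknown, so reply traffic is flooded.
before = br_int.receive("vm-port", VM_MAC, GW_MAC, now=0)

# 2. During the loop window: a frame with the GW MAC as source comes back
#    through the tunnel, so the GW MAC is learned on patch-tun.
br_int.receive("patch-tun", GW_MAC, EXT_MAC, now=1)

# 3. After the loop: reply traffic is sent only to patch-tun (and dropped in br-tun).
during = br_int.receive("vm-port", VM_MAC, GW_MAC, now=2)

# 4. Once the entry has aged out, flooding resumes and traffic recovers.
after_expiry = br_int.receive("vm-port", VM_MAC, GW_MAC, now=302)

print(before, during, after_expiry)
```

In this model the outage lasts exactly as long as the stale FDB entry, which is consistent with the traffic recovering on its own after the MAC entry expires.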
*** Bug 2225666 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:5138
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days