Bug 1795320

Summary: L3 connectivity loss
Product: Red Hat OpenStack Reporter: Randy Rubins <rrubins>
Component: openstack-neutron Assignee: Slawek Kaplonski <skaplons>
Status: CLOSED ERRATA QA Contact: Eran Kuris <ekuris>
Severity: medium Docs Contact:
Priority: medium    
Version: 13.0 (Queens) CC: amuller, apevec, astupnik, chrisw, dalvarez, lmartins, mburns, pmorey, ralonsoh, rhos-maint, scohen, skaplons
Target Milestone: z11 Keywords: Triaged, ZStream
Target Release: 13.0 (Queens) Flags: skaplons: needinfo-
skaplons: needinfo-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-neutron-12.1.1-3.el7ost Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-10 11:26:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Randy Rubins 2020-01-27 16:59:32 UTC
Description of problem:

Using VLAN provider networks and OVS v2.9.0, we are seeing inconsistent OpenFlow rules, which we believe are contributing to L3 connectivity issues between 2 VMs on 2 different hypervisors. The VMs are dual-homed: each is connected to two separate provider networks. Only one of the interfaces experiences the connectivity issues (please see the attached artifacts).


Version-Release number of selected component (if applicable):
ovs_version: 2.9.0

How reproducible:
Set up a single VM per hypervisor using VLAN provider nets. At a random time, L3 connectivity loss appears between the 2 VMs, but only on one of the 2 links and only in one direction.

Steps to Reproduce:
1. Start with a clean environment with full VM-to-VM connectivity across 2 hypervisors
2. After some random time passes, observe L3 connectivity loss on one of the 2 links and only in one direction.
3. Troubleshoot and inspect the OpenFlow rules; some strange rules appear that either should not be there or require further explanation/follow-up.

Actual results:

L3 connectivity is disrupted and remains in this broken state until neutron-ovs-agent is restarted on the hypervisor of the source VM.

Expected results:

L3 connectivity should be maintained throughout.

Additional info:

L2 reachability seems to be OK (ARP resolution looks fine end to end). The tests were done with ping. ICMP Echo Requests are observed on the corresponding TAP interfaces but not on the physical interfaces linked with the provider net.
When testing in the opposite direction of the same flow (just reversing the source and destination VMs), ping works fine: the Echo Replies are received.
Not easily reproducible.
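
One way to narrow down where the Echo Requests disappear between the TAP and the physical interfaces is to take two flow dumps while the ping is running and see which entries' packet counters grow. The following is only a sketch under assumptions: the input is assumed to have the general shape of `ovs-ofctl dump-flows br-int` text output, the sample lines are made up for illustration (they are not taken from the attached artifacts), and the function names are hypothetical.

```python
import re

# Statistics that change between dumps; they are stripped so that the same
# flow entry gets the same key in both snapshots.
VOLATILE = re.compile(r"\b(?:duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*")

# Made-up snapshots (assumption: general shape of `ovs-ofctl dump-flows` output).
BEFORE = """\
 cookie=0x1, duration=100.0s, table=73, n_packets=10, n_bytes=980, priority=100,reg6=0x3,dl_dst=fa:16:3e:00:00:01 actions=resubmit(,81)
 cookie=0x1, duration=100.0s, table=92, n_packets=7, n_bytes=686, priority=0 actions=drop
"""
AFTER = """\
 cookie=0x1, duration=110.0s, table=73, n_packets=10, n_bytes=980, priority=100,reg6=0x3,dl_dst=fa:16:3e:00:00:01 actions=resubmit(,81)
 cookie=0x1, duration=110.0s, table=92, n_packets=12, n_bytes=1176, priority=0 actions=drop
"""


def counters(dump_text):
    """Map each flow entry (minus its volatile statistics) to its n_packets."""
    result = {}
    for line in dump_text.splitlines():
        line = line.strip()
        if " actions=" not in line:
            continue  # skip headers and blank lines
        head, actions = line.split(" actions=", 1)
        packets = re.search(r"\bn_packets=(\d+)", head)
        if packets is None:
            continue
        key = VOLATILE.sub("", head).strip() + " actions=" + actions
        result[key] = int(packets.group(1))
    return result


def packet_deltas(before_text, after_text):
    """Return only the flow entries whose packet counter increased."""
    before = counters(before_text)
    after = counters(after_text)
    return {
        key: count - before.get(key, 0)
        for key, count in after.items()
        if count > before.get(key, 0)
    }


if __name__ == "__main__":
    for flow, delta in packet_deltas(BEFORE, AFTER).items():
        print(f"+{delta} packets: {flow}")
```

In the made-up data only the drop entry's counter moves between the two snapshots, which is the kind of signal that would point at a specific table for closer inspection.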

Comment 2 Randy Rubins 2020-01-27 17:09:19 UTC
There are flows for which packets leave the VM but do not make it to the physical interfaces of the hypervisor. We need help figuring out why that is and how to address it in the long run.
In the given example, for the faulty flow, we notice an entry in 'table=73' matching on reg6 (the br-int internal VLAN) and the dl_dst MAC. All other entries of this type point to local TAP interfaces. This one eventually (through more lookups in other tables) points to a VXLAN interface, and the traffic gets dropped (incrementing counters in 'table=92').
The mirrored flow (reversing source and destination) works fine.
ARP traffic flows ok in all directions.
In addition, if the assumption that those flows should not be in OVS proves to be correct:
  - why weren't they cleaned up?
  - why were they defined in the first place?
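
The 'table=73' check described in this comment (entries matching on reg6 and dl_dst, compared against the MACs known to be local) could be scripted roughly as follows. This is a sketch, not the procedure actually used here: it assumes text in the general shape of `ovs-ofctl dump-flows br-int` output, the sample lines and MAC addresses are invented, and `local_macs` is a hypothetical input that would have to be collected separately (for example from the ports Neutron reports for this hypervisor).

```python
import re

# Invented sample lines (assumption: general shape of `ovs-ofctl dump-flows` output).
SAMPLE_DUMP = """\
 cookie=0x1, duration=100.0s, table=73, n_packets=10, n_bytes=980, priority=100,reg6=0x3,dl_dst=fa:16:3e:00:00:01 actions=resubmit(,81)
 cookie=0x1, duration=100.0s, table=73, n_packets=4, n_bytes=392, priority=100,reg6=0x3,dl_dst=fa:16:3e:00:00:02 actions=resubmit(,81)
 cookie=0x1, duration=100.0s, table=92, n_packets=7, n_bytes=686, priority=0 actions=drop
"""


def table73_entries(dump_text):
    """Collect table=73 entries that match on both reg6 and dl_dst."""
    entries = []
    for line in dump_text.splitlines():
        line = line.strip()
        if " actions=" not in line:
            continue  # not a flow entry
        head, actions = line.split(" actions=", 1)
        table = re.search(r"\btable=(\d+)", head)
        reg6 = re.search(r"\breg6=(0x[0-9a-fA-F]+)", head)
        dl_dst = re.search(r"\bdl_dst=([0-9a-fA-F:]{17})", head)
        if table and table.group(1) == "73" and reg6 and dl_dst:
            entries.append({
                "reg6": reg6.group(1),
                "dl_dst": dl_dst.group(1).lower(),
                "actions": actions,
            })
    return entries


def entries_for_unknown_macs(dump_text, local_macs):
    """Return table=73 entries whose dl_dst is not a known local MAC."""
    known = {mac.lower() for mac in local_macs}
    return [e for e in table73_entries(dump_text) if e["dl_dst"] not in known]


if __name__ == "__main__":
    # Hypothetical list of MACs belonging to ports on this hypervisor.
    for entry in entries_for_unknown_macs(SAMPLE_DUMP, ["fa:16:3e:00:00:01"]):
        print(entry["reg6"], entry["dl_dst"], entry["actions"])
```

Entries flagged this way are only candidates for review; whether such a flow is actually stale is the open question raised above.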

Comment 26 errata-xmlrpc 2020-03-10 11:26:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0770