Bug 1795320 - L3 connectivity loss
Summary: L3 connectivity loss
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z11
Target Release: 13.0 (Queens)
Assignee: Slawek Kaplonski
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-27 16:59 UTC by Randy Rubins
Modified: 2023-09-07 21:36 UTC
CC: 12 users

Fixed In Version: openstack-neutron-12.1.1-3.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-10 11:26:20 UTC
Target Upstream Version:
Embargoed:
skaplons: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0770 0 None None None 2020-03-10 11:26:38 UTC

Description Randy Rubins 2020-01-27 16:59:32 UTC
Description of problem:

Using VLAN provider networks and OVS v2.9.0, we are seeing inconsistent OpenFlow rules that we believe are contributing to L3 connectivity issues between two VMs on two different hypervisors. The VMs are dual-homed, connected to two separate provider networks. Only one of the interfaces experiences the connectivity issues (please see the attached artifacts).


Version-Release number of selected component (if applicable):
ovs_version: 2.9.0

How reproducible:
Set up a single VM per hypervisor using VLAN provider networks. At a random point in time, L3 connectivity loss appears between the two VMs, but only on one of the two links and only in one direction.

Steps to Reproduce:
1. Start with a clean environment with full VM-to-VM connectivity across two hypervisors.
2. After some random amount of time, L3 connectivity loss appears on one of the two links, and only in one direction.
3. While troubleshooting and inspecting the OpenFlow rules, we see some strange entries that either should not be there or require further explanation/follow-up (see the flow dump sketch below).
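A minimal sketch of the flow inspection used in step 3, run on the compute node hosting the source VM. br-int is the Neutron integration bridge referenced later in this report; the MAC address is a placeholder, not an address taken from the attached artifacts:

  # Dump all OpenFlow rules on the integration bridge
  # (add -O OpenFlow13 if version negotiation fails on your setup)
  ovs-ofctl dump-flows br-int

  # Narrow the output to rules matching the destination VM's MAC
  # (fa:16:3e:00:00:01 is a placeholder for the real port MAC)
  ovs-ofctl dump-flows br-int | grep -i fa:16:3e:00:00:01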

Actual results:

L3 connectivity is disrupted and remains in this broken state until neutron-ovs-agent is restarted on the hypervisor of the source VM.
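For reference, the workaround is simply restarting the agent on the affected compute node; a sketch assuming a systemd-managed service (containerized OSP 13 nodes would restart the agent container instead):

  # Restart the OVS agent on the hypervisor hosting the source VM;
  # this re-synchronizes the flows on br-int and restores connectivity
  systemctl restart neutron-openvswitch-agent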

Expected results:

L3 connectivity should be maintained throughout.

Additional info:

L2 reachability appears to be fine (ARP resolution works end to end). The tests were done with ping: ICMP Echo Requests are observed on the corresponding tap interfaces but not on the physical interfaces attached to the provider network.
Testing the same flow in the opposite direction (simply swapping the source and destination VMs), ping works fine and the Echo Replies are received.
The issue is not easily reproducible.
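A sketch of the captures used to confirm where the Echo Requests disappear; the tap and physical interface names below are placeholders for the actual devices on the source hypervisor:

  # On the source hypervisor, Echo Requests are visible on the VM's tap device
  tcpdump -ni tapXXXXXXXX-XX icmp

  # ...but never appear on the physical interface attached to the provider network
  # (traffic is VLAN-tagged here, hence the 'vlan and icmp' filter)
  tcpdump -eni eth1 vlan and icmp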

Comment 2 Randy Rubins 2020-01-27 17:09:19 UTC
There are flows for which packets leave the VM but do not make it to the physical interfaces of the hypervisor. We need help figuring out why that is and how to address it in the long run.
In the given example, for the faulty flow, we notice an entry in 'table=73' matching on reg6 (br-int internal vlan) and dl_dst MAC. All other entries of this type point to local TAP interfaces. This one eventually (through more lookups in other tables) points to a vxlan interface and traffic gets dropped (incrementing counters in 'table=92').
The mirrored flow (reversing source and destination) works fine.
ARP traffic flows ok in all directions.
In addition, if the assumption that those flows should not be in OVS proves correct:
  - why weren't they cleaned up?
  - why were they defined in the first place?
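For completeness, the commands used to look at the suspicious entries described above; br-int and the table numbers are the ones referenced in this comment:

  # Entries matching on reg6 (internal VLAN) and dl_dst; the suspect one
  # eventually resolves to the vxlan path instead of a local tap port
  ovs-ofctl dump-flows br-int table=73

  # Drop counters here increment while the faulty flow is being exercised
  ovs-ofctl dump-flows br-int table=92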

Comment 26 errata-xmlrpc 2020-03-10 11:26:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0770

