Bug 1289825

Summary: LACP with OVS bonds results in VRRP split brain
Product: Red Hat OpenStack
Component: openvswitch
Version: 7.0 (Kilo)
Target Release: 8.0 (Liberty)
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Reporter: Martin Lopes <mlopes>
Assignee: Lance Richardson <lrichard>
QA Contact: Ofer Blaut <oblaut>
Docs Contact: Martin Lopes <mlopes>
CC: aloughla, amuller, apevec, atragler, bhamrick, chorn, chrisw, dmesser, dsneddon, ealcaniz, lrichard, mleitner, mlopes, nyechiel, rhos-maint, rkhan, skinjo, srevivo, vcojot
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-08-25 21:32:45 UTC
Bug Blocks: 1290377

Description Martin Lopes 2015-12-09 04:35:15 UTC
In deployments with large amounts of network traffic, packet loss could occur when LACP is used with Open vSwitch (OVS) bonds. 

Multicast traffic is particularly affected, which results in split-brain issues with Layer 3 High Availability (VRRP), since VRRP depends on multicast.

This issue might initially manifest itself as consistently dropped SSH connections:
* Broken pipe
* Read from socket failed: Connection reset by peer

Comment 2 Martin Lopes 2015-12-09 04:36:05 UTC
A possible workaround is to change the OVS bond mode of the bond hosting the VXLAN tunnels from balance-tcp to balance-slb.
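
As an illustration of the workaround above, the following minimal sketch builds (but does not run) an ovs-vsctl command that sets the bond_mode column of an existing bond port. The helper name bond_mode_command and the port name bond1 are assumptions made for this example; substitute the bond that actually carries the tunnel traffic.

```python
# Hypothetical helper: returns the argv for changing an OVS bond's mode.
# It only builds the command; nothing is executed here.
VALID_BOND_MODES = {"active-backup", "balance-slb", "balance-tcp"}


def bond_mode_command(bond_port, mode="balance-slb"):
    """Return the ovs-vsctl argv that sets bond_mode on bond_port."""
    if mode not in VALID_BOND_MODES:
        raise ValueError(f"unsupported bond mode: {mode}")
    return ["ovs-vsctl", "set", "port", bond_port, f"bond_mode={mode}"]


# "bond1" is an assumed port name used only for illustration.
print(" ".join(bond_mode_command("bond1")))
```

The returned list can be passed to subprocess.run on a host where Open vSwitch is installed.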

Comment 6 Dan Sneddon 2015-12-09 17:28:38 UTC
When using OVS bonds with LACP, packet loss has been observed when large amounts of traffic are sent from many VMs. In particular, multicast packet loss has been observed, which can cause issues with Neutron's L3 HA mechanism, because it uses VRRP and VRRP relies on multicast for heartbeat traffic.

This issue has been validated and tested in the Red Hat OpenStack Quality Engineering labs.

No fix is currently known for OVS+LACP packet loss at scale. For versions of OSP-Director prior to 7.2, it is recommended that OVS bond mode "balance-slb" be used instead of "balance-tcp" (which requires LACP). In order for "balance-slb" to function, the member interfaces of the bond need to be configured as standard switch access ports with identical VLAN trunking configuration.

In the network interface configuration files, the bonding mode is typically exposed as a parameter. The parameter value is defined in the network-environment.yaml file prior to deployment:

parameter_defaults:
  BondInterfaceOvsOptions: "bond_mode=balance-slb"
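
To check an existing environment file before deployment, something like the following sketch could flag option strings that still select balance-tcp. It assumes the parameter value is a space-separated list of key=value tokens, as in the example above; the function names are hypothetical and not part of any Red Hat tooling.

```python
def parse_bond_options(options):
    """Split a space-separated 'key=value' option string into a dict."""
    parsed = {}
    for token in options.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {token!r}")
        parsed[key] = value
    return parsed


def uses_balance_tcp(options):
    """Return True if the option string selects bond_mode=balance-tcp."""
    return parse_bond_options(options).get("bond_mode") == "balance-tcp"


print(uses_balance_tcp("bond_mode=balance-slb"))  # False
```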

Comment 7 Dan Sneddon 2015-12-09 17:32:30 UTC
OSP-Director 7.2 and above support bonding using the Linux bonding kernel driver. In order to use LACP with Linux bonds in the network configuration files, use the following format for bonds:

-
  type: linux_bond
  name: bond1
  bonding_options: {get_param: BondInterfaceOvsOptions}
  members:
    -
      type: interface
      name: nic2
      primary: true
    -
      type: interface
      name: nic3

Comment 18 Flavio Leitner 2016-02-15 13:52:39 UTC
Hi,

Which openvswitch version is running? 2.4?

Does it work initially and then start failing after some time, or does it fail from the start?

Could you attach the following outputs from before and after reproducing the issue?

# ovs-vsctl show
# ovs-appctl bond/show  <for each bond interface>
# ovs-appctl lacp/show
# ovs-ofctl dump-ports <for each bridge>
# ovs-ofctl dump-flows <for each bridge>

Thanks,
fbl
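
The outputs requested in comment 18 could be gathered with a small script along these lines. This is a sketch only: it builds the command lines for a given set of bonds and bridges and prints them, and the names bond1 and br-ex are assumptions for illustration.

```python
def diagnostic_commands(bonds, bridges):
    """Return the argv lists for the diagnostics requested in comment 18."""
    cmds = [["ovs-vsctl", "show"]]
    # One bond/show call per bond interface.
    cmds += [["ovs-appctl", "bond/show", bond] for bond in bonds]
    cmds.append(["ovs-appctl", "lacp/show"])
    # Port counters and flow tables for each bridge.
    for bridge in bridges:
        cmds.append(["ovs-ofctl", "dump-ports", bridge])
        cmds.append(["ovs-ofctl", "dump-flows", bridge])
    return cmds


# "bond1" and "br-ex" are assumed names; use the ones from 'ovs-vsctl show'.
for cmd in diagnostic_commands(["bond1"], ["br-ex"]):
    print(" ".join(cmd))
```

Running each argv through subprocess.run with capture_output=True, once before and once after reproducing the issue, would give two sets of outputs that can be compared.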

Comment 21 Flavio Leitner 2016-06-22 17:35:30 UTC
Hi,

The bond mode balance-tcp requires LACP support enabled in the upstream switch.  However, we have seen reports from the field of flows moving to 'drop' or packet loss and in all cases they were caused by duplicated packets.

Basically, 802.3ad doesn't allow the same packet to be sent more than once on a trunk, so OVS assumes that it is not going to happen.  However, if it does happen, it poisons the fdb, which can cause connection issues like the ones described in this bugzilla.  Moving to balance-slb mode solves the issue because OVS then receives multicast traffic on only one slave and drops it on the others.
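
The fdb poisoning described above can be pictured with a toy MAC learning table. This is a deliberately simplified model written for illustration, not OVS code: the class, the MAC address, and the port names vnet0 and bond1 are all assumptions.

```python
class LearningTable:
    """Toy MAC learning table (fdb): source MAC -> port it was last seen on."""

    def __init__(self):
        self.mac_to_port = {}

    def receive(self, src_mac, in_port):
        """Learn src_mac on in_port; return True if an existing entry moved."""
        previous = self.mac_to_port.get(src_mac)
        self.mac_to_port[src_mac] = in_port
        return previous is not None and previous != in_port

    def lookup(self, dst_mac):
        """Return the port traffic for dst_mac would be forwarded to."""
        return self.mac_to_port.get(dst_mac)


fdb = LearningTable()
vm_mac = "fa:16:3e:00:00:01"          # hypothetical VM MAC address
fdb.receive(vm_mac, "vnet0")          # the VM sends a multicast frame
moved = fdb.receive(vm_mac, "bond1")  # a duplicate comes back on the bond
print(moved, fdb.lookup(vm_mac))      # True bond1
```

Once the entry points at the bond, traffic destined for the VM is forwarded in the wrong direction until the VM transmits again, which matches the kind of intermittent connectivity loss reported here.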

One example of such a report is bz#1320723.
Another is https://bugzilla.redhat.com/show_bug.cgi?id=1344648#c9.

So, as a next step, I would recommend capturing a traffic dump and looking for duplicated packets on the trunk.
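
One way to look for duplicated packets in a capture is sketched below. It assumes the frames have already been extracted from the dump as (timestamp, interface, raw bytes) tuples, and it treats byte-identical frames seen within a short window as duplicates. The function name and the 10 ms window are arbitrary assumptions for illustration.

```python
import hashlib


def find_duplicates(frames, window=0.01):
    """Return (digest, first_iface, second_iface) for byte-identical frames
    seen again within `window` seconds of the previous copy."""
    last_seen = {}  # digest -> (timestamp, interface) of the latest copy
    duplicates = []
    for ts, iface, raw in sorted(frames, key=lambda frame: frame[0]):
        digest = hashlib.sha256(raw).hexdigest()
        previous = last_seen.get(digest)
        if previous is not None and ts - previous[0] <= window:
            duplicates.append((digest, previous[1], iface))
        last_seen[digest] = (ts, iface)
    return duplicates


# Hypothetical sample: the same frame arrives on both bond members.
sample = [
    (0.000, "nic2", b"vrrp-advert"),
    (0.001, "nic3", b"vrrp-advert"),
    (0.500, "nic2", b"other-frame"),
]
for digest, first, second in find_duplicates(sample):
    print(first, second)  # nic2 nic3
```

The time window matters because periodic traffic can legitimately repeat identical frames; only copies that arrive almost simultaneously, typically on different bond members, point to duplication on the trunk.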

Comment 30 Lance Richardson 2016-08-25 21:32:45 UTC
This issue has the same root cause as bug 1267291 (deferred action FIFO overflow), so it is being marked as a duplicate.

*** This bug has been marked as a duplicate of bug 1267291 ***