Bug 1289825 - LACP with OVS bonds results in VRRP split brain
Status: CLOSED DUPLICATE of bug 1267291
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assigned To: Lance Richardson
QA Contact: Ofer Blaut
Docs Contact: Martin Lopes
Keywords: ZStream
Depends On:
Blocks: 1290377
Reported: 2015-12-08 23:35 EST by Martin Lopes
Modified: 2016-08-25 18:01 EDT (History)
19 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2016-08-25 17:32:45 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Martin Lopes 2015-12-08 23:35:15 EST
In deployments with large amounts of network traffic, packet loss could occur when LACP is used with Open vSwitch (OVS) bonds. 

Multicast traffic is particularly affected, and results in split brain issues with Layer 3 High Availability (VRRP), since VRRP is dependent on multicast.

This issue might initially manifest itself as consistently dropped SSH connections:
* Broken pipe
* Read from socket failed: Connection reset by peer
Comment 2 Martin Lopes 2015-12-08 23:36:05 EST
A possible workaround is to reconfigure the OVS bond mode for the bond hosting the VXLAN tunnels from balance-tcp to balance-slb.
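As a sketch, the workaround can be applied to a running node with ovs-vsctl. The bond name "bond1" is a placeholder; find the real name with `ovs-appctl bond/list` or `ovs-vsctl list-ports <bridge>`:

```shell
#!/bin/sh
# Sketch only: "bond1" is a hypothetical bond name hosting the
# VXLAN tunnels -- substitute the actual bond on your node.
BOND=bond1

if command -v ovs-vsctl >/dev/null 2>&1; then
    # Switch the bond away from balance-tcp (which requires LACP)
    # and disable LACP, then confirm the new mode took effect.
    ovs-vsctl set port "$BOND" bond_mode=balance-slb
    ovs-vsctl set port "$BOND" lacp=off
    ovs-appctl bond/show "$BOND"
    RESULT="reconfigured"
else
    echo "ovs-vsctl not found; skipping live reconfiguration"
    RESULT="skipped"
fi
```

Note this changes only the running system; for the setting to survive a redeploy, the same bond mode must also be set in the deployment templates.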
Comment 6 Dan Sneddon 2015-12-09 12:28:38 EST
When using OVS bonds with LACP, packet loss has been observed with large amounts of traffic being sent from many VMs. In particular, multicast packet loss has been observed, which can cause issues with Neutron's L2 HA mechanism, which uses VRRP (VRRP relies on multicast for heartbeat traffic).

This issue has been validated and tested in the Red Hat OpenStack Quality Engineering labs.

No fix is currently known for OVS+LACP packet loss at scale. For versions of OSP-Director prior to 7.2, it is recommended that OVS bond mode "balance-slb" be used instead of "balance-tcp" (which requires LACP). In order for "balance-slb" to function, the member interfaces of the bond need to be configured as standard switch access ports with identical VLAN trunking configuration.

In the network interface configuration files, the bonding mode is typically set through a parameter. We define the parameter value in the network-environment.yaml file prior to deployment:

  BondInterfaceOvsOptions: "bond_mode=balance-slb"
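For context, this parameter lands in network-environment.yaml roughly as follows (a sketch; `parameter_defaults` is the standard tripleo-heat-templates section, but verify against your template set):

```yaml
# network-environment.yaml (fragment)
parameter_defaults:
  # balance-slb avoids the LACP requirement of balance-tcp
  BondInterfaceOvsOptions: "bond_mode=balance-slb"
```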
Comment 7 Dan Sneddon 2015-12-09 12:32:30 EST
OSP-Director 7.2 and above support bonding using the Linux bonding kernel driver. To use LACP with Linux bonds, define each bond in the network configuration files using the following format:

  - type: linux_bond
    name: bond1
    bonding_options: {get_param: BondInterfaceOvsOptions}
    members:
      - type: interface
        name: nic2
        primary: true
      - type: interface
        name: nic3
Comment 18 Flavio Leitner 2016-02-15 08:52:39 EST

What is the openvswitch version running? 2.4?

Does it work initially and then start failing after some time, or does it fail from the start?

Could you attach the following outputs from before and after reproducing the issue?

# ovs-vsctl show
# ovs-appctl bond/show  <for each bond interface>
# ovs-appctl lacp/show
# ovs-ofctl dump-ports <for each bridge>
# ovs-ofctl dump-flows <for each bridge>
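The outputs above can be gathered in one pass with a small script along these lines (a sketch; it assumes `ovs-appctl bond/list` and `ovs-vsctl list-br` are available, as in OVS 2.x):

```shell
#!/bin/sh
# Collect the requested OVS state into a single timestamped log.
OUT=ovs-debug-$(date +%Y%m%d-%H%M%S).log

if command -v ovs-vsctl >/dev/null 2>&1; then
    {
        echo "=== ovs-vsctl show ==="
        ovs-vsctl show
        # bond/list prints a header line, then one bond per line
        for bond in $(ovs-appctl bond/list | awk 'NR > 1 {print $1}'); do
            echo "=== ovs-appctl bond/show $bond ==="
            ovs-appctl bond/show "$bond"
        done
        echo "=== ovs-appctl lacp/show ==="
        ovs-appctl lacp/show
        for br in $(ovs-vsctl list-br); do
            echo "=== ovs-ofctl dump-ports $br ==="
            ovs-ofctl dump-ports "$br"
            echo "=== ovs-ofctl dump-flows $br ==="
            ovs-ofctl dump-flows "$br"
        done
    } > "$OUT" 2>&1
    echo "wrote $OUT"
else
    echo "ovs-vsctl not found; nothing collected"
fi
```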

Comment 21 Flavio Leitner 2016-06-22 13:35:30 EDT

The bond mode balance-tcp requires LACP support enabled in the upstream switch.  However, we have seen reports from the field of flows moving to 'drop' or of packet loss, and in all cases these were caused by duplicated packets.

Basically 802.3ad doesn't allow the same packet to be sent more than once on a trunk, so OVS assumes that this is not going to happen.  However, if it does happen, it poisons the FDB, which can cause connection issues like the one described in this bugzilla.  Moving to balance-slb mode solves the issue because OVS then receives multicast traffic on only one slave and drops it on the others.

One example is bz#1320723.
Another is https://bugzilla.redhat.com/show_bug.cgi?id=1344648#c9.

So, as a next step I would recommend capturing a traffic dump and looking for duplicated packets on the trunk.
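One way to check for duplicates offline is to compare packet counts before and after de-duplication (a sketch; the interface name and file paths are placeholders, and `editcap` ships with Wireshark):

```shell
#!/bin/sh
# Capture traffic on the bond, then remove duplicate packets with
# editcap -d (which drops packets identical to one of the previous
# 5); a large count difference suggests duplicated frames on the
# trunk.
IFACE=bond1              # placeholder: bond carrying VXLAN traffic
CAP=/tmp/bond.pcap
DEDUP=/tmp/bond-dedup.pcap

if command -v tcpdump >/dev/null 2>&1 \
    && command -v editcap >/dev/null 2>&1 \
    && ip link show "$IFACE" >/dev/null 2>&1; then
    tcpdump -i "$IFACE" -c 10000 -w "$CAP"
    editcap -d "$CAP" "$DEDUP"
    before=$(tcpdump -r "$CAP" 2>/dev/null | wc -l)
    after=$(tcpdump -r "$DEDUP" 2>/dev/null | wc -l)
    echo "captured $before packets, $after after de-duplication"
else
    echo "tcpdump/editcap/$IFACE not available; skipping capture"
fi
```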
Comment 30 Lance Richardson 2016-08-25 17:32:45 EDT
This issue has the same root cause as bug 1267291 (deferred action FIFO overflow), marking as duplicate.

*** This bug has been marked as a duplicate of bug 1267291 ***
