Bug 1289825 - LACP with OVS bonds results in VRRP split brain
Summary: LACP with OVS bonds results in VRRP split brain
Keywords:
Status: CLOSED DUPLICATE of bug 1267291
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assignee: Lance Richardson
QA Contact: Ofer Blaut
Docs Contact: Martin Lopes
URL:
Whiteboard:
Depends On:
Blocks: 1290377
 
Reported: 2015-12-09 04:35 UTC by Martin Lopes
Modified: 2019-10-10 10:39 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-25 21:32:45 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1275731 None None None 2019-08-08 09:12:41 UTC

Internal Links: 1275731

Description Martin Lopes 2015-12-09 04:35:15 UTC
In deployments with large amounts of network traffic, packet loss could occur when LACP is used with Open vSwitch (OVS) bonds. 

Multicast traffic is particularly affected, and its loss results in split-brain issues with Layer 3 High Availability (VRRP), since VRRP depends on multicast.

This issue might initially manifest itself as consistently dropped SSH connections:
* Broken pipe
* Read from socket failed: Connection reset by peer

Comment 2 Martin Lopes 2015-12-09 04:36:05 UTC
A possible workaround is to reconfigure the OVS-Bond mode for the bond hosting the vxlan tunnels from balance-tcp to balance-slb.

Comment 6 Dan Sneddon 2015-12-09 17:28:38 UTC
When using OVS bonds with LACP, packet loss has been observed when many VMs send large amounts of traffic. Multicast packet loss in particular has been observed, which can cause issues with Neutron's L3 HA mechanism, because it uses VRRP and VRRP relies on multicast for heartbeat traffic.

This issue has been validated and tested in the Red Hat OpenStack Quality Engineering labs.

No fix is currently known for OVS+LACP packet loss at scale. For versions of OSP-Director prior to 7.2, it is recommended that OVS bond mode "balance-slb" be used instead of "balance-tcp" (which requires LACP). In order for "balance-slb" to function, the member interfaces of the bond need to be configured as standard switch access ports with identical VLAN trunking configuration.

In the network interface configuration files, the bonding mode is typically a parameter. We define the parameter value in the network-environment.yaml file prior to deployment:

parameter_defaults:
  BondInterfaceOvsOptions: "bond_mode=balance-slb"

Comment 7 Dan Sneddon 2015-12-09 17:32:30 UTC
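OSP-Director 7.2 and above support bonding using the Linux bonding kernel driver. To use LACP with Linux bonds in the network configuration files, use the following format for bonds (the BondInterfaceOvsOptions parameter must then carry Linux bonding options, for example "mode=802.3ad", rather than OVS bond options):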

-
  type: linux_bond
  name: bond1
  bonding_options: {get_param: BondInterfaceOvsOptions}
  members:
    -
      type: interface
      name: nic2
      primary: true
    -
      type: interface
      name: nic3

Comment 18 Flavio Leitner 2016-02-15 13:52:39 UTC
Hi,

Which openvswitch version is running? 2.4?

Does it work initially and start failing after some time, or does it fail from the start?

Could you attach the following outputs from before and after reproducing the issue?

# ovs-vsctl show
# ovs-appctl bond/show  <for each bond interface>
# ovs-appctl lacp/show
# ovs-ofctl dump-ports <for each bridge>
# ovs-ofctl dump-flows <for each bridge>

Thanks,
fbl
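
To make the before/after collection repeatable, the commands requested above could be wrapped in a small script. This is only a sketch: the function names, the bond and bridge names, and the output file name are placeholders that are not part of the original request, and the commands must be run as root on the affected node.

```python
import subprocess


def diagnostic_commands(bonds, bridges):
    """Build the argv lists for the outputs requested in the comment above."""
    cmds = [["ovs-vsctl", "show"]]
    cmds += [["ovs-appctl", "bond/show", bond] for bond in bonds]
    cmds.append(["ovs-appctl", "lacp/show"])
    cmds += [["ovs-ofctl", "dump-ports", bridge] for bridge in bridges]
    cmds += [["ovs-ofctl", "dump-flows", bridge] for bridge in bridges]
    return cmds


def collect(bonds, bridges, out_path):
    """Run every command and write its output to out_path (hypothetical helper)."""
    with open(out_path, "w") as out:
        for argv in diagnostic_commands(bonds, bridges):
            result = subprocess.run(argv, capture_output=True, text=True)
            out.write("# " + " ".join(argv) + "\n")
            out.write(result.stdout + result.stderr + "\n")


# Example call (placeholder names, run once before and once after reproducing):
#   collect(["bond1"], ["br-ex", "br-tun"], "ovs-before.txt")
```

Running it once before and once after the failure gives two files that can be diffed and attached to the bug.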

Comment 21 Flavio Leitner 2016-06-22 17:35:30 UTC
Hi,

The bond mode balance-tcp requires LACP support enabled in the upstream switch.  However, we have seen reports from the field of flows moving to 'drop' or of packet loss, and in all cases they were caused by duplicated packets.

Basically, 802.3ad doesn't allow the same packet to be sent more than once on a trunk, so OVS assumes that this is not going to happen.  If it does happen, however, it poisons the fdb, which can cause connection issues like the one described in this bugzilla.  Moving to balance-slb mode solves the issue because OVS then receives multicast traffic on only one slave and drops it on the others.

One example is bz#1320723.
Another is https://bugzilla.redhat.com/show_bug.cgi?id=1344648#c9.
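
As an illustration of the fdb poisoning mentioned in this comment, here is a toy MAC learning table. It is a generic model of how a learning switch behaves, not OVS's actual implementation; the class, the MAC address, and the port names are made up for the example, and the assumption is that the duplicate is a copy of a local VM's own frame arriving back on the bond.

```python
class ToyFdb:
    """Toy MAC learning table: remembers the last port each source MAC was seen on."""

    def __init__(self):
        self.table = {}

    def learn(self, src_mac, in_port):
        # A learning switch always trusts the most recent sighting.
        self.table[src_mac] = in_port

    def lookup(self, dst_mac):
        # Unknown destinations would be flooded by a real switch.
        return self.table.get(dst_mac, "flood")


VM_MAC = "fa:16:3e:00:00:01"  # hypothetical MAC of a local VM

fdb = ToyFdb()
fdb.learn(VM_MAC, "vnet0")   # the VM sends a frame; its MAC is learned on its local port
before = fdb.lookup(VM_MAC)  # frames for the VM are delivered to vnet0

fdb.learn(VM_MAC, "bond1")   # a duplicate of that frame comes back in on the bond
after = fdb.lookup(VM_MAC)   # frames for the VM are now sent out of the bond instead

print(before, after)
```

In this model, the VM stays unreachable until it transmits again and its MAC is relearned on the local port, which would be consistent with the intermittent connection drops reported here.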

So, as a next step, I would recommend capturing a traffic dump and looking for duplicated packets in the trunk.
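
One way to look for duplicates once a capture exists is to hash every frame and report those seen more than once within a short time. The sketch below assumes the capture has already been exported into (timestamp, interface, raw bytes) records; that record format, the function name, and the 0.5 second window are assumptions made for the example. Frames that are legitimately identical (periodic advertisements, retransmissions) can show up as false positives if the window is too wide.

```python
import hashlib
from collections import defaultdict


def find_duplicates(frames, window=0.5):
    """Return [(digest, [(timestamp, interface), ...])] for repeated frames.

    frames: iterable of (timestamp_seconds, interface_name, raw_frame_bytes).
    A frame counts as duplicated when two byte-identical copies were captured
    no more than `window` seconds apart.
    """
    seen = defaultdict(list)
    for ts, iface, raw in frames:
        seen[hashlib.sha256(raw).hexdigest()].append((ts, iface))

    duplicates = []
    for digest, hits in seen.items():
        hits.sort()
        if any(b[0] - a[0] <= window for a, b in zip(hits, hits[1:])):
            duplicates.append((digest, hits))
    return duplicates


# Made-up sample: the same frame captured on both bond members 1 ms apart.
sample = [
    (0.000, "nic2", b"vrrp-advert-1"),
    (0.001, "nic3", b"vrrp-advert-1"),
    (1.000, "nic2", b"vrrp-advert-2"),
]
dups = find_duplicates(sample)
for digest, hits in dups:
    print(digest[:12], hits)
```

A frame reported on two different member interfaces of the same bond is the case of interest here, since that is the situation 802.3ad is supposed to rule out.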

Comment 30 Lance Richardson 2016-08-25 21:32:45 UTC
This issue has the same root cause as 1267291 (deferred action FIFO
overflow), marking as duplicate.

*** This bug has been marked as a duplicate of bug 1267291 ***

