Bug 1289825
Summary: | LACP with OVS bonds results in VRRP split brain | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Martin Lopes <mlopes> |
Component: | openvswitch | Assignee: | Lance Richardson <lrichard> |
Status: | CLOSED DUPLICATE | QA Contact: | Ofer Blaut <oblaut> |
Severity: | medium | Docs Contact: | Martin Lopes <mlopes> |
Priority: | medium | | |
Version: | 7.0 (Kilo) | CC: | aloughla, amuller, apevec, atragler, bhamrick, chorn, chrisw, dmesser, dsneddon, ealcaniz, lrichard, mleitner, mlopes, nyechiel, rhos-maint, rkhan, skinjo, srevivo, vcojot |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | 8.0 (Liberty) | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-08-25 21:32:45 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1290377 | | |
Description
Martin Lopes
2015-12-09 04:35:15 UTC
A possible workaround is to reconfigure the OVS bond mode for the bond hosting the VXLAN tunnels from balance-tcp to balance-slb.

When using OVS bonds with LACP, packet loss has been observed when many VMs send large amounts of traffic. In particular, multicast packet loss has been observed, which can cause issues with Neutron's L2 HA mechanism, which uses VRRP (VRRP relies on multicast for heartbeat traffic). This issue has been validated and tested in the Red Hat OpenStack Quality Engineering labs. No fix is currently known for OVS+LACP packet loss at scale.

For versions of OSP-Director prior to 7.2, it is recommended that OVS bond mode "balance-slb" be used instead of "balance-tcp" (which requires LACP). In order for "balance-slb" to function, the member interfaces of the bond need to be configured as standard switch access ports with identical VLAN trunking configuration.

In the network interface configuration files, the bonding mode is typically a parameter. We define the parameter value in the network-environment.yaml file prior to deployment:

    parameter_defaults:
      BondInterfaceOvsOptions: "bond_mode=balance-slb"

OSP-Director 7.2 and above support bonding using the Linux bonding kernel driver. In order to use LACP with Linux bonds in the network configuration files, use the following format for bonds:

    - type: linux_bond
      name: bond1
      bonding_options: {get_param: BondInterfaceOvsOptions}
      members:
        - type: interface
          name: nic2
          primary: true
        - type: interface
          name: nic3

Hi,

What is the openvswitch version running? 2.4?
Does it work initially and then start failing after some time, or does it fail from the start?

Could you attach the following outputs from before and after reproducing the issue?

# ovs-vsctl show
# ovs-appctl bond/show <for each bond interface>
# ovs-appctl lacp/show
# ovs-ofctl dump-ports <for each bridge>
# ovs-ofctl dump-flows <for each bridge>

Thanks,
fbl

Hi,

The bond mode balance-tcp requires LACP support enabled in the upstream switch.
However, we have seen reports from the field of flows moving to 'drop' or of packet loss, and in all cases they were caused by duplicated packets. Basically, 802.3ad doesn't allow the same packet to be sent more than once on a trunk, so OVS assumes that it is not going to happen. If it does happen, however, it poisons the FDB, which can cause connection issues like the one described in this bugzilla. Moving to balance-slb mode solves the issue because OVS then receives multicast traffic on only one slave and drops it on the others.

One example is bz#1320723. Another is https://bugzilla.redhat.com/show_bug.cgi?id=1344648#c9.

So, as a next step, I would recommend capturing a traffic dump and looking for duplicated packets in the trunk.

This issue has the same root cause as 1267291 (deferred action FIFO overflow), marking as duplicate.

*** This bug has been marked as a duplicate of bug 1267291 ***
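
For the duplicate-packet check recommended above, a minimal sketch of one possible approach follows. It is not a tool referenced anywhere in this bug. It assumes the traffic dump was saved in the classic pcap format (for example with tcpdump -w; pcapng is not handled), and the function names read_pcap_frames and find_duplicates are hypothetical. It treats two frames as duplicates only when their captured bytes are identical, so legitimate byte-identical retransmissions would also be reported and need to be ruled out by hand.

```python
import hashlib
import struct
from collections import defaultdict

# Classic pcap magic numbers mapped to the struct byte-order prefix.
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def read_pcap_frames(data):
    """Yield (index, frame_bytes) for each record in a classic pcap byte string."""
    endian = PCAP_MAGICS.get(data[:4])
    if endian is None:
        raise ValueError("not a classic pcap file (pcapng is not handled)")
    offset = 24  # size of the pcap global header
    index = 0
    while offset + 16 <= len(data):
        # Per-record header: ts_sec, ts_frac, captured length, original length.
        _sec, _frac, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += 16
        frame = data[offset:offset + incl_len]
        if len(frame) < incl_len:
            break  # truncated final record
        yield index, frame
        offset += incl_len
        index += 1


def find_duplicates(data):
    """Return lists of record indexes whose captured bytes are identical."""
    seen = defaultdict(list)
    for index, frame in read_pcap_frames(data):
        seen[hashlib.sha256(frame).hexdigest()].append(index)
    return [indexes for indexes in seen.values() if len(indexes) > 1]
```

To use it, read the capture file in binary mode and pass its contents to find_duplicates; each returned list holds the zero-based positions of frames that appear more than once in the capture.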