Red Hat Bugzilla – Bug 98464
bonding 802.3ad long failover time under heavy stress
Last modified: 2013-07-02 22:12:33 EDT
Description of problem:
When using bonding in 802.3ad mode under very heavy stress, removing the last
slave of the active aggregator might result in a long failover to another
aggregator (up to 90 sec.) because LACPDU packets are dropped from the slaves'
tx queue. The solution is to send such packets with the highest priority. The
solution was verified to work with 10/100/1000Mbps adapters, but might not be
good enough with 10/100 adapters - in which case it will be necessary to
wait out the entire timeout defined by the IEEE standard.
Version-Release number of selected component (if applicable):

How reproducible:
Configure a bonding team in 802.3ad mode with 2 or more gig adapters and 2 or
more 10/100 adapters. Start heavy bi-directional stress traffic between the
server and the clients and remove the gig slaves from the bond one by one. Once
the last gig slave is removed, traffic may stall until a new aggregator is
selected (the stall duration may vary between switches).
Steps to Reproduce:
1. insmod bonding mode=4
2. ifconfig bond0 <ip-addr>
3. ifenslave bond0 eth0 eth1 eth2 eth5 eth6 eth7
4. run stress traffic (e.g. iperf, netperf, etc.)
5. do ifenslave -c on all gig slaves
6. monitor traffic on remaining slaves
Actual results:
Traffic stalls for up to 90 sec. until a new aggregator is selected.

Expected results:
Traffic restarts immediately using another aggregator.
I sent a bug fix patch on June 26th to the bond-devel, linux-net and
linux-netdev lists. It has already been accepted by Jeff Garzik into his
net-drivers-2.4 BK tree. It is a further fix for the problem originally
reported by Jay Vosburgh, which occurred even without stress traffic.
ISSUE TRACKER 25887 OPENED AS SEV 1
Jeff, do we have patches for this in Taroon already or are they still in your
The fix appears to be implemented in the RHEL 3 B1 candidate kernel (version 2.4.21-