Bug 98464

Summary: bonding 802.3ad long failover time under heavy stress
Product: Red Hat Enterprise Linux 3
Reporter: Need Real Name <shmulik.hen>
Component: kernel
Assignee: Jeff Garzik <jgarzik>
Status: CLOSED RAWHIDE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: peterm, riel, tao
Hardware: i686
OS: Linux
URL: http://sourceforge.net/projects/bonding/
Fixed In Version: 2.4.21-1.1931.2.349.2.2.ent
Doc Type: Bug Fix
Last Closed: 2003-08-03 13:55:37 UTC

Description Need Real Name 2003-07-02 18:04:49 UTC
Description of problem:
When using bonding in 802.3ad mode under very heavy stress, removing the last 
slave of the active aggregator might result in a long failover to another 
aggregator (up to 90 sec.) due to LACPDU packets being dropped from the slaves' 
tx queues. The solution is to send such packets with the highest priority. The 
solution was verified to work with 10/100/1000 Mbps adapters, but might not be 
good enough when using 10/100 adapters, in which case it will be necessary to 
wait the entire timeout defined by the IEEE standard.
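The approach described above can be sketched as follows. This is an illustration of the idea, not the verbatim patch: the helper name is hypothetical, and the fragment only shows the one relevant step (marking the LACPDU skb with the highest traffic-control priority before queueing it, so the qdisc prefers it over bulk traffic under heavy tx load).

```c
/* Hedged sketch of the fix's core idea for the 2.4 bonding driver.
 * lacpdu_xmit() is a hypothetical name; the real transmit path lives
 * in drivers/net/bonding/bond_3ad.c. */
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/pkt_sched.h>

static int lacpdu_xmit(struct sk_buff *skb)
{
	/* TC_PRIO_CONTROL is the highest priority band, so LACPDUs are
	 * not dropped from a tx queue that is full of stress traffic. */
	skb->priority = TC_PRIO_CONTROL;
	return dev_queue_xmit(skb);
}
```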

Version-Release number of selected component (if applicable):
kernel-2.4.20-1.1931.2.231.2.11.ent

How reproducible:
Configure a bonding team in 802.3ad mode with 2 or more gig adapters and 2 or 
more 10/100 adapters. Start heavy bi-directional stress traffic between the 
server and the clients and remove the gig slaves from the bond one by one. Once 
the last gig slave is removed, traffic may stall until the new aggregator is 
selected (may vary between switches).

Steps to Reproduce:
1. insmod bonding mode=4
2. ifconfig bond0 <ip-addr>
3. ifenslave bond0 eth0 eth1 eth2 eth5 eth6 eth7
4. run stress traffic (e.g. iperf, netperf, etc.)
5. do ifenslave -c on all gig slaves
6. monitor traffic on remaining slaves
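The steps above can be consolidated into a script like the following. Interface names, the bond IP address, and which slaves are gigabit are assumptions; the commands mirror the steps in the report. With DRY_RUN=1 (the default) the commands are only printed, since actually running them requires root and the described hardware.

```shell
#!/bin/sh
# Hypothetical reproduction script for the 802.3ad failover stall.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"        # dry run: print the command instead of executing it
    else
        "$@"
    fi
}

run insmod bonding mode=4                             # step 1
run ifconfig bond0 192.168.1.10 up                    # step 2 (address assumed)
run ifenslave bond0 eth0 eth1 eth2 eth5 eth6 eth7     # step 3

# Step 4: start heavy bi-directional traffic separately, e.g. iperf or
# netperf between the server and the clients.

# Step 5: run ifenslave -c on each gig slave (eth0-eth2 assumed to be
# the gig adapters), then monitor the remaining 10/100 slaves (step 6).
for gig in eth0 eth1 eth2; do
    run ifenslave -c bond0 "$gig"
done
```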
    
Actual results:
Traffic stalls for up to 90 sec. until a new aggregator is selected.

Expected results:
Traffic restarts immediately using another aggregator.

Additional info:
I sent a bug-fix patch on June 26th to the bond-devel, linux-net and linux-
netdev lists. It has already been accepted by Jeff Garzik into his 
net-drivers-2.4 BK tree. It is a further fix for the original problem reported 
by Jay Vosburgh, which occurred even without stress traffic.

Comment 1 Larry Troan 2003-07-16 13:55:44 UTC
ISSUE TRACKER 25887 OPENED AS SEV 1

Comment 2 Rik van Riel 2003-07-16 13:59:11 UTC
Jeff, do we have patches for this in Taroon already, or are they still in your
queue?

Comment 3 Need Real Name 2003-07-23 17:14:01 UTC
The fix appears to be implemented in the RHEL 3 Beta 1 candidate kernel 
(version 2.4.21-1.1931.2.349.2.2.ent).