Description of problem:
When using bonding in 802.3ad mode under very heavy stress, removing the last slave of the active aggregator might result in a long failover to another aggregator (up to 90 sec.) due to LACPDU packets being dropped from the slaves' tx queues. The solution is to send such packets with the highest priority. The solution was verified to work with 10/100/1000 Mbps adapters, but might not be good enough when using 10/100 adapters, in which case it will be necessary to wait the entire timeout defined by the IEEE standard.

Version-Release number of selected component (if applicable):
kernel-2.4.20-1.1931.2.231.2.11.ent

How reproducible:
Configure a bonding team in 802.3ad mode with 2 or more gigabit adapters and 2 or more 10/100 adapters. Start heavy bi-directional stress traffic between the server and the clients, and remove the gigabit slaves from the bond one by one. Once the last gigabit slave is removed, traffic may stall until the new aggregator is selected (the delay may vary between switches).

Steps to Reproduce:
1. insmod bonding mode=4
2. ifconfig bond0 <ip-addr>
3. ifenslave bond0 eth0 eth1 eth2 eth5 eth6 eth7
4. Run stress traffic (e.g. iperf, netperf, etc.)
5. Run ifenslave -c on all gigabit slaves
6. Monitor traffic on the remaining slaves

Actual results:
Traffic stalls for up to 90 sec. until a new aggregator is selected.

Expected results:
Traffic restarts immediately using another aggregator.

Additional info:
A bug fix patch was sent by me on June 26th to the bond-devel, linux-net and linux-netdev lists. It has already been accepted by Jeff Garzik into his net-drivers-2.4 BK tree. It is a further fix for the original problem reported by Jay Vosburgh regarding the same issue without stress traffic.
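For reference, below is a minimal sketch of the fix approach described above: marking LACPDUs as control traffic so they are transmitted at the highest priority instead of being dropped from a tx queue saturated by stress traffic. The function name, the frame construction, and the 0x8809 Slow Protocols EtherType handling are illustrative assumptions modeled loosely on the 2.4 bonding driver's LACPDU transmit path; only the priority marking reflects the fix being described here.

/*
 * Hypothetical sketch (not the actual patch): transmit an LACPDU
 * at the highest priority so it survives a congested tx queue.
 */
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/pkt_sched.h>	/* TC_PRIO_CONTROL */

static int lacpdu_send_sketch(struct net_device *slave_dev,
			      const void *lacpdu, int length)
{
	struct sk_buff *skb = dev_alloc_skb(length);

	if (!skb)
		return -ENOMEM;

	skb->dev = slave_dev;
	skb->protocol = htons(0x8809);	/* IEEE 802.3 Slow Protocols */

	/*
	 * The fix: mark the frame as control traffic. The default
	 * pfifo_fast qdisc maps TC_PRIO_CONTROL to its highest-priority
	 * band, which the sparse LACPDU stream never fills, so the
	 * LACPDU is dequeued ahead of bulk traffic and is not dropped
	 * when the bulk bands overflow under stress.
	 */
	skb->priority = TC_PRIO_CONTROL;

	memcpy(skb_put(skb, length), lacpdu, length);

	return dev_queue_xmit(skb);
}

Without the priority marking, LACPDUs queue behind the stress traffic and can be dropped, so the link partner's LACP state times out and failover takes up to the full IEEE timeout; with it, the periodic LACPDUs keep flowing and the new aggregator is selected promptly.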
ISSUE TRACKER 25887 OPENED AS SEV 1
Jeff, do we have patches for this in Taroon already, or are they still in your queue?
The fix appears to be implemented in the RHEL 3 B1 candidate kernel (version 2.4.21-1.1931.2.349.2.2.ent).