Description of problem:
A system hung in ad_rx_machine. From the stack traces, we found that
ad_rx_machine took a lock using a plain spin_lock, was interrupted, ran
softirqs, and the same code path then ran again on that CPU and tried to
take the same lock. We found that upstream code (kernel 2.6.26) uses
spin_lock_bh and spin_unlock_bh instead. This appears to fix the problem
here, though since the hang was not reproducible on demand that is hard to
prove. The fix is in drivers/net/bonding/bond_3ad.c: change inline routine
__get_rx_machine_lock to call spin_lock_bh instead of spin_lock, and inline
routine __release_rx_machine_lock to call spin_unlock_bh instead of spin_unlock.
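To make the failure mode concrete, here is a minimal user-space sketch of the scenario described above. It is a toy single-CPU model, not the bonding driver code: every toy_* name is hypothetical, and the "lock" and "softirq" are plain flags standing in for the kernel mechanisms. It only illustrates why taking the lock with bottom halves disabled avoids the self-deadlock.

```c
#include <stdbool.h>

/* Toy single-CPU model. All toy_* names are hypothetical, not kernel APIs. */
struct toy_cpu {
    bool lock_held;        /* stands in for the rx machine spinlock */
    int  bh_disabled;      /* stands in for the bottom-half disable count */
    bool softirq_pending;  /* a receive softirq is waiting to run */
    bool deadlocked;       /* softirq tried to take a lock this CPU holds */
    int  frames_handled;   /* frames processed by the softirq path */
};

/* Receive path run from softirq context; it needs the same lock. */
static void toy_rx_softirq(struct toy_cpu *c)
{
    if (c->lock_held) {
        /* A real spinlock would spin forever here: the holder is the
         * interrupted code on this same CPU and can never release it. */
        c->deadlocked = true;
        return;
    }
    c->lock_held = true;
    c->frames_handled++;
    c->lock_held = false;
}

/* An interrupt raises the softirq; it runs at once unless BHs are disabled. */
static void toy_interrupt(struct toy_cpu *c)
{
    c->softirq_pending = true;
    if (c->bh_disabled == 0) {
        c->softirq_pending = false;
        toy_rx_softirq(c);
    }
}

/* Plain lock: softirqs may still run on this CPU while it is held. */
static void toy_spin_lock(struct toy_cpu *c)   { c->lock_held = true; }
static void toy_spin_unlock(struct toy_cpu *c) { c->lock_held = false; }

/* _bh variant: softirqs are deferred until the lock has been dropped. */
static void toy_spin_lock_bh(struct toy_cpu *c)
{
    c->bh_disabled++;
    c->lock_held = true;
}

static void toy_spin_unlock_bh(struct toy_cpu *c)
{
    c->lock_held = false;
    if (--c->bh_disabled == 0 && c->softirq_pending) {
        c->softirq_pending = false;
        toy_rx_softirq(c);  /* deferred softirq runs now, lock is free */
    }
}

/* Process context holds the lock, an interrupt arrives, then it unlocks.
 * Returns true if the modelled CPU would have deadlocked. */
bool toy_scenario(bool use_bh)
{
    struct toy_cpu c = {0};

    if (use_bh) toy_spin_lock_bh(&c); else toy_spin_lock(&c);
    toy_interrupt(&c);  /* LACPDU arrives while the lock is held */
    if (use_bh) toy_spin_unlock_bh(&c); else toy_spin_unlock(&c);

    return c.deadlocked;
}
```

In this model, toy_scenario(false) reports a deadlock because the softirq runs while the lock is still held, while toy_scenario(true) does not, because the softirq is deferred until the unlock.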
Version-Release number of selected component (if applicable):
2.6.18 kernel (bonding.ko driver)
How reproducible:
Does not reproduce on demand.
Steps to Reproduce:
Thanks for filing this. I plan to pull an update to the bonding driver for the next rhel update so I will post here when there are test kernels available that contain this fix. This upstream commit is the one we need to make sure to include:
Author: Jay Vosburgh <firstname.lastname@example.org>
Date: Fri Mar 21 22:29:33 2008 -0700
bonding: Fix locking in 802.3ad mode
The 802.3ad state machine lock can be acquired in both softirq and
not softirq context, but was not held at _bh to prevent a deadlock (which
could occur if a LACPDU arrived and was processed while the lock was
held).
Corrected this, now hold the state machine lock at _bh to prevent
deadlock.
Bug reported by Todd Fleisher <email@example.com>.
Signed-off-by: Jay Vosburgh <firstname.lastname@example.org>
Signed-off-by: Jeff Garzik <email@example.com>
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
My test kernels have been updated to include a patch for this bugzilla.
Please test them and report back your results. Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.
It looks OK (though the original problem was not reproducible on demand anyway, of course).
We had been occluding the normal bonding.ko file with one that has this fix in it already.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Partners, this bug should be fixed in the latest RHEL 5.3 Snapshot. We believe that you have some interest in its correct functionality, so we're making a friendly request to send us some testing feedback.
If you have a chance to test it, please share with us your findings. If you have successfully VERIFIED the fix, please add PartnerVerified to the Bugzilla keywords, along with a description of the results. Thanks!
~~ Snapshot 6 is out ~~ Partners, please test and let us know if this bug has been fixed. Add PartnerVerified keyword if everything works as expected. For any new issues encountered, CLONE this bug and report the issues in the new bug.
The fixed bonding driver was verified during our extensive release testing cycle with RHEL5.2. Stratus did not have the opportunity to retest with the 5.3 snapshot kernel; however, source inspection of bond_3ad.c in kernel 2.6.18-126 shows that the fix is present. We believe this one is fixed.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
*** Bug 502070 has been marked as a duplicate of this bug. ***
*** Bug 516867 has been marked as a duplicate of this bug. ***