Description of problem:
A system hung in ad_rx_machine. From the stack traces, we found that ad_rx_machine took a lock using a plain spin_lock, was interrupted, and ran softirqs; the same code path then ran again in softirq context and tried to take the same lock. We found that upstream code (kernel 2.6.26) uses spin_lock_bh and spin_unlock_bh instead. This appears to fix the problem here, though that is hard to prove because the hang was not reproducible on demand. The fix is in drivers/net/bonding/bond_3ad.c: change the inline routine __get_rx_machine_lock to call spin_lock_bh instead of spin_lock, and the inline routine __release_rx_machine_lock to call spin_unlock_bh instead of spin_unlock.
Version-Release number of selected component (if applicable):
2.6.18 kernel (bonding.ko driver)
How reproducible:
Does not reproduce on demand.
Thanks for filing this. I plan to pull an update to the bonding driver for the next RHEL update, so I will post here when test kernels that contain this fix are available. This upstream commit is the one we need to make sure to include:
commit 2bf86b7aa8e74bf81a9872f7b610f49b610a4649
Author: Jay Vosburgh <fubar.com>
Date: Fri Mar 21 22:29:33 2008 -0700
    bonding: Fix locking in 802.3ad mode
    The 802.3ad state machine lock can be acquired in both softirq and not softirq context, but was not held at _bh to prevent a deadlock (which could occur if a LACPDU arrived and was processed while the lock was held). Corrected this, now hold the state machine lock at _bh to prevent deadlock.
    Bug reported by Todd Fleisher <todd>.
    Signed-off-by: Jay Vosburgh <fubar.com>
    Signed-off-by: Jeff Garzik <jeff>
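For readers who want to see why the _bh variants matter, here is a minimal userspace sketch of the deadlock described in the commit message. It is a toy single-CPU model, not kernel code: every name in it (sim_cpu, sim_lock, sim_unlock, sim_interrupt, sim_run_softirq, sim_scenario) is hypothetical, and the "lock" is just a flag. It assumes the behaviour described above: a softirq may run on return from an interrupt unless bottom halves are disabled, and re-enabling bottom halves lets a pending softirq run.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model. All names are hypothetical, not kernel API. */
struct sim_cpu {
    bool lock_held;       /* models the rx machine spinlock */
    bool bh_disabled;     /* models "bottom halves disabled" */
    bool softirq_pending; /* a LACPDU arrived and awaits processing */
    int deadlocks;        /* softirq path found the lock already held */
    int frames_handled;   /* softirq path took the lock and finished */
};

/* Softirq path: like ad_rx_machine, it needs the same lock. */
static void sim_run_softirq(struct sim_cpu *cpu)
{
    if (!cpu->softirq_pending)
        return;
    cpu->softirq_pending = false;
    if (cpu->lock_held) {
        /* A real spin_lock() would spin here forever on this CPU. */
        cpu->deadlocks++;
        return;
    }
    cpu->lock_held = true;
    cpu->frames_handled++;
    cpu->lock_held = false;
}

/* An interrupt raises the softirq; it runs at once unless BHs are off. */
static void sim_interrupt(struct sim_cpu *cpu)
{
    cpu->softirq_pending = true;
    if (!cpu->bh_disabled)
        sim_run_softirq(cpu);
}

/* bh == false models spin_lock(); bh == true models spin_lock_bh(). */
static void sim_lock(struct sim_cpu *cpu, bool bh)
{
    if (bh)
        cpu->bh_disabled = true;
    assert(!cpu->lock_held);
    cpu->lock_held = true;
}

/* bh == true models spin_unlock_bh(): deferred softirqs run afterwards. */
static void sim_unlock(struct sim_cpu *cpu, bool bh)
{
    assert(cpu->lock_held);
    cpu->lock_held = false;
    if (bh) {
        cpu->bh_disabled = false;
        sim_run_softirq(cpu);
    }
}

/* Process context takes the lock, then an interrupt arrives. */
static struct sim_cpu sim_scenario(bool use_bh)
{
    struct sim_cpu cpu = {0};
    sim_lock(&cpu, use_bh);
    sim_interrupt(&cpu);
    sim_unlock(&cpu, use_bh);
    return cpu;
}
```

In this model, sim_scenario(false) records one would-be deadlock and handles no frame, while sim_scenario(true) defers the softirq until the lock is released and then handles the frame. The actual change in bond_3ad.c is only the switch from spin_lock/spin_unlock to spin_lock_bh/spin_unlock_bh in the two inline routines named in the description.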
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel5 Please test them and report back your results. Without immediate feedback there is a good chance this or any other fix for this driver will not be included in the upcoming update.
It looks OK (though the original problem was not reproducible on demand anyway, of course). We had been occluding the normal bonding.ko file with one that has this fix in it already.
in kernel-2.6.18-111.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
Partners, this bug should be fixed in the latest RHEL 5.3 Snapshot. We believe that you have some interest in its correct functionality, so we're making a friendly request to send us some testing feedback. If you have a chance to test it, please share with us your findings. If you have successfully VERIFIED the fix, please add PartnerVerified to the Bugzilla keywords, along with a description of the results. Thanks!
~~ Snapshot 6 is out ~~ Partners, please test and let us know if this bug has been fixed. Add PartnerVerified keyword if everything works as expected. For any new issues encountered, CLONE this bug and report the issues in the new bug.
The fixed bonding driver was verified during our extensive release testing cycle with RHEL5.2. Stratus did not have the opportunity to retest with the 5.3 snapshot kernel; however, source inspection of bond_3ad.c in kernel 2.6.18-126 shows that the fix is present. We believe this one is fixed.
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html
*** Bug 502070 has been marked as a duplicate of this bug. ***
*** Bug 516867 has been marked as a duplicate of this bug. ***