Bug 457300

Summary: hang in ad_rx_machine due to second attempt to lock spin_lock
Product: Red Hat Enterprise Linux 5
Reporter: Charlotte Richardson <charlotte.richardson>
Component: kernel
Assignee: Andy Gospodarek <agospoda>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: high
Priority: high
Version: 5.2
CC: bchan, cward, mgahagan, peterm, roland, tao
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 19:43:56 UTC

Description Charlotte Richardson 2008-07-30 19:15:14 UTC
Description of problem:
A system hung in ad_rx_machine. From the stack traces, we found that
ad_rx_machine took a lock using a plain spin_lock, was interrupted, and ran
softirqs; the same code path then ran again in softirq context and tried to
take the same lock. Upstream code (kernel 2.6.26) uses spin_lock_bh and
spin_unlock_bh instead. Making the same change appears to fix the problem here,
though that is hard to prove because the hang was not reproducible on demand.
The fix is in drivers/net/bonding/bond_3ad.c: change the inline routine
__get_rx_machine_lock to call spin_lock_bh instead of spin_lock, and the inline
routine __release_rx_machine_lock to call spin_unlock_bh instead of spin_unlock.
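
The failure mode described above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not the driver code:
every sim_* name is hypothetical, the lock is a plain flag rather than a real
spinlock, and a single CPU is assumed. It shows that a plain lock lets the
softirq receive path re-enter while the lock is held, whereas a _bh-style lock
defers the softirq until the lock is released.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: none of these sim_* names are kernel APIs. */
struct sim_cpu {
    bool lock_held;        /* models the rx machine lock */
    int  bh_disable_depth; /* > 0 while softirqs are masked on this CPU */
    bool softirq_pending;  /* a received LACPDU is waiting to be processed */
    bool self_deadlock;    /* set where a real CPU would spin forever */
    int  frames_handled;   /* LACPDUs processed by the softirq handler */
};

static void sim_rx_softirq(struct sim_cpu *cpu);

/* Run the pending softirq only if bottom halves are currently enabled. */
static void sim_run_softirq_if_allowed(struct sim_cpu *cpu)
{
    if (cpu->softirq_pending && cpu->bh_disable_depth == 0) {
        cpu->softirq_pending = false;
        sim_rx_softirq(cpu);
    }
}

/* Plain spin_lock()-style acquire: softirqs stay enabled. */
static bool sim_lock(struct sim_cpu *cpu)
{
    if (cpu->lock_held) {
        cpu->self_deadlock = true; /* real code would spin here forever */
        return false;
    }
    cpu->lock_held = true;
    return true;
}

static void sim_unlock(struct sim_cpu *cpu)
{
    assert(cpu->lock_held);
    cpu->lock_held = false;
}

/* spin_lock_bh()-style acquire: mask softirqs first, then take the lock. */
static bool sim_lock_bh(struct sim_cpu *cpu)
{
    cpu->bh_disable_depth++;
    return sim_lock(cpu);
}

/* spin_unlock_bh()-style release: unlock, unmask, then run deferred work. */
static void sim_unlock_bh(struct sim_cpu *cpu)
{
    sim_unlock(cpu);
    assert(cpu->bh_disable_depth > 0);
    cpu->bh_disable_depth--;
    sim_run_softirq_if_allowed(cpu);
}

/* Softirq-context receive path: needs the same lock as the caller below. */
static void sim_rx_softirq(struct sim_cpu *cpu)
{
    if (!sim_lock(cpu))
        return;
    cpu->frames_handled++;
    sim_unlock(cpu);
}

/* An interrupt delivers a LACPDU; softirqs run on interrupt exit. */
static void sim_interrupt(struct sim_cpu *cpu)
{
    cpu->softirq_pending = true;
    sim_run_softirq_if_allowed(cpu);
}

/* Non-softirq path that is interrupted while it holds the lock. */
static struct sim_cpu sim_scenario(bool use_bh)
{
    struct sim_cpu cpu = {0};

    if (use_bh)
        sim_lock_bh(&cpu);
    else
        sim_lock(&cpu);

    sim_interrupt(&cpu); /* LACPDU arrives while the lock is held */

    if (use_bh)
        sim_unlock_bh(&cpu);
    else
        sim_unlock(&cpu);

    return cpu;
}
```

Calling sim_scenario(false) leaves self_deadlock set and no frame handled,
which corresponds to the hang reported here. Calling sim_scenario(true) handles
the frame only after the lock has been dropped, which is the behavior the
spin_lock_bh change is meant to provide.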


Version-Release number of selected component (if applicable):
2.6.18 kernel (bonding.ko driver)


How reproducible:
Does not reproduce on demand.


Comment 1 Andy Gospodarek 2008-08-19 18:16:12 UTC
Thanks for filing this.  I plan to pull an update to the bonding driver into the next RHEL update, and I will post here when test kernels containing this fix are available.  This is the upstream commit we need to make sure to include:

commit 2bf86b7aa8e74bf81a9872f7b610f49b610a4649
Author: Jay Vosburgh <fubar.com>
Date:   Fri Mar 21 22:29:33 2008 -0700

    bonding: Fix locking in 802.3ad mode

        The 802.3ad state machine lock can be acquired in both softirq and
    not softirq context, but was not held at _bh to prevent a deadlock (which
    could occur if a LACPDU arrived and was processed while the lock was
    held).

        Corrected this, now hold the state machine lock at _bh to prevent
    deadlock.

        Bug reported by Todd Fleisher <todd>.

    Signed-off-by: Jay Vosburgh <fubar.com>
    Signed-off-by: Jeff Garzik <jeff>

Comment 5 RHEL Program Management 2008-09-02 19:45:01 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Andy Gospodarek 2008-09-09 03:10:00 UTC
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.

Comment 10 Charlotte Richardson 2008-09-09 15:58:31 UTC
It looks OK (though the original problem was not reproducible on demand anyway, of course).

We had been overriding the stock bonding.ko file with one that already has this fix in it.

Comment 11 Don Zickus 2008-09-11 19:43:43 UTC
Included in kernel-2.6.18-111.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 14 Chris Ward 2008-11-28 07:15:17 UTC
Partners, this bug should be fixed in the latest RHEL 5.3 Snapshot. We believe that you have some interest in its correct functionality, so we're making a friendly request to send us some testing feedback. 

If you have a chance to test it, please share with us your findings. If you have successfully VERIFIED the fix, please add PartnerVerified to the Bugzilla keywords, along with a description of the results. Thanks!

Comment 15 Chris Ward 2008-12-18 10:39:18 UTC
~~ Snapshot 6 is out ~~ Partners, please test and let us know if this bug has been fixed. Add PartnerVerified keyword if everything works as expected. For any new issues encountered, CLONE this bug and report the issues in the new bug.

Comment 16 Charlotte Richardson 2008-12-18 15:19:52 UTC
The fixed bonding driver was verified during our extensive release testing cycle with RHEL 5.2. Stratus did not have the opportunity to retest with the 5.3 snapshot kernel; however, source inspection of bond_3ad.c in kernel 2.6.18-126 shows that the fix is present. We believe this one is fixed.

Comment 19 errata-xmlrpc 2009-01-20 19:43:56 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Comment 22 Andy Gospodarek 2009-05-29 13:31:50 UTC
*** Bug 502070 has been marked as a duplicate of this bug. ***

Comment 23 Andy Gospodarek 2009-08-12 14:43:15 UTC
*** Bug 516867 has been marked as a duplicate of this bug. ***