Bug 457300 - hang in ad_rx_machine due to second attempt to lock spin_lock
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Andy Gospodarek
QA Contact: Martin Jenner
Duplicates: 502070 516867
Reported: 2008-07-30 15:15 EDT by Charlotte Richardson
Modified: 2014-06-29 19:00 EDT

Doc Type: Bug Fix
Last Closed: 2009-01-20 14:43:56 EST
Attachments: None

Description Charlotte Richardson 2008-07-30 15:15:14 EDT
Description of problem:
A system hung in ad_rx_machine. From the stack traces, we found that
ad_rx_machine took a lock using a plain spin_lock, got interrupted, ran
softirqs, and the same code path ran again and tried to take the same lock. We
found that upstream code (kernel 2.6.26) uses spin_lock_bh and spin_unlock_bh
instead. This appears to fix the problem here, though since it was not
reproducible on demand that is hard to prove. The fix is in
drivers/net/bonding/bond_3ad.c to change inline routine __get_rx_machine_lock to
do spin_lock_bh instead of spin_lock and inline routine
__release_rx_machine_lock to do spin_unlock_bh instead of spin_unlock.


Version-Release number of selected component (if applicable):
2.6.18 kernel (bonding.ko driver)


How reproducible:
Does not reproduce on demand.


Comment 1 Andy Gospodarek 2008-08-19 14:16:12 EDT
Thanks for filing this.  I plan to pull an update to the bonding driver for the next RHEL update, so I will post here when test kernels containing this fix are available.  This upstream commit is the one we need to make sure to include:

commit 2bf86b7aa8e74bf81a9872f7b610f49b610a4649
Author: Jay Vosburgh <fubar@us.ibm.com>
Date:   Fri Mar 21 22:29:33 2008 -0700

    bonding: Fix locking in 802.3ad mode

        The 802.3ad state machine lock can be acquired in both softirq and
    not softirq context, but was not held at _bh to prevent a deadlock (which
    could occur if a LACPDU arrived and was processed while the lock was
    held).

        Corrected this, now hold the state machine lock at _bh to prevent
    deadlock.

        Bug reported by Todd Fleisher <todd@fleish.org>.

    Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
    Signed-off-by: Jeff Garzik <jeff@garzik.org>
Comment 5 RHEL Product and Program Management 2008-09-02 15:45:01 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 9 Andy Gospodarek 2008-09-08 23:10:00 EDT
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.
Comment 10 Charlotte Richardson 2008-09-09 11:58:31 EDT
It looks OK (though the original problem was not reproducible on demand anyway, of course).

We had been occluding the normal bonding.ko file with one that has this fix in it already.
Comment 11 Don Zickus 2008-09-11 15:43:43 EDT
The patch is in kernel-2.6.18-111.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 14 Chris Ward 2008-11-28 02:15:17 EST
Partners, this bug should be fixed in the latest RHEL 5.3 Snapshot. We believe that you have some interest in its correct functionality, so we're making a friendly request to send us some testing feedback. 

If you have a chance to test it, please share with us your findings. If you have successfully VERIFIED the fix, please add PartnerVerified to the Bugzilla keywords, along with a description of the results. Thanks!
Comment 15 Chris Ward 2008-12-18 05:39:18 EST
~~ Snapshot 6 is out ~~ Partners, please test and let us know if this bug has been fixed. Add PartnerVerified keyword if everything works as expected. For any new issues encountered, CLONE this bug and report the issues in the new bug.
Comment 16 Charlotte Richardson 2008-12-18 10:19:52 EST
The fixed bonding driver was verified during our extensive release testing cycle with RHEL5.2. Stratus did not have the opportunity to retest with the 5.3 snapshot kernel; however, source inspection of bond_3ad.c in kernel 2.6.18-126 shows that the fix is present. We believe this one is fixed.
Comment 19 errata-xmlrpc 2009-01-20 14:43:56 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
Comment 22 Andy Gospodarek 2009-05-29 09:31:50 EDT
*** Bug 502070 has been marked as a duplicate of this bug. ***
Comment 23 Andy Gospodarek 2009-08-12 10:43:15 EDT
*** Bug 516867 has been marked as a duplicate of this bug. ***
