Bug 542593 - recursive lock of devlist_mtx
Summary: recursive lock of devlist_mtx
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: 5.5
Assignee: John W. Linville
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 526948
 
Reported: 2009-11-30 09:55 UTC by Stanislaw Gruszka
Modified: 2010-03-30 07:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 07:44:30 UTC
Target Upstream Version:
Embargoed:


Attachments
jwltest-wireless-cleanup_work.patch (1.58 KB, patch)
2009-11-30 16:26 UTC, John W. Linville
jwltest-wireless-cleanup_work.patch (2.69 KB, patch)
2009-12-02 02:52 UTC, John W. Linville


Links
System: Red Hat Product Errata
ID: RHSA-2010:0178
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update
Last Updated: 2010-03-29 12:18:21 UTC

Description Stanislaw Gruszka 2009-11-30 09:55:25 UTC
Version-Release number of selected component (if applicable):
kernel-2.6.18-175.el5.jwltest.95.2.i686.rpm

Steps to Reproduce:
Use the rfkill switch. I can reproduce it on my Lenovo T60 laptop.
  
Actual results:
System becomes unreliable.

Additional info:

cfg80211_rfkill_set_block() acquires devlist_mtx and calls dev_close() -> cfg80211_netdev_notifier_call(NETDEV_DOWN),
which also wants to acquire devlist_mtx. Here is the output from a
debug kernel:

=============================================
[ INFO: possible recursive locking detected ]
2.6.18-175.el5.jwltest.95.2debug #1
---------------------------------------------
events/0/8 is trying to acquire lock:
 (&rdev->devlist_mtx){--..}, at: [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]

but task is already holding lock:
 (&rdev->devlist_mtx){--..}, at: [<f8a1f10e>] cfg80211_rfkill_set_block+0x1d/0x59 [cfg80211]

other info that might help us debug this:
2 locks held by events/0/8:
 #0:  (rtnl_mutex){--..}, at: [<f8a1f0ff>] cfg80211_rfkill_set_block+0xe/0x59 [cfg80211]
 #1:  (&rdev->devlist_mtx){--..}, at: [<f8a1f10e>] cfg80211_rfkill_set_block+0x1d/0x59 [cfg80211]

stack backtrace:
 [<c043c15b>] __lock_acquire+0x76a/0x981
 [<c043c633>] lock_acquire+0x5d/0x78
 [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]
 [<c0628f60>] mutex_lock_nested+0xdf/0x253
 [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]
 [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]
 [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]
 [<c0624d25>] packet_notifier+0x139/0x141
 [<c062b6b4>] notifier_call_chain+0x19/0x29
 [<c05cc13e>] dev_close+0x5d/0x61
 [<f8a1f11e>] cfg80211_rfkill_set_block+0x2d/0x59 [cfg80211]
 [<c04339db>] run_workqueue+0x7e/0xbe
 [<f8a1f14a>] cfg80211_rfkill_sync_work+0x0/0x13 [cfg80211]
 [<c04342c8>] worker_thread+0xd9/0x10d
 [<c0439f26>] lock_release_holdtime+0x25/0x43
 [<c041fa1a>] default_wake_function+0x0/0xc
 [<c04341ef>] worker_thread+0x0/0x10d
 [<c0436781>] kthread+0xc0/0xeb
 [<c04366c1>] kthread+0x0/0xeb
 [<c0405d8f>] kernel_thread_helper+0x7/0x10
 =======================

So far I'm not sure how this should be fixed.
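
The locking pattern in the lockdep report can be reproduced in user space as a rough analogy. The sketch below is not kernel code: it uses a POSIX error-checking mutex as a stand-in for devlist_mtx, and the functions devlist_mtx_init(), rfkill_set_block() and notifier_call() are hypothetical placeholders for the real cfg80211 call path. An error-checking mutex returns EDEADLK on a recursive lock instead of hanging, which plays a role loosely similar to the lockdep warning above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for rdev->devlist_mtx. */
static pthread_mutex_t devlist_mtx;

/* Create the mutex as error-checking so a recursive lock is reported. */
static void devlist_mtx_init(void)
{
    pthread_mutexattr_t attr;
    int err;

    err = pthread_mutexattr_init(&attr);
    assert(err == 0);
    err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(err == 0);
    err = pthread_mutex_init(&devlist_mtx, &attr);
    assert(err == 0);
    pthread_mutexattr_destroy(&attr);
}

/* Stand-in for the NETDEV_DOWN notifier: it takes devlist_mtx itself. */
static int notifier_call(void)
{
    int err = pthread_mutex_lock(&devlist_mtx);

    if (err)
        return err; /* EDEADLK if this thread already holds the mutex */
    pthread_mutex_unlock(&devlist_mtx);
    return 0;
}

/* Stand-in for the rfkill path: holds the mutex across the notifier call. */
static int rfkill_set_block(void)
{
    int err = pthread_mutex_lock(&devlist_mtx);

    if (err)
        return err;
    err = notifier_call(); /* nested acquisition of the same mutex */
    pthread_mutex_unlock(&devlist_mtx);
    return err;
}
```

Calling notifier_call() on its own succeeds, while calling it through rfkill_set_block() reports the recursive acquisition. In the kernel there is no such error return for a plain mutex, so the second acquisition would simply block.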

Comment 1 John W. Linville 2009-11-30 14:31:44 UTC
OK, I think I caused this in the backport...see the references to cleanup_work in net/wireless/core.c?

I commented out the original code there because I was getting a hang on the call to cancel_work_sync when the device was first brought up.  IIRC I determined that this was due to the work having never been scheduled, and I wasn't sure how to determine that.  I'll poke at that...maybe if I can avoid that cancel_work_sync on the first NETDEV_UP then the old code can be used.

Comment 2 Stanislaw Gruszka 2009-11-30 15:42:46 UTC
(In reply to comment #1)
> OK, I think I caused this in the backport...see the references to cleanup_work
> in net/wireless/core.c?

Yes. I have no clean solution for that, as RHEL has no equivalent of cancel_work_sync() with a return value.

> I commented-out the original code there because I was getting a hang on the
> call to cancel_work_sync when the device was first brought-up.  IIRC I
> determined that this was due to the work having never been scheduled, and I
> wasn't sure how to determine that.  I'll poke at that...maybe if I can avoid
> that cancel_work_sync on the first NETDEV_UP then the old code can be used.

If this is going to take a lot of your time, I can work on it, since I'm able to reproduce the problem.
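
To illustrate why a return value from the cancel operation matters, here is a minimal user-space model. Everything in it is a hypothetical stand-in (struct fake_netdev, struct fake_work, queue_cleanup(), run_cleanup(), cancel_cleanup(), on_netdev_up()), and it assumes the deferred cleanup work holds a reference on the device while it is pending. It is not the attached patch and does not use any kernel API; it only shows the bookkeeping that becomes possible when the caller can tell whether pending work was actually cancelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device with a reference count. */
struct fake_netdev {
    int refcnt;
};

/* Hypothetical model of a deferred cleanup work item. */
struct fake_work {
    bool pending;            /* true between queueing and run/cancel */
    struct fake_netdev *dev;
};

/* Queue the cleanup; the pending work owns one device reference. */
static void queue_cleanup(struct fake_work *w)
{
    if (w->pending)
        return;              /* already queued, reference already held */
    w->dev->refcnt++;
    w->pending = true;
}

/* The work runs and drops the reference it owned. */
static void run_cleanup(struct fake_work *w)
{
    if (!w->pending)
        return;
    w->pending = false;
    assert(w->dev->refcnt > 0);
    w->dev->refcnt--;
}

/* Returns true only if pending work was actually cancelled. */
static bool cancel_cleanup(struct fake_work *w)
{
    bool was_pending = w->pending;

    w->pending = false;
    return was_pending;
}

/* Device-up path: drop the work's reference only if we cancelled it. */
static void on_netdev_up(struct fake_work *w)
{
    if (cancel_cleanup(w)) {
        assert(w->dev->refcnt > 0);
        w->dev->refcnt--;
    }
}
```

In this model, on_netdev_up() leaves the reference count unchanged when the work was never queued or has already run, and drops exactly one reference when it cancels pending work. Without the boolean result, the caller would have to either always drop a reference (underflow) or never drop one (leak).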

Comment 3 John W. Linville 2009-11-30 16:26:32 UTC
Created attachment 374806 [details]
jwltest-wireless-cleanup_work.patch

Does this avoid the deadlock?

Comment 4 Stanislaw Gruszka 2009-12-01 14:22:19 UTC
Yes, the patch fixes the deadlock.

Fixed in kernel-2.6.18-175.el5.jwltest.95.3.i686.rpm

Comment 5 John W. Linville 2009-12-01 16:28:51 UTC
Yeah, but it seems to leak a reference to the netdevice, making it impossible to successfully remove the module...

I'm going to pull this out of the patches I post today...I'm sure I can find a fix, but may need to work the exception process... :-(

Comment 6 RHEL Program Management 2009-12-01 19:54:03 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 John W. Linville 2009-12-02 02:52:14 UTC
Created attachment 375286 [details]
jwltest-wireless-cleanup_work.patch

This seems to avoid the rfkill lockup _and_ still allow for the netdev to close... :-)

Comment 8 Stanislaw Gruszka 2009-12-02 10:14:53 UTC
Patch works for me as well. I tested using:

for ((i = 0; i < 10; i++)) ; do
        ifconfig wlan0 down
        ifconfig wlan0 up
done
rmmod iwl3945

Without the patch, rmmod fails.

Comment 11 Cameron Meadors 2009-12-09 15:26:16 UTC
QA_ACK 5.5 RHEL

Looks like this was caused by enabling other priority hardware.  Can't regress functionality.  Reproducer is in initial description and we have the hardware.

Comment 12 Don Zickus 2009-12-15 20:19:44 UTC
in kernel-2.6.18-181.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 21 errata-xmlrpc 2010-03-30 07:44:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

