Version-Release number of selected component (if applicable):
kernel-2.6.18-175.el5.jwltest.95.2.i686.rpm

Steps to Reproduce:
Use the rfkill switch. I can reproduce it on my Lenovo T60 laptop.

Actual results:
The system becomes unreliable.

Additional info:
cfg80211_rfkill_set_block() acquires devlist_mtx and calls dev_close() -> cfg80211_netdev_notifier_call(NETDEV_DOWN), which also wants to acquire devlist_mtx. Here is the output from a debug kernel:

=============================================
[ INFO: possible recursive locking detected ]
2.6.18-175.el5.jwltest.95.2debug #1
---------------------------------------------
events/0/8 is trying to acquire lock:
 (&rdev->devlist_mtx){--..}, at: [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]

but task is already holding lock:
 (&rdev->devlist_mtx){--..}, at: [<f8a1f10e>] cfg80211_rfkill_set_block+0x1d/0x59 [cfg80211]

other info that might help us debug this:
2 locks held by events/0/8:
 #0: (rtnl_mutex){--..}, at: [<f8a1f0ff>] cfg80211_rfkill_set_block+0xe/0x59 [cfg80211]
 #1: (&rdev->devlist_mtx){--..}, at: [<f8a1f10e>] cfg80211_rfkill_set_block+0x1d/0x59 [cfg80211]

stack backtrace:
 [<c043c15b>] __lock_acquire+0x76a/0x981
 [<c043c633>] lock_acquire+0x5d/0x78
 [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]
 [<c0628f60>] mutex_lock_nested+0xdf/0x253
 [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]
 [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]
 [<f8a1fb56>] cfg80211_netdev_notifier_call+0x1f9/0x2f8 [cfg80211]
 [<c0624d25>] packet_notifier+0x139/0x141
 [<c062b6b4>] notifier_call_chain+0x19/0x29
 [<c05cc13e>] dev_close+0x5d/0x61
 [<f8a1f11e>] cfg80211_rfkill_set_block+0x2d/0x59 [cfg80211]
 [<c04339db>] run_workqueue+0x7e/0xbe
 [<f8a1f14a>] cfg80211_rfkill_sync_work+0x0/0x13 [cfg80211]
 [<c04342c8>] worker_thread+0xd9/0x10d
 [<c0439f26>] lock_release_holdtime+0x25/0x43
 [<c041fa1a>] default_wake_function+0x0/0xc
 [<c04341ef>] worker_thread+0x0/0x10d
 [<c0436781>] kthread+0xc0/0xeb
 [<c04366c1>] kthread+0x0/0xeb
 [<c0405d8f>] kernel_thread_helper+0x7/0x10
=======================

So far I'm not sure how this should be fixed.
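The recursion reported above can be modelled in user space. This is a minimal Python sketch, not kernel code: devlist_mtx is a plain non-reentrant threading.Lock, and the function names only mirror the cfg80211 ones in the trace. A non-blocking acquire is used so that the second lock attempt reports failure instead of hanging. The "unlocked" variant is hypothetical and exists only for contrast; it is not taken from any patch in this report.

```python
import threading

devlist_mtx = threading.Lock()  # non-reentrant, standing in for the kernel mutex
events = []                     # records what each call observed


def netdev_notifier_call():
    """Stand-in for cfg80211_netdev_notifier_call(): needs devlist_mtx."""
    # blocking=False so a self-deadlock is reported rather than hanging.
    if not devlist_mtx.acquire(blocking=False):
        events.append("notifier: devlist_mtx already held (would deadlock)")
        return False
    try:
        events.append("notifier: ran with devlist_mtx held")
        return True
    finally:
        devlist_mtx.release()


def dev_close():
    """Stand-in for dev_close(): runs the notifier chain."""
    return netdev_notifier_call()


def rfkill_set_block_buggy():
    """Mirrors the trace: dev_close() is called with devlist_mtx held."""
    with devlist_mtx:
        return dev_close()


def rfkill_set_block_unlocked():
    """Hypothetical variant: devlist_mtx is not held across dev_close()."""
    return dev_close()


if __name__ == "__main__":
    print(rfkill_set_block_buggy())     # False: second acquire fails
    print(rfkill_set_block_unlocked())  # True: notifier gets the lock
    print("\n".join(events))
```

With a blocking acquire, as in the kernel, the buggy path would never return, which matches the lockdep warning.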
OK, I think I caused this in the backport... see the references to cleanup_work in net/wireless/core.c? I commented out the original code there because I was getting a hang on the call to cancel_work_sync() when the device was first brought up. IIRC I determined that this was due to the work never having been scheduled, and I wasn't sure how to detect that. I'll poke at that... maybe if I can avoid that cancel_work_sync() on the first NETDEV_UP then the old code can be used.
(In reply to comment #1)
> OK, I think I caused this in the backport...see the references to cleanup_work
> in net/wireless/core.c?

Yes. I have no clean solution for that, as RHEL has no equivalent of cancel_work_sync() with a return value.

> I commented-out the original code there because I was getting a hang on the
> call to cancel_work_sync when the device was first brought-up. IIRC I
> determined that this was due to the work having never been scheduled, and I
> wasn't sure how to determine that. I'll poke at that...maybe if I can avoid
> that cancel_work_sync on the first NETDEV_UP then the old code can be used.

If this would take you a lot of time, I can work on it, since I'm able to reproduce the problem.
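The missing piece in this exchange is a way to tell whether the cleanup work was ever queued, so that the cancel can be skipped (or its result acted on) at the first NETDEV_UP. The following Python sketch only illustrates that idea. TrackedWork and its methods are hypothetical names, the model is single-threaded with no locking, and it is not the approach taken by the attached patches, which this thread does not show.

```python
class TrackedWork:
    """Toy work item that remembers whether it is currently queued."""

    def __init__(self, func):
        self.func = func
        self.pending = False  # True only between schedule() and run()/cancel()

    def schedule(self):
        """Queue the work; returns False if it was already queued."""
        if self.pending:
            return False
        self.pending = True
        return True

    def run(self):
        """Called by an imaginary worker: execute only if still queued."""
        if self.pending:
            self.pending = False
            self.func()

    def cancel(self):
        """Dequeue the work; returns True only if it was actually queued."""
        was_pending = self.pending
        self.pending = False
        return was_pending


if __name__ == "__main__":
    work = TrackedWork(lambda: print("cleanup ran"))
    print(work.cancel())   # False: never scheduled, nothing to wait for
    work.schedule()
    print(work.cancel())   # True: caller must undo what the work would have done
```

A caller that gets True back knows the work did not run and can release whatever the work was supposed to release; a caller that gets False knows there was nothing to cancel.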
Created attachment 374806 [details]
jwltest-wireless-cleanup_work.patch

Does this avoid the deadlock?
Yes, the patch fixes the deadlock. Fixed in kernel-2.6.18-175.el5.jwltest.95.3.i686.rpm
Yeah, but it seems to leak a reference to the netdevice, making it impossible to successfully remove the module... I'm going to pull this out of the patches I post today...I'm sure I can find a fix, but may need to work the exception process... :-(
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 375286 [details]
jwltest-wireless-cleanup_work.patch

This seems to avoid the rfkill lockup _and_ still allow for the netdev to close... :-)
The patch works for me as well. I tested using:

    for ((i = 0; i < 10; i++)); do
        ifconfig wlan0 down
        ifconfig wlan0 up
    done
    rmmod iwl3945

Without the patch, rmmod fails.
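The same check can be scripted so that the rmmod result is captured explicitly. This is a rough Python sketch of the test above, assuming the wlan0 interface and iwl3945 module named in this comment. The function name and the run parameter are hypothetical; run is only a hook for dry runs. A real run needs root.

```python
import subprocess


def stress_and_unload(iface="wlan0", module="iwl3945", cycles=10,
                      run=subprocess.call):
    """Toggle the interface `cycles` times, then try to unload the module.

    Returns True if rmmod exits with status 0, which is the behaviour
    expected once the netdevice reference is no longer leaked.
    """
    for _ in range(cycles):
        run(["ifconfig", iface, "down"])
        run(["ifconfig", iface, "up"])
    return run(["rmmod", module]) == 0


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def show(cmd):
        print(" ".join(cmd))
        return 0

    stress_and_unload(cycles=2, run=show)
```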
QA_ACK 5.5 RHEL Looks like this was caused by enabling other priority hardware. Can't regress functionality. Reproducer is in initial description and we have the hardware.
in kernel-2.6.18-181.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html