Bug 235956
Summary: | rt2500 Inconsistent lock state | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Martin Kristensen <martin> | ||||||||||
Component: | kernel | Assignee: | John W. Linville <linville> | ||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> | ||||||||||
Severity: | medium | Docs Contact: | |||||||||||
Priority: | medium | ||||||||||||
Version: | rawhide | CC: | cebbert, davej, fkooman, ivdoorn, mwallis | ||||||||||
Target Milestone: | --- | ||||||||||||
Target Release: | --- | ||||||||||||
Hardware: | x86_64 | ||||||||||||
OS: | Linux | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | 2.6.21-1.3163.fc7 | Doc Type: | Bug Fix | ||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2007-12-10 20:34:18 UTC | Type: | --- | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Attachments: |
|
Description
Martin Kristensen
2007-04-11 02:09:53 UTC
Created attachment 152191 [details]
dmesg from the affected system.
Created attachment 152192 [details]
Output from lspci -vvv from the affected system.
Although it's not obvious, Fedora 7 test release bugs should be filed against version "devel" rather than "test#". I'm moving these so they don't get lost. This has been a bulk message; we now return to your scheduled programming. Thank you. Matthew, I will keep that in mind. Thanks. The issue persists in the latest kernel: ================================= [ INFO: inconsistent lock state ] 2.6.20-1.3066.fc7 #1 --------------------------------- inconsistent {hardirq-on-W} -> {in-hardirq-R} usage. NetworkManager/2722 [HC1[1]:SC0[0]:HE0:SE1] takes: (dev_base_lock){-.?-}, at: [<ffffffff804111c6>] dev_get_by_index+0x17/0x40 {hardirq-on-W} state was registered at: [<ffffffff802a3185>] __lock_acquire+0x491/0xbd1 [<ffffffff802a3cbb>] lock_acquire+0x4c/0x65 [<ffffffff804104a3>] register_netdevice+0x2c2/0x366 [<ffffffff80263b32>] _write_lock_bh+0x2a/0x36 [<ffffffff804104a3>] register_netdevice+0x2c2/0x366 [<ffffffff80410581>] register_netdev+0x3a/0x48 [<ffffffff806a2f10>] loopback_init+0x10/0x12 [<ffffffff80686ab9>] init+0x219/0x316 [<ffffffff802a2871>] trace_hardirqs_on+0x136/0x15a [<ffffffff8025cfa8>] child_rip+0xa/0x12 [<ffffffff80263d40>] _spin_unlock_irq+0x2b/0x31 [<ffffffff8025c6bc>] restore_args+0x0/0x30 [<ffffffff806868a0>] init+0x0/0x316 [<ffffffff8025cf9e>] child_rip+0x0/0x12 [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 270364 hardirqs last enabled at (270363): [<ffffffff80263db6>] _spin_unlock_irqrestore+0x3f/0x47 hardirqs last disabled at (270364): [<ffffffff8026355e>] trace_hardirqs_off_thunk+0x35/0x67 softirqs last enabled at (270342): [<ffffffff80211d34>] __do_softirq+0xda/0xe3 softirqs last disabled at (270333): [<ffffffff8025d31c>] call_softirq+0x1c/0x28 other info that might help us debug this: 1 lock held by NetworkManager/2722: #0: (rtnl_mutex){--..}, at: [<ffffffff802622fe>] mutex_lock+0x2a/0x2e stack backtrace: Call Trace: <IRQ> [<ffffffff802a1ad7>] print_usage_bug+0x162/0x173 [<ffffffff802a237b>] mark_lock+0x1a2/0x442 [<ffffffff802a30d1>] __lock_acquire+0x3dd/0xbd1 [<ffffffff802a3cbb>] lock_acquire+0x4c/0x65 [<ffffffff804111c6>] dev_get_by_index+0x17/0x40 [<ffffffff80263bd6>] _read_lock+0x28/0x34 [<ffffffff804111c6>] dev_get_by_index+0x17/0x40 [<ffffffff881cbb92>] :mac80211:ieee80211_beacon_get+0x23/0x49d [<ffffffff802a266e>] mark_held_locks+0x53/0x7a [<ffffffff882165dc>] :rt2500pci:rt2500pci_interrupt+0x74/0x15d [<ffffffff80210753>] handle_IRQ_event+0x20/0x55 [<ffffffff802bd8ac>] handle_fasteoi_irq+0x9d/0xdd [<ffffffff8026bf7f>] do_IRQ+0xf6/0x166 [<ffffffff8025c666>] ret_from_intr+0x0/0xf <EOI> [<ffffffff80418353>] linkwatch_fire_event+0x69/0xd2 [<ffffffff80298f53>] queue_work+0x43/0x51 [<ffffffff8029946a>] queue_delayed_work+0x34/0xaf [<ffffffff80299511>] schedule_delayed_work+0x2c/0x31 [<ffffffff804183b7>] linkwatch_fire_event+0xcd/0xd2 [<ffffffff8041e89f>] netif_carrier_off+0x15/0x17 [<ffffffff881ce3a4>] :mac80211:ieee80211_open+0x30d/0x33b [<ffffffff80410d55>] dev_open+0x37/0x79 [<ffffffff8040f459>] dev_change_flags+0x5d/0x122 [<ffffffff80417b7f>] rtnl_setlink+0x2b7/0x3f6 [<ffffffff80263db6>] _spin_unlock_irqrestore+0x3f/0x47 [<ffffffff80417323>] rtnetlink_rcv_msg+0x1c2/0x1e5 [<ffffffff80424adf>] netlink_run_queue+0x72/0xfa [<ffffffff80417161>] rtnetlink_rcv_msg+0x0/0x1e5 [<ffffffff8041710c>] rtnetlink_rcv+0x32/0x50 [<ffffffff80424ef0>] netlink_data_ready+0x1a/0x59 [<ffffffff804240b6>] netlink_sendskb+0x29/0x44 [<ffffffff804247ec>] netlink_unicast+0x1e8/0x207 [<ffffffff80424ec3>] netlink_sendmsg+0x2b1/0x2c4 [<ffffffff80253afc>] sock_sendmsg+0x105/0x124 [<ffffffff80230f4b>] sock_recvmsg+0x116/0x137 [<ffffffff8029c443>] autoremove_wake_function+0x0/0x38 [<ffffffff802a2871>] trace_hardirqs_on+0x136/0x15a [<ffffffff80263d40>] _spin_unlock_irq+0x2b/0x31 [<ffffffff804080f6>] move_addr_to_kernel+0x40/0x49 [<ffffffff8040d4de>] verify_iovec+0x4f/0x91 [<ffffffff80407efa>] sys_sendmsg+0x1e1/0x253 [<ffffffff8029e8ba>] up_read+0x26/0x2a [<ffffffff8026619d>] do_page_fault+0x47a/0x7ed [<ffffffff802bb15b>] audit_syscall_entry+0x148/0x17e [<ffffffff8026cc40>] syscall_trace_enter+0x9a/0x9f [<ffffffff8025c2b5>] tracesys+0xdc/0xe1 Still the applies: ================================= [ INFO: inconsistent lock state ] 2.6.20-1.3069.fc7 #1 --------------------------------- inconsistent {hardirq-on-W} -> {in-hardirq-R} usage. NetworkManager/2715 [HC1[1]:SC0[0]:HE0:SE1] takes: (dev_base_lock){-.?-}, at: [<ffffffff804111ec>] dev_get_by_index+0x17/0x40 {hardirq-on-W} state was registered at: [<ffffffff802a31a2>] __lock_acquire+0x491/0xbd1 [<ffffffff802a3cd8>] lock_acquire+0x4c/0x65 [<ffffffff804104c9>] register_netdevice+0x2c2/0x366 [<ffffffff80263b32>] _write_lock_bh+0x2a/0x36 [<ffffffff804104c9>] register_netdevice+0x2c2/0x366 [<ffffffff804105a7>] register_netdev+0x3a/0x48 [<ffffffff806a2f10>] loopback_init+0x10/0x12 [<ffffffff80686ab9>] init+0x219/0x316 [<ffffffff802a288e>] trace_hardirqs_on+0x136/0x15a [<ffffffff8025cfa8>] child_rip+0xa/0x12 [<ffffffff80263d40>] _spin_unlock_irq+0x2b/0x31 [<ffffffff8025c6bc>] restore_args+0x0/0x30 [<ffffffff806868a0>] init+0x0/0x316 [<ffffffff8025cf9e>] child_rip+0x0/0x12 [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 270658 hardirqs last enabled at (270657): [<ffffffff80263db6>] _spin_unlock_irqrestore+0x3f/0x47 hardirqs last disabled at (270658): [<ffffffff8026355e>] trace_hardirqs_off_thunk+0x35/0x67 softirqs last enabled at (270634): [<ffffffff80211d34>] __do_softirq+0xda/0xe3 softirqs last disabled at (270577): [<ffffffff8025d31c>] call_softirq+0x1c/0x28 other info that might help us debug this: 1 lock held by NetworkManager/2715: #0: (rtnl_mutex){--..}, at: [<ffffffff802622fe>] mutex_lock+0x2a/0x2e stack backtrace: Call Trace: <IRQ> [<ffffffff802a1af4>] print_usage_bug+0x162/0x173 [<ffffffff802a2398>] mark_lock+0x1a2/0x442 [<ffffffff802a30ee>] __lock_acquire+0x3dd/0xbd1 [<ffffffff802a3cd8>] lock_acquire+0x4c/0x65 [<ffffffff804111ec>] dev_get_by_index+0x17/0x40 [<ffffffff80263bd6>] _read_lock+0x28/0x34 [<ffffffff804111ec>] dev_get_by_index+0x17/0x40 [<ffffffff8818fb92>] :mac80211:ieee80211_beacon_get+0x23/0x49d [<ffffffff802a268b>] mark_held_locks+0x53/0x7a [<ffffffff8029a2f7>] __rcu_process_callbacks+0x187/0x1f3 [<ffffffff8029a2f7>] __rcu_process_callbacks+0x187/0x1f3 [<ffffffff881c05dc>] :rt2500pci:rt2500pci_interrupt+0x74/0x15d [<ffffffff80210753>] handle_IRQ_event+0x20/0x55 [<ffffffff802bd8c9>] handle_fasteoi_irq+0x9d/0xdd [<ffffffff8026bf7f>] do_IRQ+0xf6/0x166 [<ffffffff8025c666>] ret_from_intr+0x0/0xf <EOI> [<ffffffff80418379>] linkwatch_fire_event+0x69/0xd2 [<ffffffff80263d03>] _read_lock_irq+0x29/0x3b [<ffffffff80263db9>] _spin_unlock_irqrestore+0x42/0x47 [<ffffffff804183a1>] linkwatch_fire_event+0x91/0xd2 [<ffffffff8041e8c5>] netif_carrier_off+0x15/0x17 [<ffffffff881923a4>] :mac80211:ieee80211_open+0x30d/0x33b [<ffffffff80410d7b>] dev_open+0x37/0x79 [<ffffffff8040f47f>] dev_change_flags+0x5d/0x122 [<ffffffff80417ba5>] rtnl_setlink+0x2b7/0x3f6 [<ffffffff80263db6>] _spin_unlock_irqrestore+0x3f/0x47 [<ffffffff80417349>] rtnetlink_rcv_msg+0x1c2/0x1e5 [<ffffffff80424b05>] netlink_run_queue+0x72/0xfa [<ffffffff80417187>] rtnetlink_rcv_msg+0x0/0x1e5 [<ffffffff80417132>] rtnetlink_rcv+0x32/0x50 [<ffffffff80424f16>] netlink_data_ready+0x1a/0x59 [<ffffffff804240dc>] netlink_sendskb+0x29/0x44 [<ffffffff80424812>] netlink_unicast+0x1e8/0x207 [<ffffffff80424ee9>] netlink_sendmsg+0x2b1/0x2c4 [<ffffffff80253afc>] sock_sendmsg+0x105/0x124 [<ffffffff80230f4b>] sock_recvmsg+0x116/0x137 [<ffffffff8029c460>] autoremove_wake_function+0x0/0x38 [<ffffffff802a288e>] trace_hardirqs_on+0x136/0x15a [<ffffffff80263d40>] _spin_unlock_irq+0x2b/0x31 [<ffffffff8040811c>] move_addr_to_kernel+0x40/0x49 [<ffffffff8040d504>] verify_iovec+0x4f/0x91 [<ffffffff80407f20>] sys_sendmsg+0x1e1/0x253 [<ffffffff8029e8d7>] up_read+0x26/0x2a [<ffffffff8026619d>] do_page_fault+0x47a/0x7ed [<ffffffff802bb178>] audit_syscall_entry+0x148/0x17e [<ffffffff8026cc40>] syscall_trace_enter+0x9a/0x9f [<ffffffff8025c2b5>] tracesys+0xdc/0xe1 Another one for the experts... :-) Could you enable debugging output of the rt2500pci driver (load rt2500pci with the debug=1 option) and attach /var/log/messsages to this bugreport? *** Bug 238537 has been marked as a duplicate of this bug. *** ok my estimation of the bug: Since this is a locking issue, the problem is not occuring inside rt2500pci, (rt2x00 doesn't use spinlocking/mutexes etc.). The problem seems to be that ieee80211_beacon_get() is called in interrupt context and somewhere along that line the bug occurs. But what I also notice is that within the Stacktrace, mac80211_open() is present which suggests that the interrupt might have fired while the radio is disabled. This needs further investigation, if you could provide the requested debug information it would be most useful. Soory for the delay in getting back, with the information you asked for. I am not sure if I did this correctly, but I added debug=1 to the ra2500pci line in the /etc/modprobe.conf file. The output from /var/log/messages is attached. Created attachment 154004 [details]
Debug output from ra2500pci
Created attachment 154092 [details]
git-wireless-dev-rt2x00-locking-fix.patch
Look for this patch in rawhide kernel >= 1.3132...it would be nice if we can
teach rt2x00 not to call ieee80211_beacon_get in hardirq context...
well the tx/rx/beacon done code was all moved out of workqueue some weeks ago because they had a too negative impact on the performance due to the context switching. So if they should be moved out of irq context again, it would be probably be required if they will run with some higher priority. Perhaps that tasklets would be a better option? on the other hand, the ieee80211_rx and ieee80211_tx have special _irq versions, so perhaps a ieee80211_beacon_get_irq would be an option? Or would it be best if as much as possible really moved towards takslets or something similar? Kernels w/ patch from comment 12 available here: http://koji.fedoraproject.org/packages/kernel/2.6.21/1.3132.fc7/ Martin, please try to recreate the issue using those kernels, and post the results here...thanks! Ivo, changing to a tasklet doesn't change the problem much because we would still need a new (or custom) version of dev_get_by_index but for _bh instead of _irq. And having ieee80211_beacon_get_irq doesn't seem very appealing because it just means maintaining more code. How big a performance hit was it to use workqueues instead? Well doing txdone and rxdone inside a workqueue will slow down the average transfer rate with 50% (as reported by Jan Kiszka). Since the beacondone handler is time specific (the interrupt is triggered when the new beacon should be send to the device), I am not sure if it should wait too long before responing (i.e. Does switching context a few times for responding to the interrupt not take too much time?). Note the workqueue vs no-workqueue tests were done with all interrupts in workqueues and nothing in workqueues. So although the scheduled beacondone will most likely have no effect on the avarege transfer rates, the performance hit rxdone and txdone seem to have could indicate beacondone might run into problems as well. (Master mode and adhoc mode weren't really working when everything was handled in workqueues, but that could also be caused by several bugs that have now been fixed). kernel: kernel-2.6.21-1.3139.fc7 (from koji) The problem is still there: pccard: CardBus card inserted into slot 1 Loading module: rt2x00lib - CVS (N/A) by http://rt2x00.serialmonkey.com. Loading module: rt2x00pci - CVS (N/A) by http://rt2x00.serialmonkey.com. Loading module: rt2500pci - CVS (N/A) by http://rt2x00.serialmonkey.com. PCI: Enabling device 0000:06:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:06:00.0[A] -> Link [LNKB] -> GSI 10 (level, low) -> IRQ 10 PCI: Setting latency timer of device 0000:06:00.0 to 64 wmaster0: Selected rate control algorithm 'simple' ================================= [ INFO: inconsistent lock state ] 2.6.21-1.3139.fc7 #1 --------------------------------- inconsistent {hardirq-on-W} -> {in-hardirq-R} usage. NetworkManager/2171 [HC1[1]:SC0[0]:HE0:SE1] takes: (dev_base_lock){-.?-}, at: [<d0e907f0>] ieee80211_beacon_get+0x1b/0x42c [mac80211] {hardirq-on-W} state was registered at: [<c04424b2>] __lock_acquire+0x448/0xba4 [<c0443000>] lock_acquire+0x56/0x6f [<c0616b0b>] _write_lock_bh+0x30/0x3d [<c05b3cb2>] register_netdevice+0x278/0x300 [<c05b5045>] register_netdev+0x40/0x4d [<c075a553>] loopback_init+0xd/0xf [<c073d699>] init+0x1f1/0x2d6 [<c0405cd3>] kernel_thread_helper+0x7/0x10 [<ffffffff>] 0xffffffff irq event stamp: 20658 hardirqs last enabled at (20657): [<c04050c0>] restore_nocheck+0x12/0x15 hardirqs last disabled at (20658): [<c0405a50>] common_interrupt+0x24/0x34 softirqs last enabled at (20656): [<c042bfe0>] __do_softirq+0xdc/0xe2 softirqs last disabled at (20651): [<c0407420>] do_softirq+0x61/0xd0 other info that might help us debug this: 1 lock held by NetworkManager/2171: #0: (rtnl_mutex){--..}, at: [<c061578a>] mutex_lock+0x21/0x24 stack backtrace: [<c04061e9>] show_trace_log_lvl+0x1a/0x2f [<c04067ad>] show_trace+0x12/0x14 [<c0406831>] dump_stack+0x16/0x18 [<c0440f99>] print_usage_bug+0x141/0x14b [<c044177d>] mark_lock+0x18d/0x419 [<c04423e8>] __lock_acquire+0x37e/0xba4 [<c0443000>] lock_acquire+0x56/0x6f [<c0616d02>] _read_lock_irq+0x31/0x3e [<d0e907f0>] ieee80211_beacon_get+0x1b/0x42c [mac80211] [<d0e5b716>] rt2x00pci_beacondone+0x32/0x56 [rt2x00pci] [<d0e74091>] rt2500pci_interrupt+0x49/0x203 [rt2500pci] [<c0458c52>] handle_IRQ_event+0x1a/0x46 [<c045a15f>] handle_level_irq+0x90/0xd9 [<c0407540>] do_IRQ+0xb1/0xd9 [<c0405a5a>] common_interrupt+0x2e/0x34 [<c05b573b>] dev_open+0x2e/0x66 [<c05b409d>] dev_change_flags+0x51/0xf1 [<c05bc488>] rtnl_setlink+0x265/0x37d [<c05bba9a>] rtnetlink_rcv_msg+0x1bb/0x1de [<c05c8139>] netlink_run_queue+0x65/0xd1 [<c05bb895>] rtnetlink_rcv+0x29/0x42 [<c05c84d7>] netlink_data_ready+0x15/0x56 [<c05c75a9>] netlink_sendskb+0x1f/0x37 [<c05c7e94>] netlink_unicast+0x1ab/0x1c5 [<c05c84b6>] netlink_sendmsg+0x271/0x27d [<c05ab577>] sock_sendmsg+0xe7/0x104 [<c05ab6e5>] sys_sendmsg+0x151/0x1af [<c05ac6e1>] sys_socketcall+0x220/0x241 [<c0405078>] syscall_call+0x7/0xb ======================= wlan0: Initial auth_alg=0 wlan0: authenticate with AP 00:0f:66:5b:1c:e2 wlan0: RX authentication from 00:0f:66:5b:1c:e2 (alg=0 transaction=2 status=0) wlan0: authenticated wlan0: associate with AP 00:0f:66:5b:1c:e2 wlan0: RX AssocResp from 00:0f:66:5b:1c:e2 (capab=0x411 status=0 aid=1) wlan0: associated eth0: link down But it works now with my accesspoint with WPA2 (wiee!), and pretty fast too! (2.5MByte/s from wired machine to this wireless machine) I don't see this problem anymore, now running kernel-2.6.21-1.3163.fc7. Next rt2x00 (2.0.7) release will refresh the beacons inside scheduled context so the workaround will not be needed anymore after that. rt2x00 2.0.7 has been released containing a fix for this issue |