I'm running kernel-220.127.116.11-108.fc9.i686, on a thinkpad T60p laptop, with the iwl3945 wifi driver. I'm creating an ad-hoc network with the following commands :
ifconfig wlan0 down
iwconfig wlan0 mode ad-hoc
ifconfig wlan0 up
iwconfig wlan0 essid xxxxx
iwconfig wlan0 channel 1
iwconfig wlan0 key yyyy-yyyy-yy
ifconfig wlan0 10.42.44.1 netmask 255.255.255.0 up
[some iptables rules to setup masquerading]
With another machine (a nokia N800), I associate to this network. This is not systematic, but *sometimes*, when wireless association completes, the process events/0 starts taking 100% CPU time. The laptop doesn't receive dhcp frames, and nothing happens. After waiting a few seconds, I cancel the association on the nokia N800. Then, events/0 still use 100% CPU time, and the keyboard becomes unresponsive. I can still control the desktop with the mouse only. After waiting approx 1 minute, events/0 then releases the CPU, and I gain back the control of the desktop with the keyboard. No error message in the logs.
This behaviour is not always reproductible. After a fresh boot, it works fine. After a suspend-to-ram/resume (iwl3945 unloaded/reloaded), it may fail in 50% of the cases approx. I could not discriminate a specific use-case that causes the problem to always occur.
If needed I can provide a log with iwl3945 debug enabled.
I captured a log, while the bug occured, with iwl3945 debug log enabled here:
The timeline is:
14.06.10 driver iwl3945 loaded
.22 start ad-hoc shared connection with NetworkManager
.35 associate request from nokia n800
.41 events/0 takes 100%cpu
.54 disassoc request from nokia n800
07.26 events/0 releases cpu
.33 driver unload
(contrary to my initial comment, tcpdump log shows that rx works, but tx doesnt)
Could I persuade you to recreate this with a current rawhide kernel?
Sure. I'm now running 2.6.27-0.244.rc2.git1.fc10.i686, and observed almost the same problem, except that two threads were using 50% cpu time each : events/0 and ksoftirqd/0.
I put new logs at http://bellet.info/iwl3945 with this timeline:
23:14:00 load driver iwl3945
:13 start ad-hoc shared connection with NetworkManager
:28 associate request from nokia n800
:35 events/0 and ksoftirqd/0 take 100%cpu
:45 disassoc request from nokia n800
15:18 events/0 and ksoftirqd/0 release cpu
:23 driver unload
Created attachment 314737 [details]
randomly trigerring 'l' sysrq while events/0 takes 100% cpu time
Back with kernel-18.104.22.168-108.fc9.i686. I tried to dump cpu stack while events/0 was using 100% cpu time, with sysrq, and obtained the attached log. It may be interesting to note that several dumps are located in _spin_unlock_irqrestore in iwl3945_irq_tasklet ?
Created attachment 314738 [details]
log with blocked tasks stack dump, after disassoc request from the peer
This part of the log occurs after the previous one, when the nokia N800 disassociates from the laptop. At this moment, events/0 still uses 100% cpu time for one minute or so, but the desktop is frozen. Most stuck tasks seem to be blocked on acquiring a mutex in rtln_lock. runnig kernel is kernel-22.214.171.124-108.fc9.i686
Rebuilding with CONFIG_MAC80211_DEBUGFS and CONFIG_IWLWIFI_DEBUGFS disabled seems to be much more reliable for me. I'm playing with these options off since an hour, and could not recreate the problem. I'm crossing fingers :)
The code holds the RTNL while adding all STA entries to debugfs, which of course also tries to grab various locks... Not entirely surprising that there is deadlock potential unfortunately. Any chance you could try to narrow this down by commenting out the following lines in sta_info.c in sta_info_debugfs_add_work() just for testing?
If you don't experience the long delays with those commented out, it probably indicates that the core debugfs stuff (like debugfs_create_dir()) is blocking rather than mac80211-specific locks like sta_lock.
With kernel-126.96.36.199-29.fc9.i686, I could not reproduce the bug.
Downgrading back to kernel-188.8.131.52-108.fc9.i686, I could reproduce the bug, with a slightly different behaviour. events/0 now takes 100% cpu time as soon as the configuration of the wlan interface completes. This may be due to the fact that I made my test in an environment with many more radio signals in my neighbourhood, than in August. Rebuilding mac80211 as suggested in comment #7 still causes delays.
I'll try again to reproduce the behaviour with kernel 2.6.26.
Closing on the basis of comment 8...