Bug 459571 - iwl3945 + ad-hoc assoc causes events/0 to take 100% CPU time
iwl3945 + ad-hoc assoc causes events/0 to take 100% CPU time
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
9
i386 Linux
medium Severity medium
: ---
: ---
Assigned To: John W. Linville
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-08-20 05:45 EDT by Fabrice Bellet
Modified: 2008-10-23 13:33 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-10-23 13:33:50 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
randomly trigerring 'l' sysrq while events/0 takes 100% cpu time (23.38 KB, text/plain)
2008-08-21 14:05 EDT, Fabrice Bellet
no flags Details
log with blocked tasks stack dump, after disassoc request from the peer (25.85 KB, text/plain)
2008-08-21 14:11 EDT, Fabrice Bellet
no flags Details

  None (edit)
Description Fabrice Bellet 2008-08-20 05:45:57 EDT
I'm running kernel-2.6.25.14-108.fc9.i686, on a thinkpad T60p laptop, with the iwl3945 wifi driver. I'm creating an ad-hoc network with the following commands :

ifconfig wlan0 down
iwconfig wlan0 mode ad-hoc
ifconfig wlan0 up
iwconfig wlan0 essid xxxxx
iwconfig wlan0 channel 1
iwconfig wlan0 key yyyy-yyyy-yy
ifconfig wlan0 10.42.44.1 netmask 255.255.255.0 up
[some iptables rules to setup masquerading]
[dnsmasq invocation]

With another machine (a nokia N800), I associate to this network. This is not systematic, but *sometimes*, when wireless association completes, the process events/0 starts taking 100% CPU time. The laptop doesn't receive dhcp frames, and nothing happens. After waiting a few seconds, I cancel the association on the nokia N800. Then, events/0 still use 100% CPU time, and the keyboard becomes unresponsive. I can still control the desktop with the mouse only. After waiting approx 1 minute, events/0 then releases the CPU, and I gain back the control of the desktop with the keyboard. No error message in the logs.

This behaviour is not always reproductible. After a fresh boot, it works fine. After a suspend-to-ram/resume (iwl3945 unloaded/reloaded), it may fail in 50% of the cases approx. I could not discriminate a specific use-case that causes the problem to always occur.

If needed I can provide a log with iwl3945 debug enabled.
Comment 1 Fabrice Bellet 2008-08-20 08:35:24 EDT
I captured a log, while the bug occured,  with iwl3945 debug log enabled here:

http://bellet.info/iwl3945

The timeline is:

14.06.10 driver iwl3945 loaded
     .22 start ad-hoc shared connection with NetworkManager
     .35 associate request from nokia n800
     .41 events/0 takes 100%cpu
     .54 disassoc request from nokia n800
   07.26 events/0 releases cpu
     .33 driver unload

(contrary to my initial comment, tcpdump log shows that rx works, but tx doesnt)
Comment 2 John W. Linville 2008-08-20 11:32:40 EDT
Could I persuade you to recreate this with a current rawhide kernel?
Comment 3 Fabrice Bellet 2008-08-20 17:40:10 EDT
Sure. I'm now running 2.6.27-0.244.rc2.git1.fc10.i686, and observed almost the same problem, except that two threads were using 50% cpu time each : events/0 and ksoftirqd/0.

I put new logs at http://bellet.info/iwl3945 with this timeline:

23:14:00 load driver iwl3945
     :13 start ad-hoc shared connection with NetworkManager
     :28 associate request from nokia n800
     :35 events/0 and ksoftirqd/0 take 100%cpu
     :45 disassoc request from nokia n800
   15:18 events/0 and ksoftirqd/0 release cpu
     :23 driver unload
Comment 4 Fabrice Bellet 2008-08-21 14:05:04 EDT
Created attachment 314737 [details]
randomly trigerring 'l' sysrq while events/0 takes 100% cpu time

Back with kernel-2.6.25.14-108.fc9.i686. I tried to dump cpu stack while events/0 was using 100% cpu time, with sysrq, and obtained the attached log. It may be interesting to note that several dumps are located in _spin_unlock_irqrestore in iwl3945_irq_tasklet ?
Comment 5 Fabrice Bellet 2008-08-21 14:11:13 EDT
Created attachment 314738 [details]
log with blocked tasks stack dump, after disassoc request from the peer

This part of the log occurs after the previous one, when the nokia N800 disassociates from the laptop. At this moment, events/0 still uses 100% cpu time for one minute or so, but the desktop is frozen. Most stuck tasks seem to be blocked on acquiring a mutex in rtln_lock. runnig kernel is kernel-2.6.25.14-108.fc9.i686
Comment 6 Fabrice Bellet 2008-08-21 17:27:44 EDT
Rebuilding with CONFIG_MAC80211_DEBUGFS and CONFIG_IWLWIFI_DEBUGFS disabled seems to be much more reliable for me. I'm playing with these options off since an hour, and could not recreate the problem. I'm crossing fingers :)
Comment 7 Dan Williams 2008-09-21 17:27:56 EDT
The code holds the RTNL while adding all STA entries to debugfs, which of course also tries to grab various locks...  Not entirely surprising that there is deadlock potential unfortunately.  Any chance you could try to narrow this down by commenting out the following lines in sta_info.c in sta_info_debugfs_add_work() just for testing?

		ieee80211_sta_debugfs_add(sta);
		rate_control_add_sta_debugfs(sta);

If you don't experience the long delays with those commented out, it probably indicates that the core debugfs stuff (like debugfs_create_dir()) is blocking rather than mac80211-specific locks like sta_lock.
Comment 8 Fabrice Bellet 2008-09-23 04:53:39 EDT
With kernel-2.6.26.3-29.fc9.i686, I could not reproduce the bug.

Downgrading back to kernel-2.6.25.14-108.fc9.i686, I could reproduce the bug, with a slightly different behaviour. events/0 now takes 100% cpu time as soon as the configuration of the wlan interface completes. This may be due to the fact that I made my test in an environment with many more radio signals in my neighbourhood, than in August. Rebuilding mac80211 as suggested in comment #7 still causes delays.

I'll try again to reproduce the behaviour with kernel 2.6.26.
Comment 9 John W. Linville 2008-10-23 13:33:50 EDT
Closing on the basis of comment 8...

Note You need to log in before you can comment on or make changes to this bug.