Red Hat Bugzilla – Bug 625889
Kernel panic on ath5k
Last modified: 2011-01-17 13:00:16 EST
Created attachment 440000 [details]
Crashdump backtrace for described panic
Description of problem:
Kernel starts panic'ing semi-randomly when logged in. Occurs once or twice a day.
Version-Release number of selected component (if applicable):
Use the system as a work station.
Steps to Reproduce:
3. Be connected to a network
Kernel panic at random intervals
No kernel panic at random intervals
Ethernet controller: 3Com Corporation AR5212 802.11abg NIC (3CRDAG675) (rev 01)
AMD Athlon XP 2800+ with 1,5GB ram.
Three warnings, then an oops:
WARNING: at net/mac80211/tx.c:644 invoke_tx_handlers+0x6a1/0xa96 [mac80211]()
WARNING: at net/mac80211/tx.c:57 ieee80211_duration+0x49/0x1ab [mac80211]()
WARNING: at include/net/mac80211.h:1059 ath5k_tx_queue+0x1e9/0x501 [ath5k]()
BUG: unable to handle kernel NULL pointer dereference at 00000006
IP: [<fa596570>] ath5k_tx_queue+0x22c/0x501 [ath5k]
hw_rate = (rc_flags & IEEE80211_TX_RC_USE_SHORT_PREAMBLE) ?
rate->hw_value_short : rate->hw_value;
rate is NULL
Created attachment 440744 [details]
At least avoid the NULL dereference at the end...
Test kernel w/ above patch building here:
Does this avoid the NULL pointer dereference? Unfortunately the other warnings will still occur...
Yeah, I hate this bug (but haven't seen it in a while).
What rate controller are you using? If minstrel, looking at rc_stats under /sys/debug/... might be illustrative.
I have the same problem on my machine (IBM Thinkpad T41p, same WLAN hardware as Roy).
Although the oopses seem to occur at random intervals, their probability increases when data is transmitted over multiple "connections". For example, I can trigger this bug with non-negligible probability by downloading large files via bittorrent or by opening many web pages at the same time.
This bug seems to be a regression. I never had this problem with kernels up to 2.6.31.*; it started to occur with 2.6.32 (Ubuntu 10.04) and 2.6.33 (Fedora 13).
Is there any further information I could provide?
Yes, provide the aforementioned rc_stats file.
I can't remember: in which kernel did minstrel become default?
I've have had this or something very similar since 2.6.33-6.
Aug 30 15:16:09 daddy kernel: irq 17: nobody cared (try booting with the "irqpoll" option)
Aug 30 15:16:09 daddy kernel: Pid: 2996, comm: npviewer.bin Not tainted 18.104.22.168-149.fc13.i686.PAE #1
Aug 30 15:16:09 daddy kernel: Call Trace:
Aug 30 15:16:09 daddy kernel: [<c0480d06>] __report_bad_irq+0x2e/0x6f
Aug 30 15:16:09 daddy kernel: [<c0480e3c>] note_interrupt+0xf5/0x14d
Aug 30 15:16:09 daddy kernel: [<c04813ef>] handle_fasteoi_irq+0x85/0xa4
Aug 30 15:16:09 daddy kernel: [<c040a33b>] handle_irq+0x3b/0x48
Aug 30 15:16:09 daddy kernel: [<c0409bc0>] do_IRQ+0x41/0x9a
Aug 30 15:16:09 daddy kernel: [<c0408df0>] common_interrupt+0x30/0x38
Aug 30 15:16:09 daddy kernel: [<c0780000>] ? profile_cpu_callback+0xe7/0x192
Aug 30 15:16:09 daddy kernel: handlers:
Aug 30 15:16:09 daddy kernel: [<c0687c63>] (usb_hcd_irq+0x0/0x6a)
Aug 30 15:16:09 daddy kernel: [<c0687c63>] (usb_hcd_irq+0x0/0x6a)
Aug 30 15:16:09 daddy kernel: [<f8a355af>] (ath5k_intr+0x0/0x1d5 [ath5k])
Aug 30 15:16:09 daddy kernel: Disabling IRQ #17
Aug 30 15:16:26 daddy kernel: ath5k phy0: failed to wakeup the MAC Chip
Aug 30 15:16:26 daddy kernel: ath5k phy0: can't reset hardware (-5)
Sometimes the NM can restart the chip, but often it leads to a kernel panic.
Should I open a new bug? If not, see the attached syslog where it leads to a kernel panic.
Created attachment 442054 [details]
syslog excerpt: kernel panic after irq 17 disabled
sean, you seem to have a lot of problems in that syslog. I don't think the problem is from ath5k. It looks like your USB controller (or possibly something else) is raising an interrupt that no one knows how to handle. In any case, I think you should file a separate bug.
Created attachment 442252 [details]
sample rc_stats file of minstrel rate controller
The rate control algorithm is minstrel. I attached a sample rc_stats file, but I don't think it gives a hint to the cause of the bug.
If I understand it correctly, the NULL ptr dereference is a consequence of the array of possible tx rates being empty (= element at index 0 is -1). Is there a situation where this would actually be a valid result?
I noticed that there are times when this error seems to occur with higher probability than at other times (e.g., 3 times per hour vs. once per day). A possible cause for this behaviour could be the different (number of) other WLANs present at different times.
Besides, the problem seems to occur less frequently with the latest kernel in F13 (22.214.171.124-149.fc13.i686) than with the kernel from the F13 release (126.96.36.199-85.fc13.i686).
(I don't know how reliable these observations are, the problem occurs not frequently enough to collect sufficient data)
AFAIK, tx rate being "empty" is only valid for hardware that does its own rate scaling. That is not the case for ath5k.
rate_control_get_rate initializes those idx values to -1, then exits if IEEE80211_HW_HAS_RATE_CONTROL is set. The rate control get_rate method is called immediately after that. There is some filtering and altering of rates to account for unsupported rates having been selected after that. That would seem like an obvious suspect, except that the end of the function does this:
BUG_ON(info->control.rates.idx < 0);
That is essentially the same check that was being hit in ieee80211_get_tx_rate. So unless I'm missing something, one of the following would have to be true:
-- IEEE80211_HW_HAS_RATE_CONTROL is sometimes set for ath5k
(perhaps from memory corruption)
-- something can corrupt tx_info->control.rates.idx before
ieee80211_get_tx_rate is called
-- rate_control_get_rate doesn't always get called in the tx path?
-- I'm really missing something??
Copying Johannes in case this provokes his brilliance... :-)
It reminds me of this one, in that symptom was the same and it only happened if certain environmental circumstances were true, but of course this particular bug should be fixed. Maybe a run with kmemcheck would be useful.
Author: Bob Copeland <email@example.com>
Date: Fri Jun 5 08:21:50 2009 -0400
mac80211: fix minstrel single-rate memory corruption
The minstrel rate controller periodically looks up rate indexes in
a sampling table. When accessing a specific row and column, minstrel
correctly does a bounds check which, on the surface, appears to handle
the case where mi->n_rates < 2. However, mi->sample_idx is actually
defined as an unsigned, so the right hand side is taken to be a huge
positive number when negative, and the check will always fail.
Consequently, the RC will overrun the array and cause random memory
corruption when communicating with a peer that has only a single rate.
The max value of mi->sample_idx is around 25 so casting to int should
have no ill effects.
Without the change, uptime is a few minutes under load with an AP
that has a single hard-coded rate, and both the AP and STA could
potentially crash. With the change, both lasted 12 hours with a
Thanks to Ognjen Maric for providing the single-rate clue so I could
This fixes http://bugzilla.kernel.org/show_bug.cgi?id=12490 on the
regression list (also http://bugzilla.kernel.org/show_bug.cgi?id=13000).
Reported-by: Sergey S. Kostyliov <firstname.lastname@example.org>
Reported-by: Ognjen Maric <email@example.com>
Signed-off-by: Bob Copeland <firstname.lastname@example.org>
Signed-off-by: John W. Linville <email@example.com>
(In reply to comment #13)
> AFAIK, tx rate being "empty" is only valid for hardware that does its own rate
> scaling. That is not the case for ath5k.
> rate_control_get_rate initializes those idx values to -1, then exits if
> IEEE80211_HW_HAS_RATE_CONTROL is set.
In fact, it won't even get called when that is set (except for beacons, oddly)
> The rate control get_rate method is
> called immediately after that. There is some filtering and altering of rates
> to account for unsupported rates having been selected after that. That would
> seem like an obvious suspect, except that the end of the function does this:
> BUG_ON(info->control.rates.idx < 0);
Yeah, that'd be a bug in the rate control algorithm code.
> That is essentially the same check that was being hit in ieee80211_get_tx_rate.
> So unless I'm missing something, one of the following would have to be true:
> -- IEEE80211_HW_HAS_RATE_CONTROL is sometimes set for ath5k
> (perhaps from memory corruption)
> -- something can corrupt tx_info->control.rates.idx before
> ieee80211_get_tx_rate is called
> -- rate_control_get_rate doesn't always get called in the tx path?
I don't see anything else either. Maybe ath5k uses the driver area already, or something, since in the skb TX info a lot of stuff is unioned?
(In reply to comment #4)
> Yeah, I hate this bug (but haven't seen it in a while).
> What rate controller are you using? If minstrel, looking at rc_stats under
> /sys/debug/... might be illustrative.
Sorry for the delay. Lectures have started again, making my free time shrink to subzero.
Could you give me more info on finding the exact rc_stats "file", and finding the right rate controller. I'm hardly familiar with wifi drivers :-). And should I do this from within crash (or gdb... and how?), to get info from the crashed kernel, or just from a regular session?
Also it might have disappeared with the latest and greatest (188.8.131.52-47.fc13), which is one of the more frustrating parts of these kind of bugs. It might be there, but I cannot reproduce it in a sensible way.
The rc_stats file previously posted looked ok, so posting that's probably not necessary. I think using kmemcheck on the crashing kernel would be more illustrative. To do that, though, I think you have to compile your own kernel.
John, what about slub debugging: is that enabled in fedora kernels?
My guess is this is some kind of memory corruption. The only parts of txinfo struct that ath5k uses are flags and the status struct.
Yes, CONFIG_SLUB_DEBUG=y in Fedora kernels.
Slub debug is there but it's not enabled by default, you need to add a slub_debug=<option> to the kernel command line. I'm not sure what options you'd want there though.
Bob, any thoughts on what we should look at w/ slub debug?
Well I guess most memory corruptions by walking over the sampling array would show up as red zoning or poison overwritten unless it just crashes outright, so something like slub_debug=FPZ ?
Hmmmm...been awhile. Is this still a problem with currently updated Fedora kernels?
For me, this problem became less and less frequent during the development of 2.6.33 and seems to have disappeared with some later 2.6.34 kernel (I have no Fedora system available at the moment, though).
OK, thanks for the update...closing on the basis of comment 23...