Bug 625889

Summary:

Kernel panic on ath5k

Product:

[Fedora] Fedora

Reporter:

Roy <nouveau>

Component:

kernel

Assignee:

John W. Linville <linville>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

urgent

Docs Contact:

Priority:

low

Version:

CC:

anton, dougsland, gansalmon, harnold, itamar, johannes, jonathan, kernel-maint, madhu.chinakonda, me, seandarcy

Target Milestone:

---

Target Release:

---

Hardware:

i686

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-01-17 18:00:16 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Crashdump backtrace for described panic	none
0001-ath5k-check-return-value-of-ieee80211_get_tx_rate.patch	none
syslog excerpt: kernel panic after irq 17 disabled	none
sample rc_stats file of minstrel rate controller	none

Description Roy 2010-08-20 17:36:45 UTC

Created attachment 440000 [details]
Crashdump backtrace for described panic

Description of problem:
Kernel starts panic'ing semi-randomly when logged in. Occurs once or twice a day.

Version-Release number of selected component (if applicable):
Kernel 2.6.33.6-147.2.4.fc13.i686

How reproducible:
Use the system as a work station.

Steps to Reproduce:
1. Boot
2. Log-in
3. Be connected to a network
  
Actual results:
Kernel panic at random intervals

Expected results:
No kernel panic at random intervals

Additional info:
Ethernet controller: 3Com Corporation AR5212 802.11abg NIC (3CRDAG675) (rev 01)
AMD Athlon XP 2800+ with 1,5GB ram.

Comment 1 Chuck Ebbert 2010-08-22 14:12:34 UTC

Three warnings, then an oops:

WARNING: at net/mac80211/tx.c:644 invoke_tx_handlers+0x6a1/0xa96 [mac80211]()
WARNING: at net/mac80211/tx.c:57 ieee80211_duration+0x49/0x1ab [mac80211]()
WARNING: at include/net/mac80211.h:1059 ath5k_tx_queue+0x1e9/0x501 [ath5k]()


BUG: unable to handle kernel NULL pointer dereference at 00000006
IP: [<fa596570>] ath5k_tx_queue+0x22c/0x501 [ath5k]

drivers/net/wireless/ath/ath5k/base.c:1302:

        hw_rate = (rc_flags & IEEE80211_TX_RC_USE_SHORT_PREAMBLE) ?
                rate->hw_value_short : rate->hw_value;

rate is NULL

Comment 2 John W. Linville 2010-08-24 19:27:58 UTC

Created attachment 440744 [details]
0001-ath5k-check-return-value-of-ieee80211_get_tx_rate.patch

At least avoid the NULL dereference at the end...

Comment 3 John W. Linville 2010-08-24 19:42:13 UTC

Test kernel w/ above patch building here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=2424195

Does this avoid the NULL pointer dereference?  Unfortunately the other warnings will still occur...

Comment 4 Bob Copeland 2010-08-24 21:31:37 UTC

Yeah, I hate this bug (but haven't seen it in a while).

What rate controller are you using?  If minstrel, looking at rc_stats under /sys/debug/... might be illustrative.

Comment 5 Holger Arnold 2010-08-29 15:01:54 UTC

I have the same problem on my machine (IBM Thinkpad T41p, same WLAN hardware as Roy).  

Although the oopses seem to occur at random intervals, their probability increases when data is transmitted over multiple "connections".  For example, I can trigger this bug with non-negligible probability by downloading large files via bittorrent or by opening many web pages at the same time.

This bug seems to be a regression.  I never had this problem with kernels up to 2.6.31.*; it started to occur with 2.6.32 (Ubuntu 10.04) and 2.6.33 (Fedora 13).

Is there any further information I could provide?

Comment 6 Bob Copeland 2010-08-29 16:52:08 UTC

Yes, provide the aforementioned rc_stats file.

I can't remember: in which kernel did minstrel become default?

Comment 7 John W. Linville 2010-08-30 18:13:25 UTC

2.6.29

Comment 8 sean 2010-08-30 22:38:27 UTC

I've have had this or something very similar since 2.6.33-6.

Aug 30 15:16:09 daddy kernel: irq 17: nobody cared (try booting with the "irqpoll" option)
Aug 30 15:16:09 daddy kernel: Pid: 2996, comm: npviewer.bin Not tainted 2.6.33.8-149.fc13.i686.PAE #1
Aug 30 15:16:09 daddy kernel: Call Trace:
Aug 30 15:16:09 daddy kernel: [<c0480d06>] __report_bad_irq+0x2e/0x6f
Aug 30 15:16:09 daddy kernel: [<c0480e3c>] note_interrupt+0xf5/0x14d
Aug 30 15:16:09 daddy kernel: [<c04813ef>] handle_fasteoi_irq+0x85/0xa4
Aug 30 15:16:09 daddy kernel: [<c040a33b>] handle_irq+0x3b/0x48
Aug 30 15:16:09 daddy kernel: [<c0409bc0>] do_IRQ+0x41/0x9a
Aug 30 15:16:09 daddy kernel: [<c0408df0>] common_interrupt+0x30/0x38
Aug 30 15:16:09 daddy kernel: [<c0780000>] ? profile_cpu_callback+0xe7/0x192
Aug 30 15:16:09 daddy kernel: handlers:
Aug 30 15:16:09 daddy kernel: [<c0687c63>] (usb_hcd_irq+0x0/0x6a)
Aug 30 15:16:09 daddy kernel: [<c0687c63>] (usb_hcd_irq+0x0/0x6a)
Aug 30 15:16:09 daddy kernel: [<f8a355af>] (ath5k_intr+0x0/0x1d5 [ath5k])
Aug 30 15:16:09 daddy kernel: Disabling IRQ #17
Aug 30 15:16:26 daddy kernel: ath5k phy0: failed to wakeup the MAC Chip
Aug 30 15:16:26 daddy kernel: ath5k phy0: can't reset hardware (-5)
......

Sometimes the NM can restart the chip, but often it leads to a kernel panic.

Should I open a new bug? If not, see the attached syslog where it leads to a kernel panic.

Comment 9 sean 2010-08-30 22:42:05 UTC

Created attachment 442054 [details]
syslog excerpt: kernel panic after irq 17 disabled

Comment 10 John W. Linville 2010-08-31 15:20:01 UTC

sean, you seem to have a lot of problems in that syslog.  I don't think the problem is from ath5k.  It looks like your USB controller (or possibly something else) is raising an interrupt that no one knows how to handle.  In any case, I think you should file a separate bug.

Comment 11 Holger Arnold 2010-08-31 18:58:24 UTC

Created attachment 442252 [details]
sample rc_stats file of minstrel rate controller

The rate control algorithm is minstrel.  I attached a sample rc_stats file, but I don't think it gives a hint to the cause of the bug.

If I understand it correctly, the NULL ptr dereference is a consequence of the array of possible tx rates being empty (= element at index 0 is -1).  Is there a situation where this would actually be a valid result?

Comment 12 Holger Arnold 2010-08-31 19:15:19 UTC

I noticed that there are times when this error seems to occur with higher probability than at other times (e.g., 3 times per hour vs. once per day).  A possible cause for this behaviour could be the different (number of) other WLANs present at different times. 

Besides, the problem seems to occur less frequently with the latest kernel in F13 (2.6.33.8-149.fc13.i686) than with the kernel from the F13 release (2.6.33.3-85.fc13.i686).

(I don't know how reliable these observations are, the problem occurs not frequently enough to collect sufficient data)

Comment 13 John W. Linville 2010-09-01 19:41:20 UTC

AFAIK, tx rate being "empty" is only valid for hardware that does its own rate scaling.  That is not the case for ath5k.

rate_control_get_rate initializes those idx values to -1, then exits if IEEE80211_HW_HAS_RATE_CONTROL is set.  The rate control get_rate method is called immediately after that.  There is some filtering and altering of rates to account for unsupported rates having been selected after that.  That would seem like an obvious suspect, except that the end of the function does this:

   BUG_ON(info->control.rates[0].idx < 0);

That is essentially the same check that was being hit in ieee80211_get_tx_rate.  So unless I'm missing something, one of the following would have to be true:

   -- IEEE80211_HW_HAS_RATE_CONTROL is sometimes set for ath5k
      (perhaps from memory corruption)

   -- something can corrupt tx_info->control.rates[0].idx before
      ieee80211_get_tx_rate is called

   -- rate_control_get_rate doesn't always get called in the tx path?

   -- I'm really missing something??

Copying Johannes in case this provokes his brilliance... :-)

Comment 14 Bob Copeland 2010-09-01 23:11:01 UTC

It reminds me of this one, in that symptom was the same and it only happened if certain environmental circumstances were true, but of course this particular bug should be fixed.  Maybe a run with kmemcheck would be useful.

commit 5ee58d7e6ad019675b4090582aec4fa1180d8703
Author: Bob Copeland <me>
Date:   Fri Jun 5 08:21:50 2009 -0400

    mac80211: fix minstrel single-rate memory corruption
    
    The minstrel rate controller periodically looks up rate indexes in
    a sampling table.  When accessing a specific row and column, minstrel
    correctly does a bounds check which, on the surface, appears to handle
    the case where mi->n_rates < 2.  However, mi->sample_idx is actually
    defined as an unsigned, so the right hand side is taken to be a huge
    positive number when negative, and the check will always fail.
    
    Consequently, the RC will overrun the array and cause random memory
    corruption when communicating with a peer that has only a single rate.
    The max value of mi->sample_idx is around 25 so casting to int should
    have no ill effects.
    
    Without the change, uptime is a few minutes under load with an AP
    that has a single hard-coded rate, and both the AP and STA could
    potentially crash.  With the change, both lasted 12 hours with a
    steady load.
    
    Thanks to Ognjen Maric for providing the single-rate clue so I could
    reproduce this.
    
    This fixes http://bugzilla.kernel.org/show_bug.cgi?id=12490 on the
    regression list (also http://bugzilla.kernel.org/show_bug.cgi?id=13000).
    
    Cc: stable
    Reported-by: Sergey S. Kostyliov <rathamahata>
    Reported-by: Ognjen Maric <ognjen.maric>
    Signed-off-by: Bob Copeland <me>
    Signed-off-by: John W. Linville <linville>

Comment 15 Johannes Berg 2010-09-02 07:28:22 UTC

(In reply to comment #13)
> AFAIK, tx rate being "empty" is only valid for hardware that does its own rate
> scaling.  That is not the case for ath5k.

Right.

> rate_control_get_rate initializes those idx values to -1, then exits if
> IEEE80211_HW_HAS_RATE_CONTROL is set.

In fact, it won't even get called when that is set (except for beacons, oddly)

> The rate control get_rate method is
> called immediately after that.  There is some filtering and altering of rates
> to account for unsupported rates having been selected after that.  That would
> seem like an obvious suspect, except that the end of the function does this:
> 
>    BUG_ON(info->control.rates[0].idx < 0);

Yeah, that'd be a bug in the rate control algorithm code.

> That is essentially the same check that was being hit in ieee80211_get_tx_rate.
>  So unless I'm missing something, one of the following would have to be true:
> 
>    -- IEEE80211_HW_HAS_RATE_CONTROL is sometimes set for ath5k
>       (perhaps from memory corruption)
> 
>    -- something can corrupt tx_info->control.rates[0].idx before
>       ieee80211_get_tx_rate is called
> 
>    -- rate_control_get_rate doesn't always get called in the tx path?

I don't see anything else either. Maybe ath5k uses the driver area already, or something, since in the skb TX info a lot of stuff is unioned?

Comment 16 Roy 2010-09-06 22:18:21 UTC

(In reply to comment #4)
> Yeah, I hate this bug (but haven't seen it in a while).
> 
> What rate controller are you using?  If minstrel, looking at rc_stats under
> /sys/debug/... might be illustrative.

Sorry for the delay. Lectures have started again, making my free time shrink to subzero.
Could you give me more info on finding the exact rc_stats "file", and finding the right rate controller. I'm hardly familiar with wifi drivers :-). And should I do this from within crash (or gdb... and how?), to get info from the crashed kernel, or just from a regular session?
Also it might have disappeared with the latest and greatest (2.6.34.6-47.fc13), which is one of the more frustrating parts of these kind of bugs. It might be there, but I cannot reproduce it in a sensible way.

Comment 17 Bob Copeland 2010-09-07 11:20:44 UTC

The rc_stats file previously posted looked ok, so posting that's probably not necessary.  I think using kmemcheck on the crashing kernel would be more illustrative.  To do that, though, I think you have to compile your own kernel.

John, what about slub debugging: is that enabled in fedora kernels?

My guess is this is some kind of memory corruption.  The only parts of txinfo struct that ath5k uses are flags and the status struct.

Comment 18 John W. Linville 2010-09-07 13:31:20 UTC

Yes, CONFIG_SLUB_DEBUG=y in Fedora kernels.

Comment 19 Chuck Ebbert 2010-09-08 18:40:20 UTC

Slub debug is there but it's not enabled by default, you need to add a slub_debug=<option> to the kernel command line. I'm not sure what options you'd want there though.

Comment 20 John W. Linville 2010-09-15 18:10:51 UTC

Bob, any thoughts on what we should look at w/ slub debug?

Comment 21 Bob Copeland 2010-09-15 18:24:00 UTC

Well I guess most memory corruptions by walking over the sampling array would show up as red zoning or poison overwritten unless it just crashes outright, so something like slub_debug=FPZ ?

Comment 22 John W. Linville 2011-01-13 19:10:56 UTC

Hmmmm...been awhile.  Is this still a problem with currently updated Fedora kernels?

Comment 23 Holger Arnold 2011-01-14 06:49:44 UTC

For me, this problem became less and less frequent during the development of 2.6.33 and seems to have disappeared with some later 2.6.34 kernel (I have no Fedora system available at the moment, though).

Comment 24 John W. Linville 2011-01-17 18:00:16 UTC

OK, thanks for the update...closing on the basis of comment 23...