Red Hat Bugzilla – Full Text Bug Listing
|Summary:||Kernel panic on ath5k|
|Product:||[Fedora] Fedora||Reporter:||Roy <seven>|
|Component:||kernel||Assignee:||John W. Linville <linville>|
|Status:||CLOSED CURRENTRELEASE||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Version:||13||CC:||anton, dougsland, gansalmon, harnold, itamar, johannes, jonathan, kernel-maint, madhu.chinakonda, me, seandarcy|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2011-01-17 13:00:16 EST||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
Description Roy 2010-08-20 13:36:45 EDT
Created attachment 440000 [details] Crashdump backtrace for described panic Description of problem: Kernel starts panic'ing semi-randomly when logged in. Occurs once or twice a day. Version-Release number of selected component (if applicable): Kernel 18.104.22.168-147.2.4.fc13.i686 How reproducible: Use the system as a work station. Steps to Reproduce: 1. Boot 2. Log-in 3. Be connected to a network Actual results: Kernel panic at random intervals Expected results: No kernel panic at random intervals Additional info: Ethernet controller: 3Com Corporation AR5212 802.11abg NIC (3CRDAG675) (rev 01) AMD Athlon XP 2800+ with 1,5GB ram.
Comment 1 Chuck Ebbert 2010-08-22 10:12:34 EDT
Three warnings, then an oops: WARNING: at net/mac80211/tx.c:644 invoke_tx_handlers+0x6a1/0xa96 [mac80211]() WARNING: at net/mac80211/tx.c:57 ieee80211_duration+0x49/0x1ab [mac80211]() WARNING: at include/net/mac80211.h:1059 ath5k_tx_queue+0x1e9/0x501 [ath5k]() BUG: unable to handle kernel NULL pointer dereference at 00000006 IP: [<fa596570>] ath5k_tx_queue+0x22c/0x501 [ath5k] drivers/net/wireless/ath/ath5k/base.c:1302: hw_rate = (rc_flags & IEEE80211_TX_RC_USE_SHORT_PREAMBLE) ? rate->hw_value_short : rate->hw_value; rate is NULL
Comment 2 John W. Linville 2010-08-24 15:27:58 EDT
Created attachment 440744 [details] 0001-ath5k-check-return-value-of-ieee80211_get_tx_rate.patch At least avoid the NULL dereference at the end...
Comment 3 John W. Linville 2010-08-24 15:42:13 EDT
Test kernel w/ above patch building here: http://koji.fedoraproject.org/koji/taskinfo?taskID=2424195 Does this avoid the NULL pointer dereference? Unfortunately the other warnings will still occur...
Comment 4 Bob Copeland 2010-08-24 17:31:37 EDT
Yeah, I hate this bug (but haven't seen it in a while). What rate controller are you using? If minstrel, looking at rc_stats under /sys/debug/... might be illustrative.
Comment 5 Holger Arnold 2010-08-29 11:01:54 EDT
I have the same problem on my machine (IBM Thinkpad T41p, same WLAN hardware as Roy). Although the oopses seem to occur at random intervals, their probability increases when data is transmitted over multiple "connections". For example, I can trigger this bug with non-negligible probability by downloading large files via bittorrent or by opening many web pages at the same time. This bug seems to be a regression. I never had this problem with kernels up to 2.6.31.*; it started to occur with 2.6.32 (Ubuntu 10.04) and 2.6.33 (Fedora 13). Is there any further information I could provide?
Comment 6 Bob Copeland 2010-08-29 12:52:08 EDT
Yes, provide the aforementioned rc_stats file. I can't remember: in which kernel did minstrel become default?
Comment 7 John W. Linville 2010-08-30 14:13:25 EDT
Comment 8 sean 2010-08-30 18:38:27 EDT
I've have had this or something very similar since 2.6.33-6. Aug 30 15:16:09 daddy kernel: irq 17: nobody cared (try booting with the "irqpoll" option) Aug 30 15:16:09 daddy kernel: Pid: 2996, comm: npviewer.bin Not tainted 22.214.171.124-149.fc13.i686.PAE #1 Aug 30 15:16:09 daddy kernel: Call Trace: Aug 30 15:16:09 daddy kernel: [<c0480d06>] __report_bad_irq+0x2e/0x6f Aug 30 15:16:09 daddy kernel: [<c0480e3c>] note_interrupt+0xf5/0x14d Aug 30 15:16:09 daddy kernel: [<c04813ef>] handle_fasteoi_irq+0x85/0xa4 Aug 30 15:16:09 daddy kernel: [<c040a33b>] handle_irq+0x3b/0x48 Aug 30 15:16:09 daddy kernel: [<c0409bc0>] do_IRQ+0x41/0x9a Aug 30 15:16:09 daddy kernel: [<c0408df0>] common_interrupt+0x30/0x38 Aug 30 15:16:09 daddy kernel: [<c0780000>] ? profile_cpu_callback+0xe7/0x192 Aug 30 15:16:09 daddy kernel: handlers: Aug 30 15:16:09 daddy kernel: [<c0687c63>] (usb_hcd_irq+0x0/0x6a) Aug 30 15:16:09 daddy kernel: [<c0687c63>] (usb_hcd_irq+0x0/0x6a) Aug 30 15:16:09 daddy kernel: [<f8a355af>] (ath5k_intr+0x0/0x1d5 [ath5k]) Aug 30 15:16:09 daddy kernel: Disabling IRQ #17 Aug 30 15:16:26 daddy kernel: ath5k phy0: failed to wakeup the MAC Chip Aug 30 15:16:26 daddy kernel: ath5k phy0: can't reset hardware (-5) ...... Sometimes the NM can restart the chip, but often it leads to a kernel panic. Should I open a new bug? If not, see the attached syslog where it leads to a kernel panic.
Comment 9 sean 2010-08-30 18:42:05 EDT
Created attachment 442054 [details] syslog excerpt: kernel panic after irq 17 disabled
Comment 10 John W. Linville 2010-08-31 11:20:01 EDT
sean, you seem to have a lot of problems in that syslog. I don't think the problem is from ath5k. It looks like your USB controller (or possibly something else) is raising an interrupt that no one knows how to handle. In any case, I think you should file a separate bug.
Comment 11 Holger Arnold 2010-08-31 14:58:24 EDT
Created attachment 442252 [details] sample rc_stats file of minstrel rate controller The rate control algorithm is minstrel. I attached a sample rc_stats file, but I don't think it gives a hint to the cause of the bug. If I understand it correctly, the NULL ptr dereference is a consequence of the array of possible tx rates being empty (= element at index 0 is -1). Is there a situation where this would actually be a valid result?
Comment 12 Holger Arnold 2010-08-31 15:15:19 EDT
I noticed that there are times when this error seems to occur with higher probability than at other times (e.g., 3 times per hour vs. once per day). A possible cause for this behaviour could be the different (number of) other WLANs present at different times. Besides, the problem seems to occur less frequently with the latest kernel in F13 (126.96.36.199-149.fc13.i686) than with the kernel from the F13 release (188.8.131.52-85.fc13.i686). (I don't know how reliable these observations are, the problem occurs not frequently enough to collect sufficient data)
Comment 13 John W. Linville 2010-09-01 15:41:20 EDT
AFAIK, tx rate being "empty" is only valid for hardware that does its own rate scaling. That is not the case for ath5k. rate_control_get_rate initializes those idx values to -1, then exits if IEEE80211_HW_HAS_RATE_CONTROL is set. The rate control get_rate method is called immediately after that. There is some filtering and altering of rates to account for unsupported rates having been selected after that. That would seem like an obvious suspect, except that the end of the function does this: BUG_ON(info->control.rates.idx < 0); That is essentially the same check that was being hit in ieee80211_get_tx_rate. So unless I'm missing something, one of the following would have to be true: -- IEEE80211_HW_HAS_RATE_CONTROL is sometimes set for ath5k (perhaps from memory corruption) -- something can corrupt tx_info->control.rates.idx before ieee80211_get_tx_rate is called -- rate_control_get_rate doesn't always get called in the tx path? -- I'm really missing something?? Copying Johannes in case this provokes his brilliance... :-)
Comment 14 Bob Copeland 2010-09-01 19:11:01 EDT
It reminds me of this one, in that symptom was the same and it only happened if certain environmental circumstances were true, but of course this particular bug should be fixed. Maybe a run with kmemcheck would be useful. commit 5ee58d7e6ad019675b4090582aec4fa1180d8703 Author: Bob Copeland <email@example.com> Date: Fri Jun 5 08:21:50 2009 -0400 mac80211: fix minstrel single-rate memory corruption The minstrel rate controller periodically looks up rate indexes in a sampling table. When accessing a specific row and column, minstrel correctly does a bounds check which, on the surface, appears to handle the case where mi->n_rates < 2. However, mi->sample_idx is actually defined as an unsigned, so the right hand side is taken to be a huge positive number when negative, and the check will always fail. Consequently, the RC will overrun the array and cause random memory corruption when communicating with a peer that has only a single rate. The max value of mi->sample_idx is around 25 so casting to int should have no ill effects. Without the change, uptime is a few minutes under load with an AP that has a single hard-coded rate, and both the AP and STA could potentially crash. With the change, both lasted 12 hours with a steady load. Thanks to Ognjen Maric for providing the single-rate clue so I could reproduce this. This fixes http://bugzilla.kernel.org/show_bug.cgi?id=12490 on the regression list (also http://bugzilla.kernel.org/show_bug.cgi?id=13000). Cc: firstname.lastname@example.org Reported-by: Sergey S. Kostyliov <email@example.com> Reported-by: Ognjen Maric <firstname.lastname@example.org> Signed-off-by: Bob Copeland <email@example.com> Signed-off-by: John W. Linville <firstname.lastname@example.org>
Comment 15 Johannes Berg 2010-09-02 03:28:22 EDT
(In reply to comment #13) > AFAIK, tx rate being "empty" is only valid for hardware that does its own rate > scaling. That is not the case for ath5k. Right. > rate_control_get_rate initializes those idx values to -1, then exits if > IEEE80211_HW_HAS_RATE_CONTROL is set. In fact, it won't even get called when that is set (except for beacons, oddly) > The rate control get_rate method is > called immediately after that. There is some filtering and altering of rates > to account for unsupported rates having been selected after that. That would > seem like an obvious suspect, except that the end of the function does this: > > BUG_ON(info->control.rates.idx < 0); Yeah, that'd be a bug in the rate control algorithm code. > That is essentially the same check that was being hit in ieee80211_get_tx_rate. > So unless I'm missing something, one of the following would have to be true: > > -- IEEE80211_HW_HAS_RATE_CONTROL is sometimes set for ath5k > (perhaps from memory corruption) > > -- something can corrupt tx_info->control.rates.idx before > ieee80211_get_tx_rate is called > > -- rate_control_get_rate doesn't always get called in the tx path? I don't see anything else either. Maybe ath5k uses the driver area already, or something, since in the skb TX info a lot of stuff is unioned?
Comment 16 Roy 2010-09-06 18:18:21 EDT
(In reply to comment #4) > Yeah, I hate this bug (but haven't seen it in a while). > > What rate controller are you using? If minstrel, looking at rc_stats under > /sys/debug/... might be illustrative. Sorry for the delay. Lectures have started again, making my free time shrink to subzero. Could you give me more info on finding the exact rc_stats "file", and finding the right rate controller. I'm hardly familiar with wifi drivers :-). And should I do this from within crash (or gdb... and how?), to get info from the crashed kernel, or just from a regular session? Also it might have disappeared with the latest and greatest (184.108.40.206-47.fc13), which is one of the more frustrating parts of these kind of bugs. It might be there, but I cannot reproduce it in a sensible way.
Comment 17 Bob Copeland 2010-09-07 07:20:44 EDT
The rc_stats file previously posted looked ok, so posting that's probably not necessary. I think using kmemcheck on the crashing kernel would be more illustrative. To do that, though, I think you have to compile your own kernel. John, what about slub debugging: is that enabled in fedora kernels? My guess is this is some kind of memory corruption. The only parts of txinfo struct that ath5k uses are flags and the status struct.
Comment 18 John W. Linville 2010-09-07 09:31:20 EDT
Yes, CONFIG_SLUB_DEBUG=y in Fedora kernels.
Comment 19 Chuck Ebbert 2010-09-08 14:40:20 EDT
Slub debug is there but it's not enabled by default, you need to add a slub_debug=<option> to the kernel command line. I'm not sure what options you'd want there though.
Comment 20 John W. Linville 2010-09-15 14:10:51 EDT
Bob, any thoughts on what we should look at w/ slub debug?
Comment 21 Bob Copeland 2010-09-15 14:24:00 EDT
Well I guess most memory corruptions by walking over the sampling array would show up as red zoning or poison overwritten unless it just crashes outright, so something like slub_debug=FPZ ?
Comment 22 John W. Linville 2011-01-13 14:10:56 EST
Hmmmm...been awhile. Is this still a problem with currently updated Fedora kernels?
Comment 23 Holger Arnold 2011-01-14 01:49:44 EST
For me, this problem became less and less frequent during the development of 2.6.33 and seems to have disappeared with some later 2.6.34 kernel (I have no Fedora system available at the moment, though).