Bug 731672

Summary: (rt2x00) Kernel 2.6.40.3-0.fc15.x86-64 causes kernel panic
Product: [Fedora] Fedora
Reporter: Dirk Foerster <dirk.foerster>
Component: kernel
Assignee: Stanislaw Gruszka <sgruszka>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: unspecified
Version: 15
CC: aquini, gansalmon, gareth.k.jones, gwingerde, ibmalone, itamar, ivdoorn, james, jl-icase, jonathan, kernel-maint, madhu.chinakonda, ribenakid, sgruszka, siemons, tomastrnka, urilabob
Hardware: x86_64
OS: Linux
Fixed In Version: 2.6.41.4-1.fc15
Doc Type: Bug Fix
Last Closed: 2012-01-11 15:04:16 UTC
Attachments:
  Extract from /var/log/messages (attachment 519736)
  Kernel message as on-screen (attachment 519737)
  lspci -vvxxx output (attachment 521064)
  Photos of kernel panic message after system freeze (attachment 521389)
  Panic log including subsequent fallout (attachment 528933)
  Proposed patch (attachment 531046)

Description Dirk Foerster 2011-08-18 10:13:42 UTC
Description of problem:
Kernel 2.6.40.3-0.fc15.x86_64 causes a kernel panic: the display switches back to text mode and the system freezes.

Version-Release number of selected component (if applicable):
Kernel 2.6.40.3-0.fc15.x86_64

How reproducible:
Updated kernel and used Firefox when crash occurred

Steps to Reproduce:
1. Update the kernel
2. Use Firefox
  
Actual results:
Kernel panic, switch back to text mode and freeze of system

Expected results:
Normal system performance

Additional info:
Previous kernel version caused kernel panic, too.
Now using kernel 2.6.38.8-35.fc15.x86_64, which runs OK.

Comment 1 Gareth Jones 2011-08-18 20:53:33 UTC
I also see this or similar behaviour with both 2.6.40 kernel updates, but not 2.6.38.

Often the system doesn't drop to text mode, or the screen is scrambled with panic text overlaid.  I also get a panic if I log in in text mode, but it usually takes longer.

The most common panic is "BUG: unable to handle kernel NULL pointer dereference", although sometimes it's "Not syncing".

/var/log/messages contains over 300 MiB of such text:
"BUG: scheduling while atomic: swapper"
"bad: scheduling from the idle thread!"

My system: Fedora 15 x86-64, Intel Core i7 940, Asus P6T Deluxe motherboard, NVIDIA 9800 GT graphics with default up-to-date nouveau driver, untainted kernel.

(Screen-shots and log extracts available on request.)

Gareth

Comment 2 josip@icase.edu 2011-08-19 12:59:03 UTC
Probably related to bug 732008 which has a crash screen picture (that bug occurs immediately on boot, within one second)

Comment 3 Dave Jones 2011-08-24 19:07:37 UTC
> (Screen-shots and log extracts available on request.)

We really need only the first one from each session.
A lot of the time an oops or similar warning will occur, and then the kernel state is so messed up that 'follow-up' oopses happen afterwards that aren't really useful.
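
A minimal sketch of one way to cut a log down to just its first report, assuming the report starts at the first line containing "BUG:", "Oops" or "Kernel panic". The function name first_oops and the 40-line default are illustrative choices, not anything standard:

```shell
#!/bin/sh
# first_oops LOG [CONTEXT]
# Print the first line of LOG that looks like the start of a kernel error
# report, followed by CONTEXT more lines (default 40). The patterns are an
# assumption about how a report begins; adjust them for your logs.
first_oops() {
    awk -v n="${2:-40}" '
        !found && /BUG:|Oops|Kernel panic/ { found = 1; left = n + 1 }
        left > 0 { print; left-- }
        END { exit !found }      # non-zero exit status if nothing matched
    ' "$1"
}
```

For example, "first_oops /var/log/messages 60 > first-oops.txt" would save the first report plus 60 lines of context; reading /var/log/messages usually needs root.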

Comment 4 Gareth Jones 2011-08-24 22:41:52 UTC
Created attachment 519736 [details]
Extract from /var/log/messages

Extract from /var/log/messages, starting just before the first kernel oops (as far as I can tell).  This and similar errors then repeat for 350 MiB.

Comment 5 Gareth Jones 2011-08-24 22:44:59 UTC
Created attachment 519737 [details]
Kernel message as on-screen

This is the sort of thing I see on the screen when the error happens - if I'm lucky enough to see anything on-screen at all: often it just freezes.

Comment 6 Roland Siemons 2011-08-27 15:59:08 UTC
Had the same troubles. My temporary solution:
Install "Kernel 2.6.40.3-0.fc15.x86-64.debug" e.g. by using yumextender.
It worked. No idea why though ...

Comment 7 Dave Jones 2011-09-01 16:59:43 UTC
there's a lot going on here, but my gut is telling me this is related to the wireless driver (if only because it's the only prominent thing in the stack traces).

Maybe John has some clues..

Dirk, does this look like the same thing you're seeing ?

Comment 8 Gareth Jones 2011-09-01 17:26:03 UTC
That sounds reasonable.  I've noticed a single panic with 2.6.38 which seemed to be rt2500-related.

I'll attach my lspci -vvxxx output in case it helps.

If there's a general feeling that this is related to the rt2500 driver I can ask on the FedoraForums to see if anyone else is having similar problems.

Gareth

Comment 9 Gareth Jones 2011-09-01 17:28:15 UTC
Created attachment 521064 [details]
lspci -vvxxx output

The rt2500 hardware is near the end of the file.

Comment 10 John W. Linville 2011-09-01 17:36:06 UTC
Looks like it could relate to power saving mode.  You might try this:

   iw dev wlan0 set power_save off

Does that change the issue?
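
To see whether the setting actually changed, iw can also report the current state. A small sketch, assuming the interface is called wlan0 (pass another name otherwise); power_save_off is just an illustrative wrapper name:

```shell
#!/bin/sh
# power_save_off [IFACE]
# Show the power save state of IFACE (default: wlan0, an assumption),
# switch power save off, then show the state again.
# Changing the setting needs root.
power_save_off() {
    dev=${1:-wlan0}
    iw dev "$dev" get power_save &&
    iw dev "$dev" set power_save off &&
    iw dev "$dev" get power_save
}
```

The setting is not persistent, so it has to be reapplied after every boot or whenever the interface is re-created.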

Comment 11 Gareth Jones 2011-09-01 18:18:28 UTC
I added "iw ..." to my /etc/rc.local (I didn't want to bet on having time to type it manually).

So far, so good - 25 mins uptime, which is *considerably* more than I've ever managed with 2.6.40 before.

It's pub time for me now, but I'll report back when I've got some serious usage out of it tomorrow.

Gareth

Comment 12 Gareth Jones 2011-09-04 15:24:52 UTC
Okay, just to confirm, the suggested iw work-around makes 2.6.40 usable for me.

Comment 13 Dirk Foerster 2011-09-04 17:05:05 UTC
(In reply to comment #7)
> there's a lot going on here, but my gut is telling me this is related to the
> wireless driver (if only because it's the only prominent thing in the stack
> traces).
> 
> Maybe John has some clues..
> 
> Dirk, does this look like the same thing you're seeing ?

I've taken some photos of the kernel panic message, since the system freezes and doesn't allow screenshots.

Comment 14 Dirk Foerster 2011-09-04 17:16:14 UTC
Created attachment 521389 [details]
Photos of kernel panic message after system freeze

Comment 15 Gareth Jones 2011-09-06 17:57:42 UTC
Dirk and Roland, are you using the rt2500 wireless driver (or similar)?  If so, does John's command stop the kernel panics?
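
One way to check, sketched under the assumption that sysfs is mounted at /sys (the helper name list_net_drivers is made up for this example): list each network interface together with the kernel driver bound to it.

```shell
#!/bin/sh
# list_net_drivers [SYSFS_ROOT]
# Print "interface driver" for every network interface that is backed by a
# device with a bound driver. Virtual interfaces such as lo have no
# device/driver link and are skipped. SYSFS_ROOT defaults to /sys.
list_net_drivers() {
    sysfs=${1:-/sys}
    for iface in "$sysfs"/class/net/*; do
        [ -e "$iface/device/driver" ] || continue
        driver=$(basename "$(readlink "$iface/device/driver")")
        printf '%s %s\n' "$(basename "$iface")" "$driver"
    done
}
```

A wireless interface handled by rt2500pci (or another rt2x00 driver) in that list would suggest the machine may be affected; a machine without wireless hardware should show no such entry.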

Also take a look at your /var/log/messages, as what appears on screen is probably the end result of the problem, not necessarily the start.

That'd be the easiest way to see if you're affected by the same bug as me, whether it's more complicated, or even multiple different bugs.

Gareth

Comment 16 Dirk Foerster 2011-09-07 16:30:01 UTC
(In reply to comment #15)
> Dirk and Roland, are you using the rt2500 wireless driver (or similar)?  If so,
> does John's command stop the kernel panics?
> 
> Also take a look at your /var/log/messages, as what appears on screen is
> probably the end result of the problem, not necessarily the start.
> 
> That'd be the easiest way to see if you're affected by the same bug as me,
> whether it's more complicated, or even multiple different bugs.
> 
> Gareth

I'm using rt2500pci and rt2x00lib.

John's command seems to stop the kernel panic. The system has been up for one hour now. Without

iw dev wlan0 set power_save off

the panic occurred after a couple of minutes.

Regarding /var/log/messages, I'm not sure which part might be of interest now.

Comment 17 Gareth Jones 2011-09-07 17:20:36 UTC
Okay, ditto, I think that's enough to be sure we're seeing the same bug.

Comment 18 Roland Siemons 2011-09-08 08:46:22 UTC
(In reply to comment #15)
> Dirk and Roland, are you using the rt2500 wireless driver (or similar)?  If so,
> does John's command stop the kernel panics?
> 
> Also take a look at your /var/log/messages, as what appears on screen is
> probably the end result of the problem, not necessarily the start.
> 
> That'd be the easiest way to see if you're affected by the same bug as me,
> whether it's more complicated, or even multiple different bugs.
> 
> Gareth

Gareth,
That PC has no wireless. Therefore, I suppose it has no wireless drivers installed. Or could it?

Further, I cannot read the screen so fast. If you need exact info about what happens, please tell me which file to read, and also how to access it, because for a while now Fedora has been changed so that I can no longer operate as root in graphical mode. As a dilettante, I find it very difficult to keep this operating system in the air now. Would you have any advice about this?

Roland

Comment 19 Tomáš Trnka 2011-10-19 06:58:43 UTC
Hello,

I'm seeing the rt2500pci powersave-related panic reproducibly (the kernel panics a few seconds after enabling powersave). I've got a core via kdump on 2.6.40.6 (can provide the whole 2 GiB file if anyone's interested). I'm attaching the panic log, but the most important part is:

[  853.871216]  [<ffffffff810620a6>] msleep+0x1b/0x22
[  853.871230]  [<ffffffffa0314423>] rt2500pci_set_device_state+0x840/0x8a0 [rt2500pci]
[  853.871241]  [<ffffffffa0314ac7>] rt2500pci_config+0x297/0x2bd [rt2500pci]
[  853.871257]  [<ffffffffa02c319e>] rt2x00lib_config+0x144/0x22a [rt2x00lib]
[  853.871269]  [<ffffffffa02c1355>] rt2x00lib_rxdone+0x2a9/0x37b [rt2x00lib]
[  853.871280]  [<ffffffffa02d455e>] rt2x00pci_rxdone+0x76/0x8b [rt2x00pci]
[  853.871290]  [<ffffffffa03144b3>] rt2500pci_rxdone_tasklet+0x14/0x59 [rt2500pci]

Apparently, rt2500pci_set_device_state checks in a loop whether the requested state change has succeeded, and doesn't care that it could be called from rt2500pci_rxdone_tasklet (after receiving a beacon, the tasklet orders a return to powersaving mode if there's no traffic queued):

rt2500pci.c:1212

        /*
         * Device is not guaranteed to be in the requested state yet.
         * We must wait until the register indicates that the
         * device has entered the correct state.
         */
        for (i = 0; i < REGISTER_BUSY_COUNT; i++) {
                rt2x00pci_register_read(rt2x00dev, PWRCSR1, &reg2);
                bbp_state = rt2x00_get_field32(reg2, PWRCSR1_BBP_CURR_STATE);
                rf_state = rt2x00_get_field32(reg2, PWRCSR1_RF_CURR_STATE);
                if (bbp_state == state && rf_state == state)
                        return 0;
                rt2x00pci_register_write(rt2x00dev, PWRCSR1, reg);
                msleep(10);
        }


So powersave works only for people with devices fast enough to switch state instantly (before the CPU gets to the inner if check). Everyone else hits the msleep(), which schedules and therefore must not be called from a tasklet, and explodes in softirq context.

"Quick fix": Either drop the msleep() and let it spin a bit, or check whether we're in interrupt context and skip the loop completely in that case.

Tip: Since rc.local is sooo pre-systemd era (and turning powersave off there is not reliable either, since the wifi could come up and panic before rc.local is executed), the best approach is to add a simple rule to /etc/udev/rules.d to have powersaving off from the very start:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="rt2500pci", KERNEL=="wlan*", RUN="/sbin/iw $name set power_save off"
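
A sketch of installing that rule, assuming a file name of 99-rt2500pci-powersave.rules (any name ending in .rules should do) and a hypothetical helper install_powersave_rule; the rule text itself is the one given above, unchanged:

```shell
#!/bin/sh
# install_powersave_rule [RULES_DIR]
# Write the udev rule into RULES_DIR (default: /etc/udev/rules.d).
# The file name is an arbitrary choice for this example.
install_powersave_rule() {
    dir=${1:-/etc/udev/rules.d}
    # The quoted 'EOF' keeps the shell from expanding $name; udev
    # substitutes it with the interface name when the rule fires.
    cat > "$dir/99-rt2500pci-powersave.rules" <<'EOF'
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="rt2500pci", KERNEL=="wlan*", RUN="/sbin/iw $name set power_save off"
EOF
}
```

Run it as root, then either reboot or reload the rules with "udevadm control --reload-rules"; the rule only takes effect the next time the interface is added.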

Comment 20 Tomáš Trnka 2011-10-19 07:00:19 UTC
Created attachment 528933 [details]
Panic log including subsequent fallout

Comment 21 John W. Linville 2011-10-19 21:14:17 UTC
Ivo, any thoughts on comment 19?

Comment 22 James 2011-10-20 20:13:58 UTC
I think I'm encountering this in F16 on a laptop, running off the Live USB with kernel 3.1.0-0.rc6.git0.3.fc16.i686. Connect to a wireless network, enter the password, and it panics. Disabling power management via iw seems to do the trick... I'll attach the machine's particulars (backtraces look similar) if anyone thinks they'll be of use.

Comment 23 Ivo van Doorn 2011-10-23 20:02:11 UTC
(In reply to comment #21)
> Ivo, any thoughts on comment 19?

Sounds like a very valid point, I guess it was introduced when we were moving the interrupts from process to IRQ context back and forth.. :(

Comment 24 Stanislaw Gruszka 2011-10-24 13:51:35 UTC
Helmut is working (or plans to work) on that:
http://marc.info/?l=linux-wireless&m=131702522217100&w=2

Comment 25 Gertjan van Wingerde 2011-10-31 22:23:33 UTC
Created attachment 531046 [details]
Proposed patch

Please check if the attached patch fixes the issue for you.

Comment 26 Dirk Foerster 2011-11-01 12:36:43 UTC
(In reply to comment #25)
> Created attachment 531046 [details]
> Proposed patch
> 
> Please check if the attached patch fixes the issue for you.

How would I do that?

Comment 27 Stanislaw Gruszka 2011-11-01 12:49:26 UTC
I'll prepare kernel build with the patch.

Comment 28 Stanislaw Gruszka 2011-11-01 15:01:41 UTC
Here is a kernel build with the patch from comment 25; please test it once it finishes compiling:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3478051

Comment 29 Ian Malone 2011-11-02 00:33:44 UTC
Thanks for that, ran out of space when trying to do my own kernel build (9GB not enough!)

kernel-2.6.40.6-0.fc15.x86_64
(Current stable I think) has this crash for me when not using set power_save off.

kernel-2.6.40.8-3.bz731672.fc15.x86_64
Hasn't crashed yet at ~ 1/2hour uptime.

Comment 30 Tomáš Trnka 2011-11-02 07:53:02 UTC
(In reply to comment #25)
> Created attachment 531046 [details]
> Proposed patch
> 
> Please check if the attached patch fixes the issue for you.

Indeed it does, thanks.
kernel-2.6.40.8-2.fc15.x86_64 - crashes within a few seconds after enabling powersave

kernel-2.6.40.8-2.rhbz731672.fc15.x86_64 - can't reproduce the crash, power management works as expected now (judging from the "STA will sleep" flag being set in transmitted frames)

Comment 31 John W. Linville 2011-11-09 18:07:16 UTC
So, when will we see this patch upstream? :-)

Comment 32 Gertjan van Wingerde 2011-11-09 18:17:05 UTC
As soon as I have returned from my travels I will submit the patch upstream (sorry, travel got in the way) ;-)

Comment 33 Gareth Jones 2011-11-09 18:41:09 UTC
I'm currently updating to F16, but I'll give the patch a test when I get a chance (if it isn't already included by then!).

Comment 34 Ian Malone 2011-11-13 10:11:01 UTC
I'm not sure this will require a new bug, but while I no longer see this crash I do get disconnected or very slow connection to the AP. NetworkManager keeps asking me for a password, using the iw $name set power_save off seems to prevent it. Could be a separate bug which has been uncovered by fixing the power management?

Comment 35 Tomáš Trnka 2011-11-13 10:43:48 UTC
(In reply to comment #34)
> I'm not sure this will require a new bug, but while I no longer see this crash
> I do get disconnected or very slow connection to the AP. NetworkManager keeps
> asking me for a password, using the iw $name set power_save off seems to
> prevent it. Could be a separate bug which has been uncovered by fixing the
> power management?

Yes, there probably is some powersave-related bug in there somewhere (I've hit it twice over the last two weeks with PS enabled) - basically the device locks up and requires resetting via modprobe -r rt2500pci; modprobe rt2500pci.

Anyways, that's nowhere near as severe as this panic so I'll recommend opening a separate bug for that (otherwise I'll do that as soon as the patch for this one is applied to Fedora kernel or upstream so that the bug dependencies don't get too confusing).

Comment 36 Ian Malone 2011-11-13 23:37:32 UTC
Thanks, just wanted to make sure it wasn't likely to be directly related. https://bugzilla.redhat.com/show_bug.cgi?id=753648

Comment 37 Tomáš Trnka 2011-11-30 16:40:44 UTC
Confirming this is fixed on F15 by kernel-2.6.41.4-1.fc15.x86_64 (now in testing). 
Fixed upstream by commit ed66ba472a742cd8df37d7072804b2111cdb1014 (starting with 3.1.3).

Comment 38 bob mckay 2011-12-05 05:32:18 UTC
*** Bug 728186 has been marked as a duplicate of this bug. ***

Comment 39 bob mckay 2011-12-12 04:49:43 UTC
Confirming fixed in 2.6.41.4-1.fc15.i686.PAE also for bug 728186 (so it really was a duplicate even though symptoms were somewhat different).

Comment 40 Gareth Jones 2011-12-15 17:57:08 UTC
Confirming that this is fixed in F16 by the equivalent kernel update, but I'm also seeing the disconnecting behaviour described by Ian above.