Bug 731672
Summary: (rt2x00) Kernel 2.6.40.3-0.fc15.x86-64 causes kernel panic
Product: [Fedora] Fedora
Component: kernel
Version: 15
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Dirk Foerster <dirk.foerster>
Assignee: Stanislaw Gruszka <sgruszka>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: aquini, gansalmon, gareth.k.jones, gwingerde, ibmalone, itamar, ivdoorn, james, jl-icase, jonathan, kernel-maint, madhu.chinakonda, ribenakid, sgruszka, siemons, tomastrnka, urilabob
Fixed In Version: 2.6.41.4-1.fc15
Doc Type: Bug Fix
Last Closed: 2012-01-11 15:04:16 UTC

Description
Dirk Foerster
2011-08-18 10:13:42 UTC
I also see this or similar behaviour with both 2.6.40 kernel updates, but not 2.6.38. Often the system doesn't drop to text mode, or the screen is scrambled with panic text overlaid. I also get a panic if I log in in text mode, but it usually takes longer.

The most common panic is "BUG: unable to handle kernel NULL pointer dereference", although sometimes it's "Not syncing". /var/log/messages contains over 300 MiB of such text:

"BUG: scheduling while atomic: swapper"
"bad: scheduling from the idle thread!"

My system: Fedora 15 x86-64, Intel Core i7 940, Asus P6T Deluxe motherboard, NVIDIA 9800 GT graphics with default up-to-date nouveau driver, untainted kernel.

(Screen-shots and log extracts available on request.)

Gareth

Probably related to bug 732008, which has a crash screen picture (that bug occurs immediately on boot, within one second).

> (Screen-shots and log extracts available on request.)
We really need only the first one from each session.
A lot of the time an oops or similar warning will occur, and then the kernel state is so messed up that 'follow-up' oopses happen afterwards that aren't really useful.
Created attachment 519736 [details]
Extract from /var/log/messages
Extract from /var/log/messages, starting just before the first kernel oops (as far as I can tell). This and similar errors then repeat for 350 MiB.
Created attachment 519737 [details]
Kernel message as on-screen
This is the sort of thing I see on the screen when the error happens - if I'm lucky enough to see anything on-screen at all: often it just freezes.
Had the same troubles. My temporary solution: Install "Kernel 2.6.40.3-0.fc15.x86-64.debug", e.g. by using yumextender. It worked. No idea why though ...

There's a lot going on here, but my gut is telling me this is related to the wireless driver (if only because it's the only prominent thing in the stack traces).

Maybe John has some clues..

Dirk, does this look like the same thing you're seeing?

That sounds reasonable. I've noticed a single panic with 2.6.38 which seemed to be rt2500-related. I'll attach my lspci -vvxxx output in case it helps. If there's a general feeling that this is related to the rt2500 driver I can ask on the FedoraForums to see if anyone else is having similar problems.

Gareth

Created attachment 521064 [details]
lspci -vvxxx output
The rt2500 hardware is near the end of the file.
Looks like it could relate to power saving mode. You might try this:

iw dev wlan0 set power_save off

Does that change the issue?

I added "iw ..." to my /etc/rc.local (I didn't want to bet on having time to type it manually). So far, so good - 25 mins uptime, which is *considerably* more than I've ever managed with 2.6.40 before. It's pub time for me now, but I'll report back when I've got some serious usage out of it tomorrow.

Gareth

Okay, just to confirm, the suggested iw work-around makes 2.6.40 usable for me.

(In reply to comment #7)
> there's a lot going on here, but my gut is telling me this is related to the
> wireless driver (if only because it's the only prominent thing in the stack
> traces).
>
> Maybe John has some clues..
>
> Dirk, does this look like the same thing you're seeing ?

I've taken some photos of the kernel panic message, since the system freezes and doesn't allow for screenshots.

Created attachment 521389 [details]
Photos of kernel panic message after system freeze
Dirk and Roland, are you using the rt2500 wireless driver (or similar)? If so, does John's command stop the kernel panics?

Also take a look at your /var/log/messages, as what appears on screen is probably the end result of the problem, not necessarily the start.

That'd be the easiest way to see if you're affected by the same bug as me, whether it's more complicated, or even multiple different bugs.

Gareth

(In reply to comment #15)
> Dirk and Roland, are you using the rt2500 wireless driver (or similar)? If so,
> does John's command stop the kernel panics?
>
> Also take a look at your /var/log/messages, as what appears on screen is
> probably the end result of the problem, not necessarily the start.
>
> That'd be the easiest way to see if you're affected by the same bug as me,
> whether it's more complicated, or even multiple different bugs.
>
> Gareth

Using rt2500pci and rt2x00lib. John's command seems to stop the kernel panic. The system has been up for one hour now. Without

iw dev wlan0 set power_save off

the panic occurred after a couple of minutes. Regarding /var/log/messages, I'm not sure which part might be of interest now.

Okay, ditto, I think that's enough to be sure we're seeing the same bug.

(In reply to comment #15)
> Dirk and Roland, are you using the rt2500 wireless driver (or similar)? If so,
> does John's command stop the kernel panics?
>
> Also take a look at your /var/log/messages, as what appears on screen is
> probably the end result of the problem, not necessarily the start.
>
> That'd be the easiest way to see if you're affected by the same bug as me,
> whether it's more complicated, or even multiple different bugs.
>
> Gareth

Gareth,

That PC has no wireless. Therefore, I suppose it has no wireless drivers installed. Or could it?

Further, I cannot read the screen that fast. If you need exact info about what happens, please tell me which file to read. And also how to access it, because ... for a while now Fedora has been changed so that I can no longer operate as root in graphical mode. As a dilettante, I find it very difficult to keep this operating system in the air now. Would you have any advice about this?

Roland

Hello,

I'm seeing the rt2500pci powersave-related panic reproducibly (the kernel panics a few seconds after enabling powersave). I've got a core via kdump on 2.6.40.6 (can provide the whole 2 GiB file if anyone's interested). I'm attaching the panic log, but the most important part is:

[ 853.871216] [<ffffffff810620a6>] msleep+0x1b/0x22
[ 853.871230] [<ffffffffa0314423>] rt2500pci_set_device_state+0x840/0x8a0 [rt2500pci]
[ 853.871241] [<ffffffffa0314ac7>] rt2500pci_config+0x297/0x2bd [rt2500pci]
[ 853.871257] [<ffffffffa02c319e>] rt2x00lib_config+0x144/0x22a [rt2x00lib]
[ 853.871269] [<ffffffffa02c1355>] rt2x00lib_rxdone+0x2a9/0x37b [rt2x00lib]
[ 853.871280] [<ffffffffa02d455e>] rt2x00pci_rxdone+0x76/0x8b [rt2x00pci]
[ 853.871290] [<ffffffffa03144b3>] rt2500pci_rxdone_tasklet+0x14/0x59 [rt2500pci]

Apparently, rt2500pci_set_device_state checks in a loop whether the requested state change has succeeded, and doesn't care that it could be called from rt2500pci_rxdone_tasklet (after receiving a beacon, the tasklet orders a return to powersaving mode if there's no traffic queued):

rt2500pci.c:1212
	/*
	 * Device is not guaranteed to be in the requested state yet.
	 * We must wait until the register indicates that the
	 * device has entered the correct state.
	 */
	for (i = 0; i < REGISTER_BUSY_COUNT; i++) {
		rt2x00pci_register_read(rt2x00dev, PWRCSR1, &reg2);
		bbp_state = rt2x00_get_field32(reg2, PWRCSR1_BBP_CURR_STATE);
		rf_state = rt2x00_get_field32(reg2, PWRCSR1_RF_CURR_STATE);
		if (bbp_state == state && rf_state == state)
			return 0;
		rt2x00pci_register_write(rt2x00dev, PWRCSR1, reg);
		msleep(10);
	}

So powersave works only for people with devices fast enough to switch state instantly (before the CPU gets to the inner if check). Everyone else steps on the msleep and explodes in softirq context.

"Quick fix": Either drop the msleep() and let it spin a bit, or check whether we are in interrupt context and completely skip the loop in that case.

Tip: Since rc.local is sooo pre-systemd era (and putting powersave off there is not reliable either, since the wifi could come up and panic before rc.local is executed), the best approach is to add a simple rule to /etc/udev/rules.d to have powersaving off from the very start:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="rt2500pci", KERNEL=="wlan*", RUN="/sbin/iw $name set power_save off"

Created attachment 528933 [details]
Panic log including subsequent fallout
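
For readers who want to see the "quick fix" idea outside the kernel, here is a minimal userspace sketch in C. It is not the driver code and not the patch that was eventually applied: `fake_dev`, `state_reached`, `wait_for_state` and `BUSY_COUNT` are hypothetical names invented for illustration, and the "spin" and "sleep" steps only bump counters where a real driver would busy-wait or call msleep(). The point it tries to show is that the polling loop picks its delay based on whether the caller is allowed to sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BUSY_COUNT 100 /* assumed retry limit, not the driver's real value */

/* Hypothetical stand-in for the hardware: the requested state is only
 * reported after `ready_after` polls. */
struct fake_dev {
    int ready_after; /* polls needed before the state change "lands" */
    int polls;       /* how often the state register was read */
    int spins;       /* busy-wait delays taken (safe in atomic context) */
    int sleeps;      /* sleeping delays taken (process context only) */
};

/* Simulated register read: true once the device reached the state. */
static bool state_reached(struct fake_dev *dev)
{
    return dev->polls++ >= dev->ready_after;
}

/* Poll until the state is reached. When `atomic` is true (think tasklet
 * or softirq), only busy-wait; otherwise sleeping is acceptable.
 * Returns 0 on success and -1 if the retry limit is exhausted. */
static int wait_for_state(struct fake_dev *dev, bool atomic)
{
    assert(dev != NULL);
    for (int i = 0; i < BUSY_COUNT; i++) {
        if (state_reached(dev))
            return 0;
        if (atomic)
            dev->spins++;  /* a real driver would busy-wait briefly here */
        else
            dev->sleeps++; /* a real driver could call msleep() here */
    }
    return -1;
}
```

Calling `wait_for_state(&dev, true)` on a device that needs a few polls finishes without ever taking the sleeping branch, which is the property the tasklet path needs. The cost is that a slow device burns CPU while spinning, so this is a trade-off rather than a clean solution.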
Ivo, any thoughts on comment 19?

I think I'm encountering this in F16 on a laptop, running off the Live USB with kernel 3.1.0-0.rc6.git0.3.fc16.i686. Connect to a wireless network, enter the password, and it panics. Disabling power management via iw seems to do the trick... I'll attach the machine's particulars (backtraces look similar) if anyone thinks they'll be of use.

(In reply to comment #21)
> Ivo, any thoughts on comment 19?

Sounds like a very valid point, I guess it was introduced when we were moving the interrupts from process to IRQ context back and forth.. :(

Helmut is working (or plans to work) on that:
http://marc.info/?l=linux-wireless&m=131702522217100&w=2

Created attachment 531046 [details]
Proposed patch
Please check if the attached patch fixes the issue for you.
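
The patch itself is an attachment and is not quoted in this report, so the following is only a general illustration and makes no claim about what the attachment does. One common way to keep sleeping code out of a tasklet is to have the atomic path merely record a request and let code running in process context carry it out later. A minimal userspace sketch in C, with all names (`fake_dev`, `rx_tasklet_request_sleep`, `worker_apply_sleep`) being hypothetical and not taken from rt2x00:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device bookkeeping; none of these fields exist in rt2x00. */
struct fake_dev {
    bool sleep_requested; /* set from the simulated atomic context */
    bool asleep;          /* changed only from simulated process context */
    int  slept_ms;        /* time the worker was allowed to sleep */
};

/* Simulated RX tasklet path: only record the request, never wait. */
static void rx_tasklet_request_sleep(struct fake_dev *dev)
{
    assert(dev != NULL);
    dev->sleep_requested = true;
}

/* Simulated worker in process context, where sleeping is allowed.
 * Returns true if a pending request was handled. */
static bool worker_apply_sleep(struct fake_dev *dev)
{
    assert(dev != NULL);
    if (!dev->sleep_requested)
        return false;
    dev->sleep_requested = false;
    dev->slept_ms += 10; /* stands in for msleep(10) while polling */
    dev->asleep = true;
    return true;
}
```

In this sketch the device state only changes once the worker has run, so the tasklet path never reaches anything that might sleep. A real kernel driver would need an actual deferral mechanism and proper synchronisation, both of which are left out here.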
(In reply to comment #25)
> Created attachment 531046 [details]
> Proposed patch
>
> Please check if the attached patch fixes the issue for you.

How would I do that?

I'll prepare a kernel build with the patch.

Here is a kernel build with the patch from comment 25; please test it when it finishes compiling:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3478051

Thanks for that, I ran out of space when trying to do my own kernel build (9GB not enough!)

kernel-2.6.40.6-0.fc15.x86_64 (current stable, I think) has this crash for me when not using set power_save off.
kernel-2.6.40.8-3.bz731672.fc15.x86_64 hasn't crashed yet at ~ 1/2 hour uptime.

(In reply to comment #25)
> Created attachment 531046 [details]
> Proposed patch
>
> Please check if the attached patch fixes the issue for you.

Indeed it does, thanks.

kernel-2.6.40.8-2.fc15.x86_64 - crashes within a few seconds after enabling powersave
kernel-2.6.40.8-2.rhbz731672.fc15.x86_64 - can't reproduce the crash, power management works as expected now (judging from the "STA will sleep" flag being set in transmitted frames)

So, when will we see this patch upstream? :-)

As soon as I have returned from my travels I will submit the patch upstream (sorry, travels got in the way) ;-)

I'm currently updating to F16, but I'll give the patch a test when I get a chance (if it isn't already included by then!).

I'm not sure this will require a new bug, but while I no longer see this crash, I do get disconnected or get a very slow connection to the AP. NetworkManager keeps asking me for a password; using iw $name set power_save off seems to prevent it. Could it be a separate bug which has been uncovered by fixing the power management?

(In reply to comment #34)
> I'm not sure this will require a new bug, but while I no longer see this crash
> I do get disconnected or very slow connection to the AP. NetworkManager keeps
> asking me for a password, using the iw $name set power_save off seems to
> prevent it. Could be a separate bug which has been uncovered by fixing the
> power management?

Yes, there probably is some powersave-related bug in there somewhere (I've hit it twice over the last two weeks with PS enabled) - basically the device locks up and requires resetting via modprobe -r rt2500pci; modprobe rt2500pci.

Anyway, that's nowhere near as severe as this panic, so I'd recommend opening a separate bug for that (otherwise I'll do that as soon as the patch for this one is applied to the Fedora kernel or upstream, so that the bug dependencies don't get too confusing).

Thanks, just wanted to make sure it wasn't likely to be directly related.
https://bugzilla.redhat.com/show_bug.cgi?id=753648

Confirming this is fixed on F15 by kernel-2.6.41.4-1.fc15.x86_64 (now in testing).

Fixed upstream by commit ed66ba472a742cd8df37d7072804b2111cdb1014 (starting with 3.1.3).

*** Bug 728186 has been marked as a duplicate of this bug. ***

Confirming fixed in 2.6.41.4-1.fc15.i686.PAE also for bug 728186 (so it really was a duplicate even though symptoms were somewhat different).

Confirming that this is fixed in F16 by the equivalent kernel update, but I'm also seeing the disconnecting behaviour described by Ian above.