Bug 470551

Summary:

powernow-k8 causing SIGSEGV during the boot

Product:

[Fedora] Fedora

Reporter:

Martin Klapetek <martin.klapetek>

Component:

kernel

Assignee:

Bhavna Sarathy <bnagendr>

Status:

CLOSED WONTFIX

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

medium

Version:

CC:

jfeeney, kernel-maint, peterm, rhbugzilla

Target Milestone:

---

Keywords:

Triaged

Target Release:

---

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-12-18 06:46:59 UTC

Attachments:

Description	Flags
Captured dmesg output after reboot after the crash	none
Captured /var/log/messages output after reboot after the crash	none
lsmod output	none
Backtrace screen photo	none

Description Martin Klapetek 2008-11-07 16:43:03 UTC

Description of problem:
Sometimes (about 1 in 3 restarts) the boot process stops with this message (translated from the Czech localization):
Beginning non-interactive setup
/etc/rc5.d/S06cpuspeed: line 112: 1838 Unauthorized memory access (SIGSEGV)   /sbin/modprobe powernow-k8 2>/dev/null

Version-Release number of selected component (if applicable):
I'm using kernel 2.6.27.4-79.fc10.x86_64

How reproducible:
The SIGSEGV occurs almost every time I shut the machine off manually with the power button and then power it on again after a while. It also sometimes happens after a normal system reboot.

Steps to Reproduce:
1. Shut down a fully loaded system (even from kdm)
2. Power on after a while
3. Here goes the SIGSEGV
  
Actual results:
The boot stops

Expected results:
The boot completes without a SIGSEGV

Additional info:
I'm running Fedora 10 Preview on AMD Turion X2 64bit 2GHz (puma platform - RM-70)

Comment 1 Dave Jones 2008-11-07 17:22:57 UTC

the next time it happens, can you capture the output of dmesg afterwards, and attach that please?

Comment 2 Martin Klapetek 2008-11-09 09:37:27 UTC

Okay, I have the output of dmesg, but there's nothing in it about the SIGSEGV. I captured it after a reboot, though, because the system hangs when the crash happens, so I cannot capture it right away. The interesting thing is that there's nothing in /var/log/messages either (also attached): not a single line about that boot. Is there any other log file where this might be logged? /var/log/messages contains entries from 00:34, and the next entries are from 10:02, but the boot that crashed was before 10:00, so you can see there's no line about it.

Any suggestions how to capture that output or where to look?

Comment 3 Martin Klapetek 2008-11-09 09:38:45 UTC

Created attachment 322995 [details]
Captured dmesg output after reboot after the crash

Comment 4 Martin Klapetek 2008-11-09 09:39:48 UTC

Created attachment 322996 [details]
Captured /var/log/messages output after reboot after the crash

Comment 5 Dave Jones 2008-11-09 17:26:40 UTC

ah, I was hoping it would survive long enough to capture a backtrace. The post-reboot ones aren't really helpful. Though it does show the driver loads and inits successfully.

It's odd because that driver hasn't changed in a long time.

Can you try this..

modprobe cpufreq_ondemand
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo ondemand > "$g"; done

and see if it locks up in the same way?

Comment 6 Martin Klapetek 2008-11-09 20:09:34 UTC

Nope, it does not. I tried echoing it for both processors and the laptop continues to run with no problem. I also found out that cpufreq_ondemand is loaded by default and that /sys/.../scaling_governor is already set to 'ondemand' by default. I'm posting my lsmod output here; maybe it can help...
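The per-CPU governor check mentioned here can be done in one pass. This is a minimal sketch, assuming the usual cpufreq sysfs layout named in comment 5. The function name `show_governors` and its optional root-directory argument are made up for illustration.

```shell
# Print the current cpufreq governor for every CPU that exposes one.
# The optional first argument overrides the sysfs root (hypothetical
# convenience for trying the function against a copy of the tree).
show_governors() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without cpufreq support (and the unmatched glob itself).
        [ -r "$f" ] || continue
        cpu=${f#"$root"/}   # strip the root prefix -> cpu0/cpufreq/...
        cpu=${cpu%%/*}      # keep only the first path component -> cpu0
        printf '%s: %s\n' "$cpu" "$(cat "$f")"
    done
}
```

Running `show_governors` with no argument reads the live system. On a dual-core machine it should print one line each for cpu0 and cpu1.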

Comment 7 Martin Klapetek 2008-11-09 20:10:16 UTC

Created attachment 323023 [details]
lsmod output

Comment 8 Martin Klapetek 2008-11-09 20:19:07 UTC

I'm thinking... isn't there some switch to boot the kernel with that would make it log immediately? Maybe I'll try booting without quiet and then scroll back up the screen, if that is possible. And what about bootstrap? Does it write its output immediately, or does it keep it in memory and dump it later? Could bootstrap help us here?

Comment 9 Martin Klapetek 2008-11-11 14:39:14 UTC

Oops sorry, I meant Bootchart, not bootstrap :)

Comment 10 Martin Klapetek 2008-11-11 15:43:13 UTC

Actually the bootchart did help. I booted the kernel with it and it showed the whole bug backtrace. I'm including a photo of the backtrace screen. If something's not readable, I have more detailed and clear photos, so just ask for them :)

Comment 11 Martin Klapetek 2008-11-11 15:44:43 UTC

Created attachment 323182 [details]
Backtrace screen photo

Comment 12 Dave Jones 2008-11-11 15:53:33 UTC

ah, excellent. thanks.

Comment 13 Martin Klapetek 2008-11-18 19:53:08 UTC

Any update on this? I'm running 2.6.27.5-109.fc10.x86_64 and that SIGSEGV is still present. It's getting pretty annoying that about every second/third boot I need to turn off the laptop and then turn it back on (it's even more annoying as laptops do not have a reset button :)

Comment 14 Dave Jones 2008-11-18 20:05:44 UTC

No fix yet.  We've actually seen reports of this dating back some time, even as far back as RHEL5 (2.6.18).

Comment 15 Bug Zapper 2008-11-26 04:59:50 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 16 Dave Jones 2008-12-09 20:39:41 UTC

*** Bug 462648 has been marked as a duplicate of this bug. ***

Comment 17 Dave Jones 2008-12-09 21:05:24 UTC

I wish I could find a machine to reproduce this on.   In the meantime, something to try..
there's a kernel boot parameter called printk.bootdelay=1000
setting that will cause a delay of 1 second to occur every time the kernel prints a message. So your boot will go _really slowly_.

The plus side of this however is that we can see messages before they scroll off the screen.  I'm curious what cpufreq is doing just before that oops happens, so boot with   cpufreq.debug=7 printk.bootdelay=1000
(you can increase the 1000 for longer delays, 2000=2s, 3000=3s etc up to 10s).

If you could capture the screen before that first [cut here] message appears, that might yield some more clues.

Thanks.

Oh, and I'm pretty sure it's unrelated but both this report and the one in 462648 are tainted with binary modules.  It's unlikely that two different modules are causing the same problem here, but just to rule it out if you could try and reproduce the trace without it loaded, that would be one less thing to worry about.

Comment 18 Martin Klapetek 2008-12-09 21:30:20 UTC

By binary modules, do you mean proprietary drivers like AMD's fglrx? The only
"non-standard" modules that I'm using are Broadcom's wl driver, AMD's fglrx
(the first report was without fglrx, though) and a custom-built wacom
(touchscreen) driver. Wacom and fglrx have no impact on this, as the original
report was made without them, so only wl is left. It actually may be
related, because I noticed that the crash usually happens right before the wifi
LED should light up blue (if it normally lights up at, say, the 15th second,
the crash happens at the 14th second), but this may be pure coincidence.

I'll try what you suggested, I'll also investigate wl
further, and I'll post back as soon as I find something useful.

Comment 19 Dave Jones 2008-12-09 21:35:42 UTC

Thanks.
Also, I typoed the above.  Looks like it should be  printk.boot_delay=1000

I missed the _ character.
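Since a mistyped boot parameter is silently ignored, it can help to confirm after booting that the option really reached the kernel command line. This is a rough sketch, assuming the parameters are visible in /proc/cmdline. The helper name `has_boot_param` and its optional file argument are hypothetical. It only tests whether a word is present, not whether the kernel recognizes that spelling.

```shell
# Succeed if the kernel command line contains NAME or NAME=value.
# Usage: has_boot_param NAME [CMDLINE_FILE]
# The body runs in a subshell so that "set -f" does not leak to the caller.
has_boot_param() (
    set -f  # disable globbing while the command line is split into words
    for word in $(cat "${2:-/proc/cmdline}"); do
        case $word in
            "$1"|"$1"=*) exit 0 ;;
        esac
    done
    exit 1
)
```

For example, `has_boot_param cpufreq.debug && echo "debug option is active"` prints the message only if the option was passed at boot.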

Comment 20 Tom Mitchell 2008-12-09 22:23:05 UTC

    Some are having problems capturing the backtrace.  By
    moving cpuspeed to be almost last in the startup sequence
    the error can often be captured and is apparently sent automatically
    to kernel.org via kerneloops

    i.e 
           service cpuspeed stop
           chkconfig cpuspeed off
       change the chkconfig line of /etc/init.d/cpuspeed to have 99...
       # chkconfig: 12345 99 99
       then re enable it..
           chkconfig cpuspeed on
           service cpuspeed start

    I do not know of a simple way to update the rpm toward this end
    but if someone does then the kerneloops data might make this
    easy to track and measure.    There is no reason I know to
    have this chkconfig'ed at 06...and it might gather more info from
    'untainted' kernels if it was started much later.
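The manual edit of the chkconfig header described above could be scripted roughly as follows. This is only a sketch: the function name `set_chkconfig_priority` is invented, and it assumes the header has the form `# chkconfig: <runlevels> <start> <stop>`. It writes the modified script to standard output instead of editing in place, so the result can be reviewed first.

```shell
# Rewrite the start/stop priorities in an init script's chkconfig header.
# Usage: set_chkconfig_priority FILE START STOP
# The runlevel field (third word of the header line) is preserved as-is.
set_chkconfig_priority() {
    awk -v start="$2" -v stop="$3" '
        /^# chkconfig:/ { print "# chkconfig:", $3, start, stop; next }
        { print }
    ' "$1"
}
```

For example, `set_chkconfig_priority /etc/init.d/cpuspeed 99 99 > /tmp/cpuspeed.new` produces a copy with the late start priority. The service still needs to be turned off and on again with chkconfig, as in the steps above, for the change to take effect.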

    To reproduce I was previously able to trigger it with a loop
    "service cpuspeed stop; sleep 1; service cpuspeed start" and 
    capture the oops back trace.

    I still see this. Currently I have fc9 kernel:
       2.6.27.5-41.fc9.x86_64
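The stop/sleep/start loop mentioned above might be written as a small script like the one below. This is a sketch under assumptions: the names `cycle_cpuspeed` and `run_step` and the `DRY_RUN` and `DELAY` variables are made up here. By default it only prints the commands, so it does nothing until `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: cycle the cpuspeed service repeatedly to try to provoke the oops.
# DRY_RUN defaults to 1 (print the commands only); set DRY_RUN=0 and run
# as root to actually restart the service. DELAY is the pause in seconds.

# Print the command in dry-run mode, otherwise execute it.
run_step() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: cycle_cpuspeed [COUNT]   (COUNT defaults to 10 iterations)
cycle_cpuspeed() {
    count=${1:-10}
    i=1
    while [ "$i" -le "$count" ]; do
        run_step service cpuspeed stop
        sleep "${DELAY:-1}"
        run_step service cpuspeed start
        i=$((i + 1))
    done
}
```

Something like `DRY_RUN=0 cycle_cpuspeed 50` from a root shell would run fifty cycles. Whether that is enough to trigger the crash on a given machine is not known.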

Comment 21 Brian Maly 2008-12-10 09:21:19 UTC

Is it possible to test a kernel patch on the affected hardware? If so I can spin a test patch. Thanks.

Comment 22 Martin Klapetek 2008-12-10 21:09:47 UTC

I'm able and willing to test such a kernel; let me know about that spin.

Comment 23 Martin Klapetek 2008-12-30 15:00:15 UTC

I just want to let you know that with the second-to-last kernel update (I think it is 2.6.27.7-137 or something like that, not the newest update though) this crash appears much less often than with the kernel I reported against. Back then it was every 3rd boot; now it is more like every 15th boot. I've also got suspend to RAM working, so I'm mostly just suspending, but even so it sometimes won't wake up, and everything including the keyboard (caps lock LED) is dead. Could it be that the module is loaded again after waking up, and that this hangs the kernel?

Comment 24 Brian Maly 2009-02-03 02:48:19 UTC

I spun a FC9 kernel with debugging turned on in powernow-k8. Give this a try and attach all debug output. 

http://people.redhat.com/bmaly/kernel-2.6.28.2-6.fc9.x86_64.rpm 


BTW, from the console output in Comment #11, it looks like the policy (target CPU freq) the governor is attempting to set is null. Let's see if there is any obvious failure that gets logged before we start dissecting code.

Comment 25 Chuck Ebbert 2009-02-03 21:44:50 UTC

(In reply to comment #24)
> I spun a FC9 kernel with debugging turned on in powernow-k8. Give this a try
> and attach all debug output. 
> 
> http://people.redhat.com/bmaly/kernel-2.6.28.2-6.fc9.x86_64.rpm 
> 
> 

F9 development for the 2.6.27 kernel is continuing in a branch:

   private-fedora-9-2_6_27-branch

We may or may not move F9 to 2.6.29 on the trunk later, but for now it's dead.

Comment 26 Martin Klapetek 2009-02-08 16:51:30 UTC

I tried to install it, but it left me with this:


sudo rpm -Uvf kernel-2.6.28.2-6.fc9.x86_64.rpm
error: Failed dependencies
        kernel-firmware >= 2.6.28.2-6.fc9 is needed for kernel-2.6.28.2-6.fc9.x86_64
        kernel-uname-r = 2.6.27.7-134.fc10.x86_64  is needed for (installed) kmod-wl-2.6.27.7-134.fc10.x86_64-5.10.27.6-5.fc10.7.x86_64
        kernel-uname-r = 2.6.27.9-159.fc10.x86_64 is needed for (installed) kmod-wl-2.6.27.9-159.fc10.x86_64-5.10.27.12-1.fc10.x86_64
        kernel-uname-r = 2.6.27.12-170.2.5.fc10.x86_64 is needed for (installed) kmod-wl-2.6.27.12-170.2.5.fc10.x86_64-5.10.27.12-1.fc10.1.x86_64

The strings may differ slightly, as I translated them into English from my native language. I'm not sure what to do; would removing kmod-wl be sufficient? I'm asking first because I don't want to break my working distro.
 
I should also note that I haven't seen this bug occur for quite some time now, using 2.6.27.12-170.2.5.fc10.x86_64.

Comment 28 Bug Zapper 2009-11-18 07:58:08 UTC

This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 29 Bug Zapper 2009-12-18 06:46:59 UTC

Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.