Bug 728186
Summary: | kernel 2.6.40-4.fc15.i686.PAE freezes randomly
---|---
Product: | [Fedora] Fedora
Component: | kernel
Version: | 15
Hardware: | i686
OS: | Linux
Status: | CLOSED DUPLICATE
Severity: | unspecified
Priority: | unspecified
Reporter: | bob mckay <urilabob>
Assignee: | Kernel Maintainer List <kernel-maint>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Doc Type: | Bug Fix
Last Closed: | 2011-12-05 05:32:18 UTC

Description
bob mckay
2011-08-04 10:24:14 UTC
I'm sorry, I'm not sure what has happened here - after the most recent updates, processor.max_cstate=1 has stopped curing the problem (i.e. if I use kernel 2.6.40-4.fc15.i686.PAE, the machine now reliably freezes toward the end of the boot process, and has to be restarted by pulling power and battery). The only way I can run the machine now is by reverting to kernel 2.6.38.8-35.fc15.i686.PAE. I'm still not sure how to provide useful diagnostics - boot.log just contains the most recent (successful) boot. Any advice on how to provide anything useful would be appreciated.

It's quite possible that this is a duplicate of bug 731696 - the main difference is that it generally occurs for me during boot, so I don't get to the login screen - but that could just be the result of a slower system.

The processor is a Mobile AMD Sempron(tm) Processor 3000+; I'm attaching dmidecode information.

Created attachment 519023 [details]
dmidecode output regarding system

Could you try installing kernel-debug-2.6.40.6-0.fc15 and see if you can get a backtrace instead of just a hang?

I'm sorry, it seems that it doesn't help. Once the system hangs (debug or standard kernel), it is completely stuck; I couldn't find any way to get it to respond further other than by pulling all power and rebooting. Is there any way to turn on any further options in the debugging kernel that might show more logging just before it fails?

It may be worth noting that the point of failure keeps changing with different updates of the 2.6.40 kernel. Earlier versions caused it to fail very early in the boot sequence (before logging started). With more recent updates, it boots OK (nothing particularly bad that I could see in /var/log/messages, but maybe I don't know how to read it fully), brings up the greeter login screen, but hard-crashes on login (before it brings up the desktop - lxde in my case). However, the debug kernel crashed rather earlier (I didn't carefully note this before; I will check further and report in more detail).

I would really appreciate any suggestions on how I can get more information that might be useful - I realise that right now I'm not providing enough information to figure out what is going wrong, but I'm really stuck as to where I can get more.

OK, my apologies. Last time I ran the debug kernel, I came back some time later to find the screen black and the system hung. This time, I was actually next to the machine when it crashed (during cups initialisation, in case it's relevant), and I realised there was a trace. However, I can't find any sign of it in the filesystem. Is there any way to get this trace echoed to the filesystem (maybe it already is and I just don't know where to look)? Or do I need to copy it by hand (actually, I strongly suspect the relevant stuff is off the screen anyway, so this probably wouldn't be useful)?

Googling, all I've found is (very old) info about echoing the trace to a serial line, but unfortunately the machine doesn't have a serial port...

If you have no serial port, you can simply take a picture with a camera/cell phone and attach it here. You might want to add 'pause_on_oops=<N>', where <N> is the number of seconds to pause, if you think the relevant portion of the trace is scrolling off the screen. The other alternative is setting up kdump to capture a vmcore.

Well, fwiw, I've attached what I have managed to get so far. I'm not sure if it's useful; I will try to get more of the trace next time.

Created attachment 527948 [details]
Start of crash trace
Created attachment 527949 [details]
End of crash trace
End of crash trace.
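
For readers following the `pause_on_oops=<N>` suggestion in this thread: the option is passed on the kernel command line of the boot entry. The following is only a minimal sketch, not something posted by the participants. It shows one way to append parameters to a GRUB-legacy-style `kernel` line without duplicating options that are already present. The function name `add_kernel_params`, the sample line (including its path and root device), and the value 30 are illustrative assumptions.

```python
def add_kernel_params(kernel_line, params):
    """Return kernel_line with each parameter appended unless its key is present.

    The key is the text before '=' (or the whole token for bare flags
    such as 'noapic'). Leading indentation of the line is preserved.
    """
    stripped = kernel_line.lstrip()
    indent = kernel_line[: len(kernel_line) - len(stripped)]
    tokens = stripped.split()
    existing = {token.split("=", 1)[0] for token in tokens}
    for param in params:
        key = param.split("=", 1)[0]
        if key not in existing:
            tokens.append(param)
            existing.add(key)
    return indent + " ".join(tokens)


# Hypothetical boot entry line; the path and root device are placeholders.
sample = "\tkernel /vmlinuz-example ro root=/dev/sda1 quiet"
print(add_kernel_params(sample, ["pause_on_oops=30"]))
```

The helper only builds the new line as a string; writing it back to the boot loader configuration is deliberately left out, since that file's location and format depend on the installation.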

I'm sorry, progress on this is slow, because now most crashes are occurring around the time the system desktop appears (which means I don't get a trace). However, on one failed attempt, I did notice "Fatal: module sunrpc already in kernel" flash by; not sure whether it is relevant.

(In reply to comment #10)
> Created attachment 527949 [details]
> End of crash trace

There should be at least one more screen of oops text above that one, maybe even two.

Hi Chuck; thanks, but I think we are asking the impossible here: the screen scrolls far faster than a 25 frames per second mobile camera can capture. Here are the critical frames, at 40ms intervals. Any other ideas? Are there any ways to slow down the screen scrolling, for example?

Created attachment 528946 [details]
critical frames of crash dump (25 fps)

Further problem isolation: the problem seems to be related to APIC - running with noapic, the system seems to run without any problems (so far, at least).

Hmmm, there is one problem after all - my RT2500 card is no longer working when I run under 2.6.40 with noapic. Searching for that brought me to Bug 731672; it looks highly consistent with what I am seeing, so I am marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 731672 ***
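
When comparing boots with and without the workarounds named in this report (`noapic`, `processor.max_cstate=1`), it can help to confirm which options the running kernel actually received; on Linux these are exposed in `/proc/cmdline`. A small sketch under that assumption follows. The helper names `parse_cmdline` and `read_running_cmdline` and the example string are illustrative, not taken from the thread.

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter to its value (None for bare flags)."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Illustrative command line; on a live system use read_running_cmdline().
example = "ro root=/dev/sda1 noapic processor.max_cstate=1"
params = parse_cmdline(example)
print("noapic active:", "noapic" in params)
print("max_cstate:", params.get("processor.max_cstate"))
```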