Bug 728186

Summary: kernel 2.6.40-4.fc15.i686.PAE freezes randomly
Product: [Fedora] Fedora
Reporter: bob mckay <urilabob>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 15
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-05 05:32:18 UTC
Type: ---
Attachments:
Description                               Flags
dmidecode output regarding system         none
Start of crash trace                      none
End of crash trace                        none
critical frames of crash dump (25 fps)    none

Description bob mckay 2011-08-04 10:24:14 UTC
Description of problem:
Machine freezes randomly after boot

Version-Release number of selected component (if applicable):
2.6.40-4.fc15.i686.PAE

How reproducible:
Reliably after ~60 seconds

Steps to Reproduce:
1. Install kernel 2.6.40-4.fc15.i686.PAE
2. Boot
  
Actual results:
System freezes either during boot or within the first 30 seconds or so of running. Can only be unfrozen by pulling power.

Expected results:
System runs normally.

Additional info:
This looks awfully like a regression of Bug 704059 - "Machine doesn't smoothly boot after installing 2.6.35.13-91.fc14.x86_64"
Similarities:
.it affects exactly the same machine (none of my other machines are affected) - averatec notebook with AMD mobile sempron 3000+ processor
.in the same way (random freezes)
.adding "processor.max_cstate=1" to the boot options cures it

I'm happy to provide any other necessary information - just not sure what is relevant.
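
For reference, one way to confirm that a boot option such as processor.max_cstate=1 actually reached the running kernel is to inspect /proc/cmdline. The following is a minimal Python sketch and not part of the original report. The function names parse_cmdline and read_cmdline are hypothetical, and the parser assumes the command line contains no quoted values with embedded spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (for example 'noapic') map to None.
    Assumption: no quoted values containing spaces.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Return the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

On the affected machine, parse_cmdline(read_cmdline()).get("processor.max_cstate") would be expected to return "1" if the option was applied, and None if it was not passed at boot.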

Comment 1 bob mckay 2011-08-17 09:13:24 UTC
I'm sorry, I'm not sure what has happened here. After the most recent updates, processor.max_cstate=1 has stopped curing the problem: with kernel 2.6.40-4.fc15.i686.PAE, the machine now reliably freezes toward the end of the boot process and has to be restarted by pulling power and battery. The only way I can run the machine now is by reverting to kernel 2.6.38.8-35.fc15.i686.PAE. I'm still not sure how to provide useful diagnostics, since boot.log contains only the most recent (successful) boot. Any advice on how to provide something useful would be appreciated.

Comment 2 bob mckay 2011-08-19 11:46:38 UTC
It's quite possible that this is a duplicate of bug 731696 - the main difference is that it generally occurs for me during boot, I don't get to the login screen - but that could just be the result of a slower system. The processor is Mobile AMD Sempron(tm) Processor 3000+, I'm attaching dmidecode information.

Comment 3 bob mckay 2011-08-19 11:47:53 UTC
Created attachment 519023
dmidecode output regarding system

Comment 4 Josh Boyer 2011-10-11 13:27:09 UTC
Could you try installing kernel-debug-2.6.40.6-0.fc15 and see if you can get a backtrace instead of just a hang?

Comment 5 bob mckay 2011-10-12 09:29:30 UTC
I'm sorry, it seems that it doesn't help. Once the system hangs (debug or standard kernel), it is completely stuck, I couldn't find any way to get it to respond further other than by pulling all power and rebooting. Is there any way to turn on any further options in the debugging kernel that might show more logging just before it fails? 

It may be worth noting that the point of failure keeps changing with different updates of the 2.6.40 kernel. Earlier versions caused it to fail very early in the boot sequence (before logging started). With more recent updates, it boots OK (nothing particularly bad I could see in /var/log/messages, but maybe I don't know how to read it fully), brings up the greeter login screen, but hard-crashes on login (before it brings up the desktop - lxde in my case). However the debug kernel crashed rather earlier (I didn't carefully note this before, will check further and report in more detail). 

I would really appreciate any suggestions on how I can get more information that might be useful - I realise that right now, I'm not providing enough information to figure out what is going wrong, but I'm really stuck to see where I can get more information.
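
For reference, a log such as /var/log/messages can be searched for the strings the kernel commonly prints on a serious fault. The following is a minimal Python sketch under stated assumptions: the marker list is illustrative rather than exhaustive, and the names MARKERS, find_oops_lines and scan_log are hypothetical. A hard freeze may leave nothing in the log at all, because the final messages may never be written to disk, so an empty result does not rule out an oops.

```python
import re

# Substrings the kernel commonly prints on a serious fault.
# Illustrative assumption, not an exhaustive catalogue.
MARKERS = re.compile(r"BUG:|Oops|Call Trace:|Kernel panic|soft lockup")


def find_oops_lines(lines):
    """Return (line_number, text) pairs for lines that match MARKERS."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if MARKERS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/messages"):
    """Scan a log file; reading it usually requires root privileges."""
    with open(path, errors="replace") as f:
        return find_oops_lines(f)
```

Calling scan_log() as root returns the matching lines with their line numbers, which gives a starting point for reading the surrounding context by hand.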

Comment 6 bob mckay 2011-10-12 10:19:32 UTC
OK, my apologies. Last time I ran the debug kernel, I came back sometime later to find the screen black and the system hung. This time, I was actually next to the machine when it crashed (during cups initialisation, in case it's relevant), and I realised there was a trace. However I can't find any sign of it in the filesystem. Is there any way to get this trace echoed to the filesystem (maybe it already is and I just don't know where to look)? Or do I need to copy it by hand (actually, I strongly suspect the relevant stuff is off the screen anyway, so this probably wouldn't be useful). Googling, all I've found is (very old) info about echoing the trace to a serial line, but unfortunately the machine doesn't have a serial port...

Comment 7 Josh Boyer 2011-10-12 14:33:49 UTC
If you have no serial port, you can simply take a picture with a camera/cell phone and attach it here.  You might want to add 'pause_on_oops=<N>' where <N> is the number of seconds to pause if you think the relevant portion of the trace is scrolling off the screen.

The other alternative is setting up kdump to capture a vmcore.
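
For reference, the pause_on_oops suggestion amounts to adding one more token to the kernel command line. The following is a minimal Python sketch of composing such a line. The helper name set_param and the value 30 are hypothetical choices for illustration, and the resulting string would still have to be entered in the boot loader configuration or at its edit prompt by hand.

```python
def set_param(cmdline, name, value=None):
    """Return cmdline with `name` set, replacing any earlier setting.

    With value=None the parameter is added as a bare flag.
    Assumption: no quoted values containing spaces.
    """
    kept = [t for t in cmdline.split() if t.partition("=")[0] != name]
    kept.append(name if value is None else f"{name}={value}")
    return " ".join(kept)
```

For example, set_param("ro quiet", "pause_on_oops", 30) yields a command line that ends with pause_on_oops=30, asking the kernel to pause for 30 seconds after the first oops is printed.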

Comment 8 bob mckay 2011-10-13 10:23:16 UTC
Well fwiw, I've attached what I have managed to get so far. I'm not sure if it's useful; will try to get more of the trace next time.

Comment 9 bob mckay 2011-10-13 10:25:08 UTC
Created attachment 527948
Start of crash trace

Comment 10 bob mckay 2011-10-13 10:26:02 UTC
Created attachment 527949
End of crash trace

Comment 11 bob mckay 2011-10-13 10:54:56 UTC
I'm sorry, progress on this is slow, because most crashes now occur around the time the system desktop appears (which means I don't get a trace). However, on one failed attempt I did notice
"Fatal: module sunrpc already in kernel"
flash by. I'm not sure whether it is relevant.

Comment 12 Chuck Ebbert 2011-10-18 05:00:24 UTC
(In reply to comment #10)
> Created attachment 527949
> End of crash trace

There should be at least one more screen of oops text above that one, maybe even two.

Comment 13 bob mckay 2011-10-19 09:28:24 UTC
Hi Chuck; thanks, but I think we are asking the impossible here, the screen scrolls far faster than a 25 frames per second mobile camera can capture. Here are the critical frames, at 40ms intervals. Any other ideas? Are there any ways to slow down the screen scrolling, for example?

Comment 14 bob mckay 2011-10-19 09:33:13 UTC
Created attachment 528946
critical frames of crash dump (25 fps)

Comment 15 bob mckay 2011-12-05 05:19:53 UTC
Further problem isolation: the problem seems to be related to APIC - running with noapic, the system seems to run without any problems (so far, at least).

Comment 16 bob mckay 2011-12-05 05:32:18 UTC
Hmmm, there is one problem after all: my RT2500 card no longer works when I run under 2.6.40 with noapic. Searching for that brought me to Bug 731672, which looks highly consistent with what I am seeing, so I am marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 731672 ***