Bug 130737

Summary: Fresh install doesn't start to boot past grub/lilo
Product: [Fedora] Fedora
Reporter: Michael Douglass <mikedoug>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED NEXTRELEASE
Severity: high
Priority: medium
Version: 2
CC: pfrields, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-04-16 05:33:44 UTC
Attachments:
  my .config file for "stripped down kernel" (flags: none)
  check edid function presence. (flags: none)

Description Michael Douglass 2004-08-24 03:35:35 UTC
After a fresh install of Fedora Core 2 on my Dell Dimension XPS T500,
the system refuses to boot.  This system has never run FC2 -- I
installed it on a new hard drive and still (thankfully) have my old
Red Hat 8 on the previous hard drive, which I boot into to actually
use my system.  I've done a lot of debugging already, so bear with me.

The grub phase of the boot seems to work just fine.  The grub menu
comes up and I can select the kernel to boot.  The screen then blanks
out and I get this (note this is from a custom build of the kernel
that I'll discuss below):

  Booting 'Fedora Core (2.6.8-1.521custom)'

root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
kernel /vmlinuz-2.6.8-1.521custom ro root=LABEL=/1 rhgb quiet
   [Linux-bzImage, setup=0x1400, size=0x1299b6]
initrd /initrd-2.6.8-1.521custom.img
   [Linux-initrd @ 0x40c0000, 0x2d3d8 bytes]


And that is where it sits.  (Yes, label '/1' is correct -- my Red Hat 8
hard drive is where the ISOs I installed from were stored, and it
has partitions labeled / and /boot -- and thus anaconda used /1 and
/boot.)

I have a friend with an XPS T600 (which should only differ by the
clock speed of the CPU in the system) who has absolutely no problems
with FC2.  I compared bios revision levels (and actually downgraded to
match his) and even made every setting in my BIOS match his to a T.

I have removed all unnecessary devices (sound card, modem, cdrom drives)
and the bootup is the same.  I have pulled out all three sticks of my
memory and tried one at a time with no change.

I upgraded to the 2.6.8 released kernel from Fedora and saw no "real"
change to the problem.  At this point I was able to actually get it to
boot by being patient and letting it sit there a LONG time.  By LONG I
mean anywhere from 15 minutes to an hour (I'm not sure how long it
took as I was away both times when it booted).  Both times it was when
I manually typed in the commands at the grub command prompt -- but I
don't know if that has any real bearing as I've done that and had it
sit there for an hour without booting.

I ran across another person who was having long timeouts of 1.5
minutes at this SAME point and he fixed his by recompiling the kernel
with EDD disabled (CONFIG_EDD=n) -- thus I naturally tried that (hence
the custom kernel).

Thinking that it might POSSIBLY be grub, I installed LILO on the
system and dumped it onto the MBR.  When I hit enter at the LILO
prompt (man do I miss the old LILO days -- more sentimentality than
usability, grub is better in that sense :) I get "Loading Linux", then
some "." come across the screen and then it too stops.

In neither case do I EVER get "Uncompressing Kernel", or "Loading
zImage", or whatever the hell you're supposed to get.

The strange thing is that I grabbed a random "old" XPS R450 (PII 450)
from the office, threw the hard drive in there and it gave the same
results hanging at the same place.  I found this odd, so I moved the
hard drive into my P4 system (home grown with Intel mobo) and it
booted up just fine.  I sent the hard drive home with my coworker with
the XPS T600 system and it booted up just fine, no problems.

The 2.4 kernels appear to have no problems on this system.  I'm going
to keep prodding but would LOVE anyone else's thoughts and ideas on
things to try to beat this bug.

Thanks!

Comment 1 Michael Douglass 2004-08-24 03:55:21 UTC
Is there any way to turn on REALLY REALLY REALLY early debug
information (even if I have to compile it in -- or worse, add the code
in myself...)?  Even if it is as simple as:

printf ("1\n");

 ...

printf ("2\n");
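
As a rough illustration of the numbered-marker idea above, here is a minimal
user-space C sketch.  It is not kernel code: the real-mode setup code has no
printf available, and the names checkpoint, stage_one, stage_two and
run_stages are hypothetical.  It only shows the pattern of printing and
flushing a marker before each step, so that the last marker seen brackets
the place where a hang occurs.

```c
#include <assert.h>
#include <stdio.h>

/* Last marker reached.  Hypothetical helper state, not a kernel facility. */
static int last_checkpoint = 0;

/*
 * Print a numbered marker and flush it right away, so the marker is
 * visible even if the program wedges immediately afterwards.
 */
static void checkpoint(int id)
{
    assert(id > 0);
    last_checkpoint = id;
    fprintf(stderr, "%d\n", id);
    fflush(stderr);
}

/* Hypothetical stand-ins for the boot stages being bracketed. */
static void stage_one(void) { /* work that might hang */ }
static void stage_two(void) { /* work that might hang */ }

/* Run every stage with a marker in front of it; returns the last marker. */
static int run_stages(void)
{
    checkpoint(1);
    stage_one();
    checkpoint(2);
    stage_two();
    checkpoint(3);
    return last_checkpoint;
}
```

If the output stops after "2", the hang is somewhere inside stage_two.  The
same bisecting approach applies to assembler, with whatever output routine
is available at that point in the boot.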

Comment 2 Arjan van de Ven 2004-08-24 05:56:27 UTC
the first thing to do is remove the quiet flag from the kernel
commandline. That tells the kernel to not print anything...


Comment 3 Michael Douglass 2004-08-24 06:04:04 UTC
Okay, so I went in and did some tweaking.  I went into the kernel and
configured it down to the barest possible configuration.  I will
attach my .config file shortly.  With this .config file I get a little
further, I actually get the "Uncompressing Linux...  Ok, booting the
kernel."  Now the system stops here.

I'm messing with putting debug into the various boot time assembler
files to see if I can detect where things are wedging up.  I'm also
going to sleuth out where the injection point into the kernel is and
put some type of print debug in there as well...  Any assistance from
someone who knows the kernel would be greatly appreciated. :)

Comment 4 Michael Douglass 2004-08-24 06:05:35 UTC
Created attachment 103014 [details]
my .config file for "stripped down kernel"

Comment 5 Michael Douglass 2004-08-24 06:08:50 UTC
I removed the 'quiet' flag as you suggested and here is what I get now
before the system stops booting (perhaps I stripped it down too much):

BIOS-provided physical RAM map:
 (The bios ram map, please let me know if you need it)
0MB HIGHMEM available.
512MB LOWMEM available.
zapping low mappings.
DMI 2.1 present.

And it stops there now.

The good thing is that it DOES uncompress the kernel and appears to
begin booting.  This is a large step forward as far as I am concerned.  

Comment 6 Michael Douglass 2004-08-24 06:22:39 UTC
Just on a hunch, I went ahead and tried pci=noacpi acpi=off and it
still stopped at the DMI 2.1 present point.

I will continue again tomorrow with adding in debug output trying to
pinpoint where I'm blocking up.  (Unless, of course, there is a
simpler way to debug this.)

Comment 7 Michael Douglass 2004-08-26 04:57:49 UTC
I'm still working on different things to see if I can pinpoint the
problem.  Any pointers would be great.  It seems to be somewhere in
the arch/i386/boot/ code.  Since I'm only somewhat decent at assembler,
this is going to be more difficult for me than if it were in the C code.

Thanks,

Comment 8 Michael Douglass 2004-08-27 06:13:06 UTC
Okay, after spending a bit of time in a crash course of assembler I
spent an even greater deal of time futzing around with the files in
arch/i386/boot.  My lockups are being caused by the store_edid
function in the video.S file.  (Note, these tests were done using
2.6.0; however the changes to the arch/i386/boot directory since that
kernel revision have been relatively minor -- and I got the same
lockups with 2.6.8.1.)

I can counteract this problem by disabling CONFIG_VIDEO_SELECT in the
.config file.  By doing this the store_edid function is not used and
therefore allows my kernel to boot properly.
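
To show why deselecting the option sidesteps the hang, here is a small C
model of a config symbol compiling a call out entirely.  The real code is
assembler in video.S; store_edid_model, video_setup_model and
edid_probe_calls are hypothetical names used only for this sketch, and the
model's probe just fills a buffer instead of calling the video BIOS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EDID_LEN 128

/* Counts how often the modelled EDID probe ran. */
static int edid_probe_calls = 0;

#ifdef CONFIG_VIDEO_SELECT
/* Stand-in for the EDID probe; only compiled when the option is set. */
static void store_edid_model(unsigned char *buf)
{
    edid_probe_calls++;
    memset(buf, 0x13, EDID_LEN);
}
#endif

/*
 * Stand-in for the video setup path.  Without CONFIG_VIDEO_SELECT the
 * probe is not merely skipped at run time: it does not exist in the
 * build, and the buffer is left untouched.
 */
static void video_setup_model(unsigned char *buf)
{
    assert(buf != NULL);
#ifdef CONFIG_VIDEO_SELECT
    store_edid_model(buf);
#endif
}
```

Built without -DCONFIG_VIDEO_SELECT, calling video_setup_model leaves the
buffer exactly as it was and the probe counter at zero, which matches the
observation that the memory range is not initialized at all in that
configuration.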

My next step is to download the SRPM for 2.6.8.1 and recompile it with
the stock config file for the 686 non-smp with CONFIG_VIDEO_SELECT
deselected.  I will report back my (hopefully!) success at that time. 

In case it matters, or if anyone cares, this system has an old Diamond
Viper V770D Ultra Nvidia (OEM from Dell).  I haven't had any problems
with it under the 2.4 kernel running X and all -- I did note that the
2.4 kernel did NOT have this edid function either though.

Let me know if I can be of any assistance debugging this further in
hopes of a potential fix.  I'm extremely computer literate and a
strong programmer.

Thanks.

Comment 9 Michael Douglass 2004-08-27 06:38:17 UTC
Color me intrigued...  I found this referencing my problem:

http://lkml.org/lkml/2003/5/20/110

Reading that, it sounds like it would be trivial to use the
installation check function first; perhaps that would prevent the problem.

I am confused as to why you memset the memory range to 0x13131313 first,
and then fill it with the edid information.  If CONFIG_VIDEO_SELECT is
disabled, that range of memory isn't even initialized at all.  I may,
just for giggles, attempt to put in the test call and have it jump over
the offending call if it fails.  Should the memory range be
initialized to 0x13131313 or not in that case?
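
The "test call first, jump over the read if it fails" idea could be sketched
in C roughly as follows.  This is a user-space model, not the video.S code:
struct ddc_bios, store_edid_checked, edid_present and the fake_* functions
are all hypothetical, with function pointers standing in for the video BIOS
calls.  It also assumes the 0x13 fill is kept as a "no EDID" marker when the
check fails, which is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EDID_LEN      128
#define EDID_SENTINEL 0x13

/* Hypothetical abstraction of the video BIOS DDC services. */
struct ddc_bios {
    int (*installed)(void);               /* nonzero if DDC is supported */
    int (*read_edid)(unsigned char *buf); /* nonzero on success          */
};

/* Returns 1 if an EDID block was read, 0 if the read was skipped or failed. */
static int store_edid_checked(const struct ddc_bios *bios, unsigned char *buf)
{
    assert(bios != NULL && buf != NULL);
    memset(buf, EDID_SENTINEL, EDID_LEN);  /* assumed "no EDID" marker */
    if (!bios->installed || !bios->installed())
        return 0;                          /* jump over the risky call */
    if (!bios->read_edid || !bios->read_edid(buf)) {
        memset(buf, EDID_SENTINEL, EDID_LEN);  /* discard partial data */
        return 0;
    }
    return 1;
}

/* A consumer can detect "nothing read" by checking for the sentinel fill. */
static int edid_present(const unsigned char *buf)
{
    size_t i;
    for (i = 0; i < EDID_LEN; i++)
        if (buf[i] != EDID_SENTINEL)
            return 1;
    return 0;
}

/* Fake BIOS used only to exercise the logic above. */
static int fake_read_calls = 0;
static int fake_not_installed(void) { return 0; }
static int fake_installed(void)     { return 1; }
static int fake_read_edid(unsigned char *buf)
{
    fake_read_calls++;
    memset(buf, 0, EDID_LEN);
    memset(buf + 1, 0xFF, 6);  /* EDID header: 00 FF FF FF FF FF FF 00 */
    return 1;
}
```

With fake_not_installed plugged in, the read function is never invoked and
the buffer holds only the sentinel; with fake_installed it is filled
normally.  On hardware where the check call itself hangs, this approach
would of course not help.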

Thanks,

Comment 10 Dave Jones 2005-02-13 06:18:14 UTC
Created attachment 111033 [details]
check edid function presence.

Patch to do the check will probably be this.
Can you do a build with this, and let me know how that works out for you?

I'm going to try it on a few boxes here to be sure nothing regresses before
I commit this to the Fedora tree.  If it works out ok, I'll push it upstream.

Comment 11 Dave Jones 2005-04-16 05:33:44 UTC
Fedora Core 2 has now reached end of life, and no further updates will be
provided by Red Hat.  The Fedora legacy project will be producing further kernel
updates for security problems only.

If this bug has not been fixed in the latest Fedora Core 2 update kernel, please
try to reproduce it under Fedora Core 3, and reopen if necessary, changing the
product version accordingly.

Thank you.