144050 – Kernel boot hangs at boot after initrd message (before decompressing kernel)

Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	4
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Matt Domsch

Reported:	2005-01-04 04:01 UTC by Mace Moneta
Modified:	2007-11-30 22:10 UTC
CC List:	5 users
Fixed In Version:	FC4
Last Closed:	2006-01-17 02:29:41 UTC
Type:	---

Attachments
New 2.6.10-737 config based on 2.6.9-681 config (44.55 KB, text/plain) 2005-01-11 01:01 UTC, Mace Moneta
Diff between 2.6.9-681 config and newly created 2.6.10-737 config (6.59 KB, text/plain) 2005-01-11 01:03 UTC, Mace Moneta
Diff between newly created 2.6.10-737 config and Fedora original 2.6.10-737 config (7.28 KB, text/plain) 2005-01-11 01:35 UTC, Mace Moneta
EDD data collection (42.16 KB, text/plain) 2005-01-12 21:05 UTC, Mace Moneta
edd-get-disk-type-before-read.patch (1.33 KB, patch) 2005-08-08 20:27 UTC, Matt Domsch

Description Mace Moneta 2005-01-04 04:01:36 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041111 Firefox/1.0

Description of problem:
kernel-2.6.9-1.724_FC3 hangs at boot, after selection from grub
screen.  Last message displayed is the initrd message (it never gets
to decompressing kernel).  Previous kernels (e.g.,
kernel-2.6.9-1.681_FC3) boot fine.

Version-Release number of selected component (if applicable):
kernel-2.6.9-1.724_FC3

How reproducible:
Always

Steps to Reproduce:
1. yum upgrade kernel
2. reboot
3. select kernel-2.6.9-1.724_FC3 from grub menu
4. hang

Actual Results:  hang

Expected Results:  Decompressing kernel and normal boot.

Additional info:

Boot disk is IDE RAID-1.

Comment 1 Mace Moneta 2005-01-04 04:03:10 UTC

Just a clarification, the boot disk is IDE software RAID-1.

Comment 2 Dave Jones 2005-01-04 04:13:30 UTC

Do you have 'quiet' on the boot command line? If so, can you remove
it, and find out where it's hanging?

Comment 3 Mace Moneta 2005-01-04 05:14:27 UTC

No, I removed the quiet parameter first thing.  It just hangs right
after the initrd messages.  I ctrl-alt-del reboot, select the previous
kernel and all is well, so it doesn't look like a grub issue.  The two
grub entries are:

title Fedora Core (2.6.9-1.724_FC3)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-1.724_FC3 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.9-1.724_FC3.img

title Fedora Core (2.6.9-1.681_FC3)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-1.681_FC3 ro root=/dev/VolGroup00/LogVol00 noapic pci=usepirqmask quiet
        initrd /initrd-2.6.9-1.681_FC3.img

I needed the "noapic pci=usepirqmask" because of bug 131404 that was
causing DMA timeouts on the prior kernels.  The fix is in the
2.6.9-1.724 kernel, so I don't need those parameters on that kernel
(in case you were wondering).  The initrd file looks reasonable:

-rw-r--r--  1 root root 1600271 Dec 11 16:59 /boot/initrd-2.6.9-1.681_FC3.img
-rw-r--r--  1 root root 1598498 Jan  3 21:00 /boot/initrd-2.6.9-1.724_FC3.img

as do the kernels:

-rw-r--r--  1 root root 1732455 Nov 18 15:23 /boot/vmlinuz-2.6.9-1.681_FC3
-rw-r--r--  1 root root 1727262 Jan  2 15:53 /boot/vmlinuz-2.6.9-1.724_FC3

This is an ASUS K8V SE Deluxe motherboard, with an AMD64 3200+ CPU
(512 KB cache).  The drives are RAID-1 (md0 is /boot):

# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 hda2[0] hdg2[1]
      2097024 blocks [2/2] [UU]
      
md2 : active raid1 hda3[0] hdg3[1]
      75977920 blocks [2/2] [UU]
      
md0 : active raid1 hda1[0] hdg1[1]
      102208 blocks [2/2] [UU]
      
unused devices: <none>

Comment 4 Dave Jones 2005-01-04 05:26:28 UTC

Did the updates-testing kernels also hang for you?
You can still grab them (-698 and -715) from
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/3/i386/

Comment 5 Mace Moneta 2005-01-04 05:55:01 UTC

OK, I grabbed the -698 and -715 kernels (from the x86_64 directory). 
The 698 kernel boots OK, but the 715 kernel hangs the same as 724.  I
hope that helps narrow the changes some.

Comment 6 Mace Moneta 2005-01-09 00:14:31 UTC

Well, if it helps any, I've eliminated patches 207, 208, 209, 1148,
1150, 1151, 1890, 1950, 1951, 1952, and 10000 as the source of the
problem.  Compiling the kernel without these patches doesn't help, the
-724 kernel still hangs on boot.  

If anyone has a specific suspect patch in mind, let me know.

Comment 7 Dave Jones 2005-01-09 00:18:25 UTC

Does the 2.6.10-1.735 kernel at
http://people.redhat.com/davej/kernels/Fedora/FC3/RPMS.kernel/ work
any better?

Comment 8 Mace Moneta 2005-01-09 00:36:53 UTC

No, same problem.

Comment 9 Stephen Adler 2005-01-09 14:43:08 UTC

I just want to chime in. I have the same problem with my AMD64 x86
platform. (MSI K8T neo motherboard based on the VIA K8T800 chip
set) I'm adding myself to the cc list so that I can follow the
debug.

Comment 10 Mace Moneta 2005-01-11 01:01:52 UTC

Created attachment 109586
New 2.6.10-737 config based on 2.6.9-681 config

Comment 11 Mace Moneta 2005-01-11 01:03:09 UTC

Created attachment 109588
Diff between 2.6.9-681 config and newly created 2.6.10-737 config

Comment 12 Mace Moneta 2005-01-11 01:04:18 UTC

OK, I have a resolution, but not a specific cause.  After
incrementally eliminating all the patches in -724 (going back to
vanilla 2.6.9) and still having the no boot problem on my AMD64, I
figured it must be the kernel config file.

I downloaded the new 2.6.10-737 kernel (which also wouldn't boot), and
built it using the 2.6.9-681 config file (taking mostly defaults for
the new configuration parameters).  It boots just fine now.

Attached are the 2.6.10-737 config I used, and the diff from the
2.6.9-681 config.  Hopefully someone will notice a configuration
option that was causing the grief.

Comment 13 Mace Moneta 2005-01-11 01:35:49 UTC

Created attachment 109589
Diff between newly created 2.6.10-737 config and Fedora original 2.6.10-737 config

Comment 14 Nathaniel Daw 2005-01-11 15:11:33 UTC

Another me too: I have the same problem with 2.6.10-737 (also
2.6.10-1.1075_FC4 from devel) on my Athlon XP 2000; Gigabyte GA7VAXP
mobo, VIA KT400 chipset. I also had kernel-2.6.9-1.681_FC3 working.

Comment 15 Mace Moneta 2005-01-12 00:35:37 UTC

I isolated the problem causing the boot hang in post -681 released
kernels to the CONFIG_EDD option.  

In 2.6.9-681's config:
# CONFIG_EDD is not set

In 2.6.10-737's config:
CONFIG_EDD=m

It looks like the function will need a blacklist for incompatible systems.
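
The config comparison that isolated this option can be sketched as a
small script.  This is only an illustration, not the procedure actually
used in this report: the function names parse_config and diff_configs
are hypothetical, and the sample input reuses just the two CONFIG_EDD
lines quoted above plus one made-up unchanged option.

```python
import re

# Matches "CONFIG_FOO=value" and "# CONFIG_FOO is not set" lines in a kernel .config.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for options whose value differs.

    An option missing from one file is reported with None on that side.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


if __name__ == "__main__":
    # Illustrative fragments only; CONFIG_X86_64 is a placeholder unchanged option.
    old_cfg = "# CONFIG_EDD is not set\nCONFIG_X86_64=y\n"
    new_cfg = "CONFIG_EDD=m\nCONFIG_X86_64=y\n"
    for name, (before, after) in diff_configs(old_cfg, new_cfg).items():
        print(f"{name}: {before} -> {after}")
```

With two real config files, each changed option is a candidate to flip
back one at a time until the hang disappears.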

Comment 16 Dave Jones 2005-01-12 00:50:28 UTC

Excellent, thanks for chasing this down.  You should be able to boot
with the edd=off boot parameter.  I'll add Matt Domsch to the cc of this
bug, as he's the upstream maintainer of this code.

Comment 17 Mace Moneta 2005-01-12 02:02:26 UTC

Yup, confirming edd=off allows an unmodified kernel to boot.
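
For anyone applying the workaround to a grub.conf laid out like the
entries in comment 3, the edit amounts to appending the parameter to
the kernel line of the affected stanza.  A minimal sketch, assuming
that GRUB legacy layout (title / root / kernel / initrd); the helper
name add_kernel_param is hypothetical, and it only transforms text, it
does not write any file.

```python
def add_kernel_param(grub_conf, version, param):
    """Append `param` to the kernel line of the stanza whose title mentions `version`.

    Returns the modified text. The line is left alone if the same key
    (the part before '=') is already present on that kernel line.
    """
    key = param.split("=", 1)[0]
    out = []
    in_stanza = False
    for line in grub_conf.splitlines():
        stripped = line.strip()
        if stripped.startswith("title"):
            # A new stanza begins; remember whether it is the one we want.
            in_stanza = version in stripped
        elif in_stanza and stripped.startswith("kernel "):
            args = stripped.split()[2:]  # skip "kernel" and the image path
            if not any(a.split("=", 1)[0] == key for a in args):
                line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    conf = (
        "title Fedora Core (2.6.9-1.724_FC3)\n"
        "        root (hd0,0)\n"
        "        kernel /vmlinuz-2.6.9-1.724_FC3 ro root=/dev/VolGroup00/LogVol00\n"
        "        initrd /initrd-2.6.9-1.724_FC3.img\n"
    )
    print(add_kernel_param(conf, "2.6.9-1.724_FC3", "edd=off"), end="")
```

The same parameter can also be typed once at the grub prompt for a
single boot, which is the safer way to try it first.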

Comment 18 Mace Moneta 2005-01-12 04:29:32 UTC

Just a little follow-up information... dmidecode shows that (extracted):

 Vendor: American Megatrends Inc.
 Version: 1005.006
 Release Date: 11/29/2004
 EDD is supported

But attempting to 'modprobe edd' manually after boot returns:

 BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
 EDD information not available.

Not surprising that boot failed.  Perhaps in a case like this, EDD can
self-disable, since it's not going to be of any use.

Motherboard is ASUS K8V SE Deluxe, AMD64 3200+, latest available BIOS.

Comment 19 Matt Domsch 2005-01-12 19:23:35 UTC

Mace, can you try booting with 'edd=skipmbr' rather than 'edd=off', 
to help debug it a little further.  =off disables EDD completely, but 
=skipmbr just skips reading the boot sector of each disk, but leaves 
the EDD BIOS calls in place.
Thanks,
Matt

Comment 20 Mace Moneta 2005-01-12 21:04:12 UTC

Interesting.  Yes, edd=skipmbr works too.  Now a "modprobe edd" reports:

BIOS EDD facility v0.16 2004-Jun-25, 3 devices found

Since there's something to report now, as you request on your web
page, I'll attach the output of:

find /sys/firmware/edd -type f -not -name raw_data -print -exec cat \{\} \;
find /sys/firmware/edd -type f -name raw_data -print -exec hexdump -C \{\} \;
lspci -vv
lsmod
cat /proc/scsi/scsi
dmidecode
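
The two find commands above can also be approximated in a script, which
may be handier when collecting the data from several machines.  This is
a rough sketch, not a replacement for the requested output: the
function name collect_edd is hypothetical, and raw_data is rendered as
a plain hex string rather than in hexdump -C layout.

```python
import os


def collect_edd(root="/sys/firmware/edd"):
    """Return a sorted list of (path, text) for every file under `root`.

    Files named raw_data are rendered as space-separated hex bytes;
    everything else is decoded as text. Unreadable attributes are
    reported rather than silently skipped.
    """
    results = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError as exc:
                results.append((path, f"<unreadable: {exc}>"))
                continue
            if name == "raw_data":
                text = data.hex(" ")
            else:
                text = data.decode("ascii", errors="replace").rstrip("\n")
            results.append((path, text))
    return sorted(results)


if __name__ == "__main__":
    # Prints nothing if the edd module is not loaded or found no devices.
    for path, text in collect_edd():
        print(path)
        print(text)
```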

Comment 21 Mace Moneta 2005-01-12 21:05:30 UTC

Created attachment 109692
EDD data collection

Comment 22 Matt Domsch 2005-01-12 21:23:30 UTC

OK, so it's not the BIOS query code, but the MBR reading that is 
causing problems for you.  Good to know.  Some people have reported 
30 second pauses while reading the MBRs of each disk.  Are you sure 
you didn't just need to wait longer?

Some things I notice from the data immediately.  You've got disks on 
Promise adapters.  Nearly all EDD failure reports so far have been 
with Promise (one was on an ACARD).

Do you actually have 3 hard disks?  One 80GB IDE disk attached to the 
onboard controller as the boot disk, one 80GB IDE disk attached 
somewhere else (probably onboard controller), and one 250GB disk 
attached via USB.

The second 80GB disk is showing up on the wrong PCI address (0:2.0 
rather than where it really is, I can't tell though from this data).  
Shouldn't matter for this purpose though it does show your BIOS is at 
least somewhat buggy.

The USB controller disk is showing up at the wrong PCI address too 
(0:0.0 rather than the correct 00:0d.[0123] or 00:10.[01234]).

If you unplug the USB-connected disk, and remove 'edd=skipmbr', does 
it work?  That would narrow down the faulty BIOS component to the USB 
adapters.

Can you try with the secondary IDE controller either enabled or 
disabled?  One report of success happened when, in BIOS setup, the 
second controller was enabled, just nothing attached.  If disabled, 
it hung.

FWIW, the EDD code that reads the MBR uses bog-standard int13 fn02 
(READ SECTORS) calls to read the first sector of each BIOS-reported 
disk.  Most boot loaders only read the first disk using int13, so 
it's generally only a problem when reading disk >0.

Blacklists, as has been suggested, are really difficult to implement 
that early on in real mode kernel startup.  I'd really hate to have 
to write a DMI parser in real mode assembly, but would entertain a 
patch if someone else wanted to write such. :-)
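
For readers unfamiliar with what the first-sector read yields, here is
a user-space illustration.  It is not the real-mode code discussed
above and performs no int13 call.  It assumes the conventional MBR
layout: a 4-byte little-endian disk signature at offset 0x1B8 and the
0x55 0xAA marker in the last two bytes of the 512-byte sector.  The
function names are hypothetical.

```python
import struct

SECTOR_SIZE = 512
MBR_SIG_OFFSET = 0x1B8   # 4-byte disk signature in a conventional MBR
BOOT_MAGIC_OFFSET = 510  # 0x55 0xAA marker at the end of the sector


def mbr_signature(sector):
    """Return the 32-bit little-endian disk signature, or None if the
    sector is short or lacks the 0x55 0xAA boot marker."""
    if len(sector) < SECTOR_SIZE:
        return None
    if sector[BOOT_MAGIC_OFFSET:BOOT_MAGIC_OFFSET + 2] != b"\x55\xaa":
        return None
    (sig,) = struct.unpack_from("<I", sector, MBR_SIG_OFFSET)
    return sig


def read_first_sector(device_path):
    """Read sector 0 of a block device or disk image (needs read permission)."""
    with open(device_path, "rb") as fh:
        return fh.read(SECTOR_SIZE)


if __name__ == "__main__":
    # Build a fake sector in memory instead of touching a real disk.
    fake = bytearray(SECTOR_SIZE)
    fake[MBR_SIG_OFFSET:MBR_SIG_OFFSET + 4] = struct.pack("<I", 0x12345678)
    fake[BOOT_MAGIC_OFFSET:] = b"\x55\xaa"
    print(hex(mbr_signature(bytes(fake))))
```

The point of the sketch is only that the read itself is trivial; the
hang reported here happens inside the BIOS call, before any such
parsing could run.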

Comment 23 Stephen Adler 2005-01-12 22:49:25 UTC

A data point.

I'm having boot problems with this kernel and my system has a promise
ide raid controller. I'm not sure if it makes any sense in removing it
since it does provide raid service to 4 IDE disks installed on my
system. Could it be possible to disable EDD through boot line
parameter? (i.e. override the compiled in EDD option?)

Comment 24 Mace Moneta 2005-01-12 23:15:11 UTC

I let it sit for 60 seconds (clock over the PC), with no response on
each boot.  Since it normally boots almost instantly, that seemed long
enough.

Yes, I have three HD; two internal 80GB on different controllers
arranged as RAID-1 (including the boot partition), and an external
(USB 2.0 HiSpeed) 250GB used for archival backup and temporary
storage.  However, from the BIOS perspective, I think it considers the
two 80GB drives and the CD/DVD player bootable.  This motherboard has
multiple onboard controllers:

http://usa.asus.com/prog/spec.asp?m=K8V%20SE%20Deluxe&langs=09

I have a PCI (non-RAID) Promise controller as well.  Disabling
controllers will make the CD/DVD drive and/or the CDRW drive and/or
one of the RAID drives unusable, so while that might be an interesting
test, it's not really feasible (the system is a low usage web server).

Regardless of which component is at fault, the root problem is that
the EDD code is blocking the boot.  Rather than a blacklist, now that
I understand the function a little better, a fall-back mechanism would
seem a more practical approach.  For example, if the MBR can't be read
before a timeout expires (30 seconds seems excessive, even when the
read succeeds), fall back to the non-EDD boot code.  The same behavior
should be used for any error condition that would prevent a boot
(e.g., invalid data from the BIOS).

It would be very useful, if possible, to add some status messages to
the process.  In a normal boot they will fly by and never be seen. 
But when things go wrong they are invaluable.  Something along the lines:

"Now attempting EDD boot", followed by either a "success" or "fallback
to non-EDD".  If fallbacks aren't implemented, then a message like the
following would be useful: 

"If this is the last response you see, boot with kernel parameter
edd=off or edd=skipmbr"

It would have saved about a week of effort hunting down the cause, in
this case.

I'm not entirely clear what value reading the MBR has, when GRUB has
already booted and provided the boot disk (nth BIOS mapped drive). 
Perhaps making skipmbr the default would be a better solution?

Comment 25 Nathaniel Daw 2005-01-13 15:46:37 UTC

Disabling EDD also solves my boot problems on my older 32-bit Athlon
XP. I also have a secondary Promise RAID/IDE controller on my mobo,
with some secondary drives attached there -- no usb drives or scsi or
anything though. If it would be helpful, I would be happy to post the
debug information, or otherwise play around with things.

In my case I also don't think I am mistaking a slow read as a lockup
-- I left it for a few minutes.

Comment 26 Need Real Name 2005-05-17 05:13:01 UTC

I have an older Pentium 4 machine with an MSI mainboard based on the SiS 645 chipset.  I had the same 
problem as described here where I could not boot any FC3 update kernel since 2.6.9-681.  In each case, 
including the latest 2.6.11 update kernel, it would hang right after grub's initrd statement.  However, once 
I added the edd=skipmbr trick given here to my kernel boot params, all was well and happy.  Thanks 
bugzilla!!!

Comment 27 Dave Jones 2005-07-15 18:19:58 UTC

An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 28 Mace Moneta 2005-07-16 20:05:51 UTC

I upgraded the kernel to 2.6.12-1.1372_FC3, and tried booting without
edd=skipmbr.  The boot got to the same point, but output a string of garbage
(about 40 random characters), then hung again.  Rebooted with edd=skipmbr and it
came up fine.

Comment 29 Peter Smith 2005-07-29 19:55:46 UTC

I have a Dell PowerEdge 2300 which is not using hardware RAID, but DOES have two
Promise controllers in it.  Its boot/root device is a single SCSI HD on the
builtin AIC7XXX.  Any kernel I use after 2.6.9-1.681_FC3 does NOT boot (as
described above) _unless_ I use "edd=skipmbr".  This includes the recommended
kernel-2.6.12-1.1372_FC3 .  This is _not_ fixed in the 1372 kernel.  However, I
do have a workaround for the moment so I may finally be able to upgrade this box
from FC3 to FC4.  I have left it at the "hung" point for a very long time in the
past month or so in the hopes that it might finally continue, but I believe I'd
left it for upwards of an hour so a drive-timeout issue is NOT the case.  hth.

Comment 30 Matt Domsch 2005-08-08 20:27:31 UTC

Created attachment 117556
edd-get-disk-type-before-read.patch

I'm uploading a patch which may help, if someone who experiences a boot
failure unless they use "edd=skipmbr" on the kernel command line can test it.
This patch does a "Get Disk Type" call to the BIOS before doing the "Read
Sectors" call.  Per Ralf Brown's Interrupt List, this may be necessary for
some BIOSes.

If you are able to build a kernel with this patch and report back success
without using "edd=skipmbr", I'd very much like to hear.

Thanks,
Matt

Comment 31 Mace Moneta 2005-08-27 17:18:37 UTC

Before I tested this patch, ASUS issued a BIOS update (1007 for the K8V SE
Deluxe motherboard), that corrected this problem.  They don't document that
they've changed anything in this area, but after applying the BIOS update, I can
now boot normally without using "edd=skipmbr", using the stock
kernel-2.6.12-1.1372_FC3.

Comment 32 Dave Jones 2005-10-25 07:55:07 UTC

Matt, this seems to have fallen by the wayside.  FC3 is going to reach end of
life in a month or two, and this bug will be closed then.  Might be worth trying
to get that diff upstream if you still think it's needed.

We can always migrate this to an FC4 bug later if Peter is still around for testing.

Comment 33 Dave Jones 2006-01-16 22:35:02 UTC

This is a mass-update to all currently open Fedora Core 3 kernel bugs.

Fedora Core 3 support has transitioned to the Fedora Legacy project.
Due to the limited resources of this project, typically only
updates for new security issues are released.

As this bug isn't security related, it has been migrated to a
Fedora Core 4 bug.  Please upgrade to this newer release, and
test if this bug is still present there.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

Thank you.

Comment 34 Mace Moneta 2006-01-17 02:29:41 UTC

Unable to reproduce in FC4 with current motherboard BIOS; closing.
