52143 – fresh install using grub leaves unbootable system

Summary: fresh install using grub leaves unbootable system

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	grub
Sub Component:
Version:	7.3
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeremy Katz
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	52867
Depends On:
Blocks:

Reported:	2001-08-21 01:11 UTC by Jim Wright
Modified:	2008-05-01 15:38 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2002-01-25 10:47:18 UTC
Embargoed:

Attachments
grub.conf before I run lilo and make box bootable (540 bytes, text/plain) 2001-08-21 01:13 UTC, Jim Wright
output of tune2fs -l /dev/hda2 (1.22 KB, text/plain) 2001-08-21 19:12 UTC, Jim Wright
initrd of non-booting machine (307.33 KB, application/octet-stream) 2001-08-31 17:36 UTC, Jim Wright
lspci of system (4.38 KB, text/plain) 2001-09-04 19:47 UTC, Jim Wright
Oops dump from /var/log/messages.* (372.98 KB, text/plain) 2001-11-29 16:18 UTC, Robert Thomas

Description Jim Wright 2001-08-21 01:11:57 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.6-xfs i686)

Description of problem:
install roswell1 using cdrom, select grub, system unbootable.  used "linux
rescue", installed lilo, system booted.  that was my first experience with
grub.

just installed roswell2 using nfs on same machine.  thought I'd give grub a
second chance.  system still fails.

transcribing screen message (so I may get a keystroke or two wrong):

...[lots of stuff]...
VFS: Mounted root (ext2 filesystem).
Red Hat nash version 3.1.6 starting
Loading jbd module
Journalled Block Device driver loaded
Loading ext3 module
Mounting /proc filesystem
Creating root device
Mounting root filesystem
hda2: bad access: block=2, count=2
end_request: I/O error, dev 03:02 (hda), sector 2
EXT3-fs: unable to read superblock
mount: error 22 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
Freeing unused kernel memory: 232k freed
Kernel panic: No init found.  Try passing init= option to kernel.
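
The two numeric codes in the transcript ("error 22" from mount and "failed: 2" from pivot_root) look like ordinary errno values. A minimal Python sketch for looking them up, assuming they really are standard Linux errno numbers; the helper name describe_errno is made up for illustration:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Codes copied from the boot transcript above.
for label, code in (("mount ext3", 22), ("pivot_root", 2)):
    name, message = describe_errno(code)
    print(f"{label}: errno {code} = {name} ({message})")
```

Under that assumption, 22 is EINVAL and 2 is ENOENT, which would be consistent with the superblock read failing first and /sysroot then having nothing mounted on it to pivot into.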

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. use grub when installing
	

Actual Results:  system unbootable

Expected Results:  system boots

Additional info:

I don't mean to be unreasonably unfair to grub.  But in my two attempts to
use grub, both have failed.

Comment 1 Jim Wright 2001-08-21 01:13:13 UTC

Created attachment 28574 [details]
grub.conf before I run lilo and make box bootable

Comment 2 Jeremy Katz 2001-08-21 01:17:01 UTC

More importantly, what type of hardware is this on?

(also changing component to Red Hat Linux Beta, RC1 and restricting permissions
since this isn't public yet)

Comment 3 Jim Wright 2001-08-21 01:28:43 UTC

just did a "linux rescue" via nfs, and ran lilo.  (also snagged
/boot/grub/grub.conf)
no other changes.  rebooted and it yields exactly the same problem!  So this
time it is not grub's fault.  I don't obviously see what the problem might be.
The labels on the filesystems match the fstab.  the root filesystem is in fact
/dev/hda2.  booting linux rescue has no trouble mounting everything, but
I see that it mounts as ext2 not ext3.

hardware is supermicro p6sba,  maxtor 5t060h6, 256 mb memory.

I'll leave the system like this in case you folks have something to suggest.

Comment 4 Jeremy Katz 2001-08-21 17:56:40 UTC

If you run tune2fs -l /dev/hda2 from rescue mode, does the filesystem actually
have a journal on it?
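
One way to answer that from the tune2fs output is to look for has_journal on the "Filesystem features" line. A minimal Python sketch, assuming the output has been saved as text; the function name and the sample input are illustrative only and are not taken from the attachment on this bug:

```python
def has_journal(tune2fs_output):
    """Return True if the 'Filesystem features' line lists has_journal."""
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return "has_journal" in value.split()
    return False


# Illustrative input only; not the contents of the attachment on this bug.
sample = """\
Filesystem volume name:   /
Filesystem features:      has_journal filetype sparse_super
"""
print(has_journal(sample))
```

If has_journal is missing, the filesystem is plain ext2 and an ext3 mount from the initrd would be expected to fail.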

Comment 5 Jim Wright 2001-08-21 19:12:38 UTC

Created attachment 28778 [details]
output of tune2fs -l /dev/hda2

Comment 6 Glen Foster 2001-08-21 20:26:06 UTC

We (Red Hat) really need to fix this before next release.

Comment 7 Jeremy Katz 2001-08-23 20:13:59 UTC

If you try to access the initrd, can you loopback mount it?  It almost looks
like the initrd might be on a bad sector of the disk

Comment 8 Jim Wright 2001-08-24 01:36:03 UTC

md5sum of initrd-2.4.7-2.img works fine.

in "linux rescue"
	chroot /mnt/sysimage
	zcat /boot/initrd-2.4.7-2.img > /tmp/in
	mkdir /tmp/i
	mount -o loop /tmp/in /tmp/i
works fine.

Comment 9 Jeremy Katz 2001-08-24 22:15:38 UTC

Arjan, any ideas on this one?

Comment 10 Matt Wilson 2001-08-28 14:32:35 UTC

please mount the initrd and attach the linuxrc from it.

Comment 11 Arjan van de Ven 2001-08-28 14:34:42 UTC

Also, is this a board with Promise Fasttrak RAID or Highpoint 370 RAID?

Comment 12 Bernhard Rosenkraenzer 2001-08-30 09:27:15 UTC

*** Bug 52867 has been marked as a duplicate of this bug. ***

Comment 13 Jim Wright 2001-08-31 17:15:28 UTC

No promise fasttrak
No highpoint 370

I'll attach the initrd, rather than just bits of it.

A bit odd to me that it mounts /proc, and then echoes a message saying it will
mount /proc.

Otherwise looks OK to me.

Comment 14 Jim Wright 2001-08-31 17:36:26 UTC

Created attachment 30375 [details]
initrd of non-booting machine

Comment 15 Arjan van de Ven 2001-09-03 10:27:58 UTC

I've seen several such reports; so far they were all fixed in the 2.4.7-6
kernel. Could you attach an lspci anyway?

Comment 16 Jim Wright 2001-09-04 19:47:57 UTC

Created attachment 30874 [details]
lspci of system

Comment 17 Robert Thomas 2001-11-21 06:09:31 UTC

I hope everyone will bear with me on this one (I really did read everything
above, even the attachments).

I am having this problem with RH-7.2 on 3 out of 5 machines to various degrees,
from the iso CD's on the ftp site.  I also see other problems that seem to be
related (and are very annoying BTW).

Machines with no problems so far: vanilla Dell machines with IDE disks, root
partition is near the beginning (near meaning < 10th cylinder).  Very light use,
one is a local anonymous ftp mirror for local updates (util.census.gov) (no
sense in burning up Redhat's or Sun's servers, I maintain a LOT of RH and Sun
machines).

Machines with problems: All have SCSI disks, different SCSI controllers (Tekram
390, AHA 16390, aic7xxx), with different CPU's - Intel coppermine, AMD athlon,
the Intel machines are SMP.  Machines are Dell 6400 X 4cpu's, 550 MHz with
Megaraid (120 gig raid), 2 mirrored 9 gig drives on the aic7xxx, 500 Meg ram. 
The other intel is a 2X700 generic machine with the 16390 controller to a 36 gig
disk, 256 Meg ram.  My personal machine is the AMD-600 with 500 Meg of ram, 
2-ide and 1 SCSI disk.  All of the problem machines were upgraded from at least
6.2 to 7.0 to 7.1.  My machine goes back to 5.0 (and beyond, but that is when I
did a clean install again).


Problems: 

1) The Dell 6400 - The machine upgraded very well from 7.1, booted and
everything was nice in paradise.  I downloaded all the new patches, one of which
was a kernel update (2.4.9-13).   The patch even updated the lilo file, but when
it came up it would consistently get to the pivotroot problem above.  I spent
more time than I'll admit trying to figure out why the previous version -
2.4.7-10 continues to boot and the new one won't.  This includes looking at the
initrd stuff and building new initrd's by hand.  NOTHING worked.  The machine
continues to run the old kernel.  I didn't see any difference in the module
loads, or any other script I could lay my hands on.

2) The generic SMP machine.  When I upgraded this machine, it complained about
the partition table being a bit wacky... but it said it wasn't a fatal error and
that I should continue.  I did and it upgraded.  Then I got a fatal error
because I tried to upgrade all the file systems and I had /var/log mounted on
top of the mounted partition /var.  It didn't like that so I reran the
installation and DIDN'T upgrade /var/log... everything went fine.  Next I saw
some assorted Oops messages and also memory problems like (from a dmesg
command):

00:0c.0: 3Com PCI 3c905B Cyclone 100baseTx at 0xa800. Vers LK1.1.16
PCI: Setting latency timer of device 00:0c.0 to 64
swap_free: Unused swap offset entry 00400000
VM: killing process python
swap_free: Unused swap offset entry 00400000
XD: Loaded as a module.
Trying to free nonexistent resource <00000320-00000323>
XD: Loaded as a module.
Trying to free nonexistent resource <00000320-00000323>


Here is an Oops message after I did a "rm -rf oldstuff" Oldstuff had about 30
files in it restored the day before from tape (was a reiserfs file system).

------------[ cut here ]------------
kernel BUG at page_alloc.c:87!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c012bc7c>]    Not tainted
EFLAGS: 00010282
eax: 0000001f   ebx: c12393d0   ecx: 00000001   edx: 00002041
esi: c12393d0   edi: 00000000   ebp: 00000000   esp: c1825f70
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 5, stackpage=c1825000)
Stack: c022e751 00000057 c12393d0 00000080 c0133d92 00000000 c12393d0 c12393f8
       c12393d0 00000000 00000007 c012b169 00000000 00000000 000003dc 000049d8
       00000000 00000006 000000c0 00000000 0008e000 c012b784 000000c0 00000000
Call Trace: [<c022e751>] .rodata.str1.1 [kernel] 0x1fcc
[<c0133d92>] try_to_release_page [kernel] 0x3a
[<c012b169>] page_launder [kernel] 0x5c5
[<c012b784>] do_try_to_free_pages [kernel] 0x10
[<c012b811>] kswapd [kernel] 0x51
[<c0105000>] stext [kernel] 0x0
[<c010566e>] kernel_thread [kernel] 0x26
[<c012b7c0>] kswapd [kernel] 0x0

Code: 0f 0b 31 c0 0f b3 46 18 19 c0 85 c0 75 15 68 f1 01 00 00 68


Now, here is a goody that seems to tie the above with EXT3:

Nov 15 00:46:04 liberty kernel: Unable to handle kernel paging request at
virtual address 00400000
Nov 15 00:46:04 liberty kernel:  printing eip:
Nov 15 00:46:04 liberty kernel: d083edcb
Nov 15 00:46:04 liberty kernel: *pde = 00000000
Nov 15 00:46:04 liberty kernel: Oops: 0000
Nov 15 00:46:04 liberty kernel: CPU:    0
Nov 15 00:46:04 liberty kernel: EIP:   
0010:[3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1413685/96]   
Not tainted
Nov 15 00:46:04 liberty kernel: EIP:    0010:[<d083edcb>]    Not tainted
Nov 15 00:46:04 liberty kernel: EFLAGS: 00010206
Nov 15 00:46:04 liberty kernel: eax: 00000000   ebx: 00400000   ecx: c28a5300
edx: 00000000
Nov 15 00:46:04 liberty kernel: esi: ce6bdc00   edi: 00000001   ebp: 00000007
esp: c1825f3c
Nov 15 00:46:04 liberty kernel: ds: 0018   es: 0018   ss: 0018
Nov 15 00:46:04 liberty kernel: Process kswapd (pid: 5, stackpage=c1825000)
Nov 15 00:46:04 liberty kernel: Stack: c28a5300 ce6bdc00 d083c9a0 c28a5300
ce6bdc00 ce6bdc00 d083ca34 ce6bdc00
Nov 15 00:46:04 liberty kernel:        c1825f60 00000000 c1129b84 00000080
00000000 d084a140 cf9ba200 c1129b84
Nov 15 00:46:04 liberty kernel:        00000080 c0133d92 c1129b84 00000080
00000000 c1129b84 c012af92 c1129b84
Nov 15 00:46:04 liberty kernel: Call Trace:
[3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1422944/96]
journal_force_commit_R730a59d9 [jbd] 0x21c
Nov 15 00:46:04 liberty kernel: Call Trace: [<d083c9a0>]
journal_force_commit_R730a59d9 [jbd] 0x21c
Nov 15 00:46:05 liberty kernel:
[3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1422796/96]
journal_try_to_free_buffers_R9ddb5382 [jbd] 0x6c
Nov 15 00:46:05 liberty kernel: [<d083ca34>]
journal_try_to_free_buffers_R9ddb5382 [jbd] 0x6c
Nov 15 00:46:05 liberty kernel:
[3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1367744/96]
__insmod_ext3_S.text_L40820 [ext3] 0x40e0
Nov 15 00:46:05 liberty kernel: [<d084a140>] __insmod_ext3_S.text_L40820 [ext3]
0x40e0
Nov 15 00:46:05 liberty kernel: [try_to_release_page+58/88] try_to_release_page
[kernel] 0x3a
Nov 15 00:46:05 liberty kernel: [<c0133d92>] try_to_release_page [kernel] 0x3a
Nov 15 00:46:05 liberty kernel: [page_launder+1006/2248] page_launder [kernel]
0x3ee
Nov 15 00:46:05 liberty kernel: [<c012af92>] page_launder [kernel] 0x3ee
Nov 15 00:46:05 liberty kernel: [do_try_to_free_pages+16/76]
do_try_to_free_pages [kernel] 0x10
Nov 15 00:46:05 liberty kernel: [<c012b784>] do_try_to_free_pages [kernel] 0x10
Nov 15 00:46:05 liberty kernel: [kswapd+81/228] kswapd [kernel] 0x51
Nov 15 00:46:05 liberty kernel: [<c012b811>] kswapd [kernel] 0x51
Nov 15 00:46:06 liberty kernel: [_stext+0/40] stext [kernel] 0x0
Nov 15 00:46:06 liberty kernel: [<c0105000>] stext [kernel] 0x0
Nov 15 00:46:06 liberty kernel: [kernel_thread+38/48] kernel_thread [kernel]
0x26
Nov 15 00:46:06 liberty kernel: [<c010566e>] kernel_thread [kernel] 0x26
Nov 15 00:46:06 liberty kernel: [kswapd+0/228] kswapd [kernel] 0x0
Nov 15 00:46:06 liberty kernel: [<c012b7c0>] kswapd [kernel] 0x0
Nov 15 00:46:06 liberty kernel:
Nov 15 00:46:06 liberty kernel:
Nov 15 00:46:06 liberty kernel: Code: 8b 33 c7 41 24 00 00 00 00 89 42 2c 8b 41
28 8b 51 2c 89 42
Nov 15 00:47:05 liberty kernel:  <2>EXT3-fs error (device sd(8,9)):
ext3_free_blocks: bit already cleared for block 5111178


I have seen this a number of times - the bit is already cleared.

This brings me to my personal machine - the AMD Athlon machine.

Everything was working just fine; the machine hadn't been rebooted since
10/22.  Some staroffice stuff got a bit flaky and locked X up, and killing X
left a lot of turds running, so I simply typed reboot (yea yea... after su to
root of course).  I couldn't get the machine into a usable state for about 5
hours.  Sure it would boot, then I would get the init error.  So I tried other
kernels, they would bring it up but couldn't deal with the ext3 stuff.  I tried
updating with the original CD's... with an enterprise kernel, even the debug
kernel that got by the init problem but then it got stuck on the switching of
the root.  The debug kernel put me into a debug session... but did little more
than that.  I am running a 2.4.3 kernel so I can look for help.

My guess is that it has something to do with journaling and kernel paging.  The
SMP machines seem to have an issue with bits already being cleared.  One thing
that seems consistent is that ext3 is in the mix.  The machines that I haven't
upgraded to ext3 work fine.  The more ram the machine has the less I see it.  I
also wonder if it has something to do with a memory leak as it seems to clobber
my ethernet driver (above, the least RAM).

For now I am rolling all my File systems back to ext2 with the exception of /,
because if I move that back to ext2 it won't come up saying that it isn't an
ext3 file system even if the /etc/fstab is set to ext2.  Seems to be bent on not
allowing anything else.

-Robert Thomas
 U.S. Census Bureau
 (301) 763-5711
*FOB-3, Room 1364
 Washington, DC 20033

Comment 18 Jeremy Katz 2001-11-29 01:49:27 UTC

thoma041, could you please file the oops as a separate bug against
the kernel.  For the machines which don't boot, do they have multiple scsi
adapters?  If so, could you try with the boot images at
http://people.redhat.com/~katzj/bootimages/ and see if they help any?

Comment 19 Robert Thomas 2001-11-29 16:18:13 UTC

Created attachment 39064 [details]
Oops dump from /var/log/messages.*

Comment 20 Robert Thomas 2001-11-29 16:27:47 UTC

I think I know how to solve this bug's booting problem.  I have noticed that mkinitrd seems
to always set the device number that gets put into /proc/sys/kernel/real-root-dev to
0x0100.  The number may be different; yesterday I updated to a new kernel with the newer version
of the mkinitrd tools for 7.1 and it did this.  The right number is 0x080a; it is a Dell 2X machine.
If the script grabbed the number from the running machine like I did, it should work (I made 2 boot
tags, the original and one that had my alternative initrd file).  It turned out not to be a grub
error.
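
For anyone comparing the two real-root-dev values, they can be split into major and minor numbers. A minimal Python sketch, assuming the 16-bit device number layout used by 2.4 kernels (major in the high byte, minor in the low byte); split_dev is a made-up helper name:

```python
def split_dev(number):
    """Split a 16-bit Linux 2.4-style device number into (major, minor)."""
    return (number >> 8) & 0xFF, number & 0xFF


# Values quoted in the comment above.
for value in (0x0100, 0x080A):
    major, minor = split_dev(value)
    print(f"0x{value:04x} -> major {major}, minor {minor}")
```

Major 1 is the RAM disk driver and major 8 is SCSI disks in the kernel's device list, so 0x0100 would point the root at the first RAM disk while 0x080a would be partition 10 on the first SCSI disk.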

Comment 21 Jeremy Katz 2002-01-22 01:13:22 UTC

Does this work any better using the updated grub packages at
http://people.redhat.com/katzj/grub/ ?  (you'll need to install the packages and
then run '/sbin/grub-install /dev/of/mbr')

Comment 22 Jim Wright 2002-01-25 10:47:12 UTC

After some initial playing with it, we determined that grub was not for us.  (I
never have understood what issues caused redhat to switch their official
blessing from lilo to grub.)  I've installed RH72 hundreds of times on dozens of
machines since then, all using lilo, and have not seen this error reoccur.

Let me know if you think this particular machine is unusual.  If so I can try
out grub.  Otherwise, given time constraints, it is unlikely I'll be playing
around with grub.

Comment 23 Jeremy Katz 2002-01-25 15:31:01 UTC

GRUB is technically a much better boot loader.  It can do things like reading
filesystems that boot loaders on other architectures have been able to do for
years.  This reduces the probability of user error when doing things like
compiling new kernels, etc.  Also, large parts of it are written in C instead of
in asm, which makes it a lot easier to even think about doing things like adding
native software RAID 5 support at some point in the future.

I understand time constraints, though, believe me...  there have been some
similar reports that 0.91 seems to fix, and there's another one for which I need
to make sure my patch compiles and then see if it works.  If you could just quickly
try once during the next beta cycle, that would be great; please reopen this or file
a new bug if it still appears to be a problem.
