From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.6-xfs i686)

Description of problem:
install roswell1 using cdrom, select grub, system unbootable. used "linux rescue", installed lilo, system booted. that was my first experience with grub.

just installed roswell2 using nfs on same machine. thought I'd give grub a second chance. system still fails.

transcribing screen message (so I may get a keystroke or two wrong):

...[lots of stuff]...
VFS: Mounted root (ext2 filesystem).
Red Hat nash version 3.1.6 starting
Loading jbd module
Journalled Block Device driver loaded
Loading ext3 module
Mounting /proc filesystem
Creating root device
Mounting root filesystem
hda2: bad access: block=2, count=2
end_request: I/O error, dev 03:02 (hda), sector 2
EXT3-fs: unable to read superblock
mount: error 22 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
Freeing unused kernel memory: 232k freed
Kernel panic: No init found. Try passing init= option to kernel.

Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:
1. use grub when installing

Actual Results: system unbootable

Expected Results: system boots

Additional info:
I don't mean to be unreasonably unfair to grub. But in my two attempts to use grub, both have failed.
Created attachment 28574 [details] grub.conf before I run lilo and make box bootable
More importantly, what type of hardware is this on? (also changing component to Red Hat Linux Beta, RC1 and restricting permissions since this isn't public yet)
Just did a "linux rescue" via nfs and ran lilo (also snagged /boot/grub/grub.conf). No other changes. Rebooted, and it yields exactly the same problem! So this time it is not grub's fault.

I don't see anything obvious that might be the problem. The labels on the filesystems match the fstab, and the root filesystem is in fact /dev/hda2. Booting linux rescue has no trouble mounting everything, but I see that it mounts the filesystems as ext2, not ext3.

Hardware is a Supermicro P6SBA, Maxtor 5T060H6, 256 MB memory. I'll leave the system like this in case you folks have something to suggest.
If you run tune2fs -l /dev/hda2 from rescue mode, does the filesystem actually have a journal on it?
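For anyone repeating this check on several machines, here is a minimal Python sketch of what to look for in the tune2fs -l output: an ext3 filesystem lists has_journal on its "Filesystem features" line. The function name and the two sample strings are made up for illustration; they are not taken from the attachment on this bug.

```python
def has_journal(tune2fs_output):
    """Return True if the 'Filesystem features' line lists has_journal."""
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "filesystem features":
            return "has_journal" in value.split()
    return False


# Illustrative fragments only (assumed shape of `tune2fs -l` output).
sample_ext3 = (
    "Filesystem volume name:   /\n"
    "Filesystem features:      has_journal filetype sparse_super\n"
)
sample_ext2 = "Filesystem features:      filetype sparse_super\n"

print(has_journal(sample_ext3))
print(has_journal(sample_ext2))
```

In practice the text would come from running tune2fs -l /dev/hda2 in rescue mode and feeding the captured output to the function.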
Created attachment 28778 [details] output of tune2fs -l /dev/hda2
We (Red Hat) really need to fix this before next release.
If you try to access the initrd, can you loopback mount it? It almost looks like the initrd might be on a bad sector of the disk
md5sum of initrd-2.4.7-2.img works fine. In "linux rescue":

chroot /mnt/sysimage
zcat /boot/initrd-2.4.7-2.img > /tmp/in
mkdir /tmp/i
mount -o loop /tmp/in /tmp/i

works fine.
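The same sanity check can be sketched without root or a loopback mount. This assumes the initrd is a gzip-compressed ext2 image, which is what the zcat and loop mount steps in this thread imply. The ext2 superblock starts at byte 1024 and its magic number 0xEF53 sits 56 bytes into it; the function names are hypothetical.

```python
import gzip
import struct

EXT2_MAGIC = 0xEF53        # s_magic value shared by ext2 and ext3
MAGIC_OFFSET = 1024 + 56   # superblock starts at byte 1024, s_magic is 56 bytes in


def looks_like_ext2(image):
    """Check the ext2 magic number in an uncompressed filesystem image."""
    if len(image) < MAGIC_OFFSET + 2:
        return False
    (magic,) = struct.unpack_from("<H", image, MAGIC_OFFSET)
    return magic == EXT2_MAGIC


def initrd_is_ext2(path):
    """Decompress the start of a gzip'ed initrd and check for an ext2 superblock."""
    with gzip.open(path, "rb") as f:
        head = f.read(MAGIC_OFFSET + 2)
    return looks_like_ext2(head)
```

Calling initrd_is_ext2("/boot/initrd-2.4.7-2.img") only reads the first few kilobytes. A read error or a gzip error there would point at a damaged file, while a False result would mean the image is not ext2 at all.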
Arjan, any ideas on this one?
please mount the initrd and attach the linuxrc from it.
Also, is this a board with Promise Fasttrak RAID or Highpoint 370 RAID?
*** Bug 52867 has been marked as a duplicate of this bug. ***
No Promise Fasttrak.
No Highpoint 370.

I'll attach the initrd, rather than just bits of it. A bit odd to me that it mounts /proc, and then echoes a message saying it will mount /proc. Otherwise looks OK to me.
Created attachment 30375 [details] initrd of non-booting machine
I've seen several such reports; so far they were all fixed in the 2.4.7-6 kernel. Could you attach an lspci anyway?
Created attachment 30874 [details] lspci of system
I hope everyone will bear with me on this one (I really did read everything above, even the attachments). I am having this problem with RH-7.2 on 3 out of 5 machines to various degrees, from the iso CD's on the ftp site. I also see other problems that seem to be related (and are very annoying BTW).

Machines with no problems so far: vanilla Dell machines with IDE disks, root partition is near the beginning (near meaning < 10th cylinder). Very light use, one is a local anonymous ftp mirror for local updates (util.census.gov) (no sense in burning up Redhat's or Sun's servers, I maintain a LOT of RH and Sun machines).

Machines with problems: All have SCSI disks, different SCSI controllers (Tekram 390, AHA 16390, aic7xxx), with different CPU's - Intel coppermine, AMD athlon, the Intel machines are SMP. Machines are Dell 6400 X 4cpu's, 550 MHz with Megaraid (120 gig raid), 2 mirrored 9 gig drives on the aic7xxx, 500 Meg ram. The other intel is a 2X700 generic machine with the 16390 controller to a 36 gig disk, 256 Meg ram. My personal machine is the AMD-600 with 500 Meg of ram, 2-ide and 1 SCSI disk.

All of the problem machines were upgraded from at least 6.2 to 7.0 to 7.1. My machine goes back to 5.0 (and beyond, but that is when I did a clean install again).

Problems:

1) The Dell 6400 - The machine upgraded very well from 7.1, booted and everything was nice in paradise. I downloaded all the new patches, one of which was a kernel update (2.4.9-13). The patch even updated the lilo file, but when it came up it would consistently get to the pivotroot problem above. I spent more time than I'll admit trying to figure out why the previous version - 2.4.7-10 continues to boot and the new one won't. This includes looking at the initrd stuff and building new initrd's from hand. NOTHING worked. The machine continues to run the old kernel. I didn't see any difference in the module loads, or any other script I could lay my hands on.

2) The generic SMP machine.
When I upgraded this machine, it complained about the partition table being a bit wacky... but it said it wasn't a fatal error and that I should continue. I did and it upgraded. Then I got a fatal error because I tried to upgrade all the file systems and I had /var/log mounted on top of the mounted partition /var. It didn't like that so I reran the installation and DIDN'T upgrade /var/log... everything went fine.

Next I saw some assorted Oops messages and also memory problems like (from a dmesg command):

00:0c.0: 3Com PCI 3c905B Cyclone 100baseTx at 0xa800. Vers LK1.1.16
PCI: Setting latency timer of device 00:0c.0 to 64
swap_free: Unused swap offset entry 00400000
VM: killing process python
swap_free: Unused swap offset entry 00400000
XD: Loaded as a module.
Trying to free nonexistent resource <00000320-00000323>
XD: Loaded as a module.
Trying to free nonexistent resource <00000320-00000323>

Here is an Oops message after I did a "rm -rf oldstuff". Oldstuff had about 30 files in it, restored the day before from tape (was a reiserfs file system).

------------[ cut here ]------------
kernel BUG at page_alloc.c:87!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c012bc7c>] Not tainted
EFLAGS: 00010282
eax: 0000001f ebx: c12393d0 ecx: 00000001 edx: 00002041
esi: c12393d0 edi: 00000000 ebp: 00000000 esp: c1825f70
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=c1825000)
Stack: c022e751 00000057 c12393d0 00000080 c0133d92 00000000 c12393d0 c12393f8
c12393d0 00000000 00000007 c012b169 00000000 00000000 000003dc 000049d8
00000000 00000006 000000c0 00000000 0008e000 c012b784 000000c0 00000000
Call Trace: [<c022e751>] .rodata.str1.1 [kernel] 0x1fcc
[<c0133d92>] try_to_release_page [kernel] 0x3a
[<c012b169>] page_launder [kernel] 0x5c5
[<c012b784>] do_try_to_free_pages [kernel] 0x10
[<c012b811>] kswapd [kernel] 0x51
[<c0105000>] stext [kernel] 0x0
[<c010566e>] kernel_thread [kernel] 0x26
[<c012b7c0>] kswapd [kernel] 0x0
Code: 0f 0b 31 c0 0f b3 46 18 19 c0 85 c0 75 15 68 f1 01 00 00 68

Now, here is a goody that seems to tie the above with EXT3:

Nov 15 00:46:04 liberty kernel: Unable to handle kernel paging request at virtual address 00400000
Nov 15 00:46:04 liberty kernel: printing eip:
Nov 15 00:46:04 liberty kernel: d083edcb
Nov 15 00:46:04 liberty kernel: *pde = 00000000
Nov 15 00:46:04 liberty kernel: Oops: 0000
Nov 15 00:46:04 liberty kernel: CPU: 0
Nov 15 00:46:04 liberty kernel: EIP: 0010:[3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1413685/96] Not tainted
Nov 15 00:46:04 liberty kernel: EIP: 0010:[<d083edcb>] Not tainted
Nov 15 00:46:04 liberty kernel: EFLAGS: 00010206
Nov 15 00:46:04 liberty kernel: eax: 00000000 ebx: 00400000 ecx: c28a5300 edx: 00000000
Nov 15 00:46:04 liberty kernel: esi: ce6bdc00 edi: 00000001 ebp: 00000007 esp: c1825f3c
Nov 15 00:46:04 liberty kernel: ds: 0018 es: 0018 ss: 0018
Nov 15 00:46:04 liberty kernel: Process kswapd (pid: 5, stackpage=c1825000)
Nov 15 00:46:04 liberty kernel: Stack: c28a5300 ce6bdc00 d083c9a0 c28a5300 ce6bdc00 ce6bdc00 d083ca34 ce6bdc00
Nov 15 00:46:04 liberty kernel: c1825f60 00000000 c1129b84 00000080 00000000 d084a140 cf9ba200 c1129b84
Nov 15 00:46:04 liberty kernel: 00000080 c0133d92 c1129b84 00000080 00000000 c1129b84 c012af92 c1129b84
Nov 15 00:46:04 liberty kernel: Call Trace: [3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1422944/96] journal_force_commit_R730a59d9 [jbd] 0x21c
Nov 15 00:46:04 liberty kernel: Call Trace: [<d083c9a0>] journal_force_commit_R730a59d9 [jbd] 0x21c
Nov 15 00:46:05 liberty kernel: [3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1422796/96] journal_try_to_free_buffers_R9ddb5382 [jbd] 0x6c
Nov 15 00:46:05 liberty kernel: [<d083ca34>] journal_try_to_free_buffers_R9ddb5382 [jbd] 0x6c
Nov 15 00:46:05 liberty kernel: [3c59x:__insmod_3c59x_O/lib/modules/2.4.9-13/kernel/drivers/net/3c+-1367744/96] __insmod_ext3_S.text_L40820 [ext3] 0x40e0
Nov 15 00:46:05 liberty kernel: [<d084a140>] __insmod_ext3_S.text_L40820 [ext3] 0x40e0
Nov 15 00:46:05 liberty kernel: [try_to_release_page+58/88] try_to_release_page [kernel] 0x3a
Nov 15 00:46:05 liberty kernel: [<c0133d92>] try_to_release_page [kernel] 0x3a
Nov 15 00:46:05 liberty kernel: [page_launder+1006/2248] page_launder [kernel] 0x3ee
Nov 15 00:46:05 liberty kernel: [<c012af92>] page_launder [kernel] 0x3ee
Nov 15 00:46:05 liberty kernel: [do_try_to_free_pages+16/76] do_try_to_free_pages [kernel] 0x10
Nov 15 00:46:05 liberty kernel: [<c012b784>] do_try_to_free_pages [kernel] 0x10
Nov 15 00:46:05 liberty kernel: [kswapd+81/228] kswapd [kernel] 0x51
Nov 15 00:46:05 liberty kernel: [<c012b811>] kswapd [kernel] 0x51
Nov 15 00:46:06 liberty kernel: [_stext+0/40] stext [kernel] 0x0
Nov 15 00:46:06 liberty kernel: [<c0105000>] stext [kernel] 0x0
Nov 15 00:46:06 liberty kernel: [kernel_thread+38/48] kernel_thread [kernel] 0x26
Nov 15 00:46:06 liberty kernel: [<c010566e>] kernel_thread [kernel] 0x26
Nov 15 00:46:06 liberty kernel: [kswapd+0/228] kswapd [kernel] 0x0
Nov 15 00:46:06 liberty kernel: [<c012b7c0>] kswapd [kernel] 0x0
Nov 15 00:46:06 liberty kernel:
Nov 15 00:46:06 liberty kernel:
Nov 15 00:46:06 liberty kernel: Code: 8b 33 c7 41 24 00 00 00 00 89 42 2c 8b 41 28 8b 51 2c 89 42
Nov 15 00:47:05 liberty kernel: <2>EXT3-fs error (device sd(8,9)): ext3_free_blocks: bit already cleared for block 5111178

I have seen this a number of times - the bit is already cleared.

This brings me to my personal machine - the AMD Athlon machine. Everything was working just fine, the machine hadn't been rebooted since 10/22. Some staroffice stuff got a bit flaky and locked X up, and killing X left a lot of turds running, so I simply typed reboot (yea yea... after su to root of course). I couldn't get the machine into a usable state for about 5 hours. Sure it would boot, then I would get the init error. So I tried other kernels; they would bring it up but couldn't deal with the ext3 stuff. I tried updating with the original CD's... with an enterprise kernel, even the debug kernel that got by the init problem but then got stuck on the switching of the root. The debug kernel put me into a debug session... but did little more than that. I am running a 2.4.3 kernel so I can look for help.

My guess is that it has something to do with journaling and kernel paging. The SMP machines seem to have an issue with bits already being cleared. One thing that seems consistent is that ext3 is in the mix. The machines that I haven't upgraded to ext3 work fine. The more ram the machine has, the less I see it. I also wonder if it has something to do with a memory leak, as it seems to clobber my ethernet driver (above, on the machine with the least RAM).

For now I am rolling all my file systems back to ext2 with the exception of /, because if I move that back to ext2 it won't come up, saying that it isn't an ext3 file system even if the /etc/fstab is set to ext2. Seems to be bent on not allowing anything else.

-Robert Thomas
U.S. Census Bureau
(301) 763-5711
*FOB-3, Room 1364
Washington, DC 20033
thoma041, could you please file the oops as a separate bug against the kernel? For the machines which don't boot, do they have multiple SCSI adapters? If so, could you try with the boot images at http://people.redhat.com/~katzj/bootimages/ and see if they help any?
Created attachment 39064 [details] Oops dump from /var/log/messages.*
I think I know how to solve this bug's booting problem. I have noticed that mkinitrd seems to always set the device number that gets put into /proc/sys/kernel/real-root-dev to 0x0100. The number may be different; yesterday I updated to a new kernel with the newer version of the mkinitrd tools for 7.1 and it did this. The right number is 0x080a; it is a Dell 2X machine. If the script grabbed the number from the running machine like I did, it should work (I made 2 boot tags, the original and one that had my alternative initrd file). Turned out not to be a grub error.
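To make the two numbers above easier to read, here is a small Python sketch that splits such a value into major and minor device numbers. It assumes the 16-bit device number layout used by 2.4 kernels (high byte major, low byte minor); decode_dev is a name made up for this note.

```python
def decode_dev(number):
    """Split a 16-bit device number into (major, minor)."""
    major = (number >> 8) & 0xFF
    minor = number & 0xFF
    return major, minor


# 0x0100 is major 1 (RAM disk), minor 0, i.e. /dev/ram0.
print(decode_dev(0x0100))  # (1, 0)

# 0x080a is major 8 (SCSI disk), minor 10, i.e. /dev/sda10.
print(decode_dev(0x080A))  # (8, 10)
```

Read this way, 0x0100 leaves the root device pointing at the RAM disk, while 0x080a names the tenth partition on the first SCSI disk.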
Does this work any better using the updated grub packages at http://people.redhat.com/katzj/grub/ ? (you'll need to install the packages and then run '/sbin/grub-install /dev/of/mbr')
After some initial playing with it, we determined that grub was not for us. (I never have understood what issues caused redhat to switch their official blessing from lilo to grub.) I've installed RH72 hundreds of times on dozens of machines since then, all using lilo, and have not seen this error reoccur. Let me know if you think this particular machine is unusual. If so I can try out grub. Otherwise, given time constraints, it is unlikely I'll be playing around with grub.
GRUB is technically a much better boot loader. It can do things like reading filesystems that boot loaders on other architectures have been able to do for years. This reduces the probability of user error when doing things like compiling new kernels, etc. Also, large parts of it are written in C instead of in asm, which makes it a lot easier to even think about doing things like adding native software RAID 5 support at some point in the future.

I understand time constraints, though, believe me... There have been some similar reports that 0.91 seems to fix, and there's another one for which I need to make sure my patch compiles and then see if it works. If you could just quickly try once during the next beta cycle, that would be great; reopen this / file a new bug if it still appears to be a problem.