Bug 344181 - 2.6.23.1-23.fc8 is unbootable, probable file system bug
Summary: 2.6.23.1-23.fc8 is unbootable, probable file system bug
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 8
Hardware: x86_64
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-10-21 03:39 UTC by Joshua Rosen
Modified: 2008-08-02 23:40 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-27 04:20:27 UTC
Type: ---
Embargoed:


Attachments
Boot screen (3.95 MB, image/jpeg)
2007-10-21 21:17 UTC, Joshua Rosen
Error screen snapshot (349.87 KB, image/jpeg)
2007-10-27 01:57 UTC, Joshua Rosen

Description Joshua Rosen 2007-10-21 03:39:41 UTC
Description of problem:
I have a 64-bit F8 Test3 system that works fine with the 2.6.23-6.fc8 kernel but
won't boot with the 2.6.23.1-23.fc8 kernel. During boot it fails the file system
check. Running fsck on all of the partitions reveals no problems. If the
2.6.23-6.fc8 kernel is selected from the grub menu, the system boots fine.

Version-Release number of selected component (if applicable):


How reproducible: 100%


Steps to Reproduce:
1. Select the 2.6.23.1-23.fc8 kernel during boot
2. The boot fails the file system check
  
Actual results:


Expected results:


Additional info:

Comment 1 Chuck Ebbert 2007-10-21 15:41:54 UTC
"Fails the file system check" is not very useful information. What does it print?

If you can, take a picture of the screen with a digital camera and attach that.


Comment 2 Joshua Rosen 2007-10-21 21:17:39 UTC
Created attachment 233671
Boot screen

This is a little fuzzy but it's readable. The error message looks similar to an
fsck failure except that it doesn't specify a partition. Running fsck manually
on all of the partitions finds no errors.

Comment 3 Joshua Rosen 2007-10-21 21:26:01 UTC
I've uploaded a snapshot of the screen on the boot failure.

Here is the list of the partitions on this system.

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda5              7749536   4412372   2937148  61% /
/dev/sda7              7874528    148800   7325712   2% /os_y
/dev/sda8            441346556   2779924 416147528   1% /user
/dev/sda1              7874528   3018168   4456344  41% /gutsy
/dev/sda6              7874528    148800   7325712   2% /os_x
tmpfs                  1497208         0   1497208   0% /dev/shm

Here is some hardware info,

00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:02.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:03.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:05.0 VGA compatible controller: nVidia Corporation C51 [Quadro NVS 210S/GeForce 6150LE] (rev a2)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a2)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a2)
00:0a.2 RAM memory: nVidia Corporation MCP51 Memory Controller 0 (rev a2)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a2)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a2)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a1)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 15
model name	: AMD Athlon(tm) 64 Processor 3800+
stepping	: 0
cpu MHz		: 1000.000
cache size	: 512 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good
bogomips	: 2010.49
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts fid vid ttp



Comment 4 Chuck Ebbert 2007-10-22 16:12:09 UTC
(In reply to comment #2)
> Created an attachment (id=233671) [edit]
> Boot screen
> 
> This is a little fuzzy but it's readable. The error message looks similar to an
> fsck failure except that it doesn't specify a partition. Running fsck manually
> on all of the partitions finds no errors.
> 

This message is usually from a bad entry in /etc/fstab.


Comment 5 Joshua Rosen 2007-10-23 00:14:58 UTC
The /etc/fstab was generated by the F8 installer. It works fine with the older
kernel; the problem is with this kernel.

LABEL=/                 /                       ext3    defaults        1 1
LABEL=/os_y             /os_y                   ext3    defaults        1 2
LABEL=/user             /user                   ext3    defaults        1 2
LABEL=/gutsy            /gutsy                  ext3    defaults        1 2
LABEL=/os_x             /os_x                   ext3    defaults        1 2
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
/dev/sda3               swap                    swap    defaults        0 0

Comment 6 Th0ma7 2007-10-25 00:54:37 UTC
I'm having the exact same problem.

It started booting properly again with kernel 2.6.23.1-30.fc8 but stopped again
tonight with kernel 2.6.23.1-31.fc8 (I tried 5 times with -31 with no success,
then switched back to -30, which works like a charm).

I first thought there was a problem with my VGs, but no. While investigating I
found out that if I boot in single user mode (by adding "single" to the kernel
boot entry) I get to a root prompt without any problems. Then I can just
type "init 5" and the system works properly, although I cannot boot
normally at all.

The difference is 100% reproducible when comparing kernels
2.6.23.1-30.fc8 (good) and 2.6.23.1-31.fc8 (bad).

Sadly I haven't found the source package for kernel version -30, so I cannot tell
exactly what the differences between the two releases are.

[root@gustav ~]# lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev a2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev a3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev a3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:09.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80)
01:0a.0 Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 13)
05:00.0 VGA compatible controller: nVidia Corporation NV43 [GeForce 6600 GT] (rev a2)

- vin

Comment 7 Joshua Rosen 2007-10-26 16:16:58 UTC
kernel 2.6.23.1-31.fc8 is just as broken as 2.6.23.1-30.fc8

I did an update and then tried the new kernel, it fails identically to the -30.
The -06 is still working.

Comment 8 Chuck Ebbert 2007-10-26 21:12:51 UTC
Adding "ignore_loglevel" to the kernel options should make any possible kernel
warnings print during startup.

Adding this line:

set -x

to /etc/rc.sysinit (right after "set -m") will make rc.sysinit print each
command before running it.

Another possibility: edit /etc/fstab and one by one change the last two numbers
on each line to "0 0", rebooting after each change. But make sure the
filesystems are shut down cleanly before each boot...

Also, adding "fastboot" to the kernel options will skip all filesystem checking
during bootup.
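
The fstab experiment described above could be scripted instead of edited by hand. This is only a minimal sketch, assuming a POSIX shell and awk: the function name fstab_skip_fsck is hypothetical, and it prints a modified copy rather than touching /etc/fstab. It sets the last two fields (dump and fsck pass) to 0 for the single mount point named, and passes every other line through unchanged.

```shell
# fstab_skip_fsck MOUNTPOINT < fstab
# Hypothetical helper: prints the fstab read from stdin with the dump and
# fsck-pass fields (fields 5 and 6) set to 0 for the entry mounted at
# MOUNTPOINT. All other lines, including comments, pass through unchanged.
fstab_skip_fsck() {
    awk -v mp="$1" '
        /^[ \t]*#/ { print; next }              # keep comment lines as-is
        NF >= 6 && $2 == mp { $5 = 0; $6 = 0 }  # assigning a field re-joins the line with single spaces
        { print }
    '
}

# Example (writes a copy; review it before replacing /etc/fstab):
#   fstab_skip_fsck /os_y < /etc/fstab > /tmp/fstab.new
```

Changing one mount point per reboot keeps the test the same as the manual procedure. Note that the rewritten line loses its column alignment, which mount and fsck do not care about.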


Comment 9 Joshua Rosen 2007-10-27 01:57:14 UTC
Created attachment 239731
Error screen snapshot

I put a set -m into /etc/rc.sysinit.

Here is another snapshot of the error screen.

Comment 10 Th0ma7 2007-10-27 16:26:19 UTC
The new kernel is bootable again for me... (2.6.23.1-35.fc8) Give it a try?

Also, why don't you pass the argument vga=0x305 at the kernel boot line in grub
to have a better resolution hence more info on screen.

- vin

Comment 11 Joshua Rosen 2007-10-27 17:46:59 UTC
.35 doesn't work either. I've also built standard 2.6.23 and 2.6.23.1 kernels,
and a 2.6.23.1 kernel without ext4; those don't work either. I have one partition
that was formatted by Gutsy instead of F8. I set that one to not autoload, but that
didn't help either.

I'm now in the process of building a 2.6.23.1 kernel without the extended EXT3
attributes. I'll post the results for that when I have them.

Comment 12 Joshua Rosen 2007-10-28 01:04:53 UTC
I've identified the kernel feature that's responsible for the problem: it's
POSIX Access Control Lists. A 2.6.23.1 kernel built with these switches works:

# File systems
#
CONFIG_EXT2_FS=m
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=m
# CONFIG_EXT3_FS_XATTR is not set
# CONFIG_EXT4DEV_FS is not set

These switches don't work:

#
# File systems
#
CONFIG_EXT2_FS=m
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
# CONFIG_EXT2_FS_SECURITY is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=m
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
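
Comparing the two configurations mechanically may make it easier to see exactly which options differ. The following is a minimal sketch, assuming a POSIX shell and awk; config_diff is a hypothetical helper, not a kernel tool. It treats "# CONFIG_X is not set" as the value "n" and reports every option whose value differs between the two files.

```shell
# config_diff FILE_A FILE_B
# Hypothetical helper: prints each CONFIG_ option whose value differs
# between two kernel config files. "# CONFIG_X is not set" counts as "n";
# an option that a file does not mention at all is shown as "absent".
config_diff() {
    awk '
        function record(line, file,    name, val) {
            if (line ~ /^# CONFIG_[A-Za-z0-9_]+ is not set/) {
                name = line
                sub(/^# /, "", name)
                sub(/ is not set.*/, "", name)
                val = "n"
            } else if (line ~ /^CONFIG_[A-Za-z0-9_]+=/) {
                name = line
                sub(/=.*/, "", name)
                val = line
                sub(/^[^=]*=/, "", val)
            } else {
                return
            }
            seen[name] = 1
            value[file, name] = val
        }
        { record($0, (FILENAME == ARGV[1]) ? 1 : 2) }
        END {
            for (name in seen) {
                a = ((1, name) in value) ? value[1, name] : "absent"
                b = ((2, name) in value) ? value[2, name] : "absent"
                if (a != b) print name ": " a " -> " b
            }
        }
    ' "$1" "$2" | sort
}

# Example (the file names are placeholders):
#   config_diff config-working config-broken
```

Applied to the two fragments above, it would report the XATTR and POSIX_ACL options, plus the options that only one fragment mentions (EXT2_FS_SECURITY and EXT4DEV_FS).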

Comment 13 Th0ma7 2007-10-28 12:32:24 UTC
I wonder, might these be somehow related?
https://bugzilla.redhat.com/show_bug.cgi?id=210111

Comment 14 Joshua Rosen 2007-10-28 12:44:08 UTC
A little bit more background on this box. This is a test system I've just put
together so it has nothing on it but Fedora 8 and Gutsy. I partitioned the disk
with F8 Test3, I have SELinux disabled. Is it possible that F8 screwed something
up in the file system when I turned off SELinux during the first boot?

Comment 15 Joshua Rosen 2007-10-28 16:17:39 UTC
It's a reporting problem: the error message fails to give any information about
which partition has the problem or what the problem is. E2fsck reported that all
of the partitions were clean when I ran it without switches. However, when I ran
it with the -f switch it found problems with some of the attribute counts. After
I fixed the problems I was able to boot.

So the main problem isn't with 2.6.23, which found the file system errors; it's
with 2.6.22.x and earlier, which didn't find a problem. The problem with 2.6.23
is that it needs better error reporting. At the very least it should specify
which partition has the file system problem. It would be better if it also
specified what the problems are.
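
Until the boot scripts report this themselves, per-partition reporting can be approximated from the maintenance shell. This is a minimal sketch, assuming a POSIX shell; check_each and forced_check are hypothetical names, and the device list has to be supplied by hand. It runs a checker on each device in turn and prints the device name with the result. forced_check uses e2fsck with -f, which forces a full check even when the filesystem is marked clean, and -n, which opens it read-only so nothing is modified.

```shell
# forced_check DEVICE
# Forces a full read-only check even if the filesystem is marked clean.
# (-f: force the check, -n: open read-only and answer "no" to every question)
forced_check() {
    e2fsck -f -n "$1"
}

# check_each CHECKER DEVICE...
# Hypothetical helper: runs CHECKER on each DEVICE and names the device in
# every result line. Returns non-zero if any check reported a problem.
check_each() {
    checker=$1
    shift
    any_failed=0
    for dev in "$@"; do
        if "$checker" "$dev" >/dev/null 2>&1; then
            echo "clean: $dev"
        else
            status=$?   # exit status of the checker that just failed
            echo "PROBLEM: $dev (checker exit status $status)"
            any_failed=1
        fi
    done
    return "$any_failed"
}

# Example, run as root with the filesystems unmounted or mounted read-only:
#   check_each forced_check /dev/sda5 /dev/sda7 /dev/sda8
```

Passing the checker as a parameter keeps the loop independent of e2fsck, so a different checker can be swapped in for other filesystem types. A partition flagged here can then be repaired by running e2fsck -f on it interactively.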

Comment 16 Chuck Ebbert 2007-10-29 18:54:43 UTC
(In reply to comment #9)
> Created an attachment (id=239731) [edit]
> Error screen snapshot
> 
> I put a set -m into /etc/rc.sysinit.
> 
The output has scrolled off the screen. Booting with "vga=792" should give
high-resolution mode with more visible lines.

Comment 17 Joshua Rosen 2007-10-29 19:54:22 UTC
I've fixed the file system errors on this system, so there isn't any more input
that I can provide on this problem. As I said in my previous post, I think it
would be a good idea to improve the error reporting. When the boot fails because
of a file system check error it should report the bad partition. It would also
be nice if there was an automatic recovery option in addition to dropping you
into the CLI and having you run e2fsck manually. The auto recovery choice would
run the fscks with a switch that would do all the fixes without asking.

One more thing. It would be nice if the kernel had a switch that allowed it to
log to a USB flash key. Right now your only choice other than the console is to
use a serial port connection to a remote console, which is not very convenient.
Plugging in a USB flash key would be much easier. It would make the debugging of
problems like this simpler.

Comment 18 Th0ma7 2007-10-30 10:04:22 UTC
The problem came back yesterday using the latest kernel (2.6.23.1-37.fc8).

Again, it hung right before mounting the file systems, stating that the root
password was needed. I rebooted using the previous working kernel
(2.6.23.1-35.fc8) and it discovered that it needed to do an automatic fsck on
one of my filesystems. The system did it and booted just fine.

I am wondering if the problem lies right there, when automatic
disk checking is needed. Was the hang due to the disk checking itself, or
simply because it tried to set the fsck flag for the next reboot? Or would
switching between booting Fedora 7 and rawhide affect the fs check flags?

Anyhow, I tried rebooting using the latest kernel and it now works like a charm
(although no disk checking is needed this time).

Comment 19 Th0ma7 2007-10-30 23:51:58 UTC
Just upgraded to latest kernel (2.6.23.1-41.fc8) and got the same problem except
that this time there was no file system to check?

After a second reboot (by pressing CTRL-D) it rebooted properly this time still
using the new kernel... 

I don't get it?

This might be pointless but could it be related to this:
http://lkml.org/lkml/2007/10/29/280

- vin

Comment 20 Joshua Rosen 2007-10-31 00:07:36 UTC
Have you run e2fsck with a -f switch on all of your partitions? I really had a
file system problem even though e2fsck said my partitions were clean. When I
forced a check it found the problems. 

Comment 22 Bill Nottingham 2007-11-01 20:53:43 UTC
What happens if you disable rhgb?

Comment 23 Kasper Pedersen 2007-11-07 21:10:48 UTC
I see this on two P4-1.7 machines, so it's not x64 specific.

When I disable rhgb and remove quiet, it boots, and keeps booting afterwards,
sometimes. I have to zero the disk and reinstall to reproduce it. Deleting the
partition table and booting/installing the f8rc3 media won't do it.
Kernel: 2.6.23.1-42.fc8


Comment 24 Th0ma7 2007-11-09 10:56:21 UTC
I had found that booting in single user mode works (by default I
always boot without rhgb and quiet, but always add vga=0x305).

At the moment it doesn't happen anymore. It seems hard to reproduce
systematically.

I still strongly suspect that the problem occurs when a disk check flag must be
set on a specific FS due to too many mounts, maybe in conjunction with another
partition that needs checking.

Comment 25 Chuck Ebbert 2007-11-09 16:52:00 UTC
This was all due to a bug in rhgb. Everything should be fine after updating that.

Comment 26 Dave Jones 2007-11-19 21:28:59 UTC
Anyone still seeing these problems after running all the latest updates?

