Bug 133959

Summary: Restarting after a power hit, fsck fails with "fsck.ext3: Invalid argument while checking ext3 journal for /"
Product: [Fedora] Fedora
Reporter: John Poelstra <poelstra>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3
CC: pfrields, sct
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2004-12-03 16:49:49 UTC

Description John Poelstra 2004-09-28 18:04:05 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7)
Gecko/20040803 Firefox/0.9.3

Description of problem:
Machine has SELinux set to active and uses the default partition
recommendations accepted at install time (which included two logical
volumes in volgroup 00).

Machine crashed from power loss.  Upon restart I let the boot sequence
progress without intervention (did not select "Y" to force check).

The error message returned to the console is "fsck.ext3: Invalid argument
while checking ext3 journal for /"

The system then drops to the maintenance prompt.

Next I ran "fsck -y" and there was a MASSIVE number of fixes, perhaps
related to logical volumes?  This box had been up less than 24 hours.
The fsck went on for more than an hour before I gave up and
reinstalled the OS.

Version-Release number of selected component (if applicable):
e2fsprogs-1.35-11.i386.rpm

How reproducible:
Didn't try

Steps to Reproduce:
1. Install rawhide 9-27-04 on i386 w/ SELinux turned on and accept
default drive partition configuration
2. Run x11perf -all, Evolution, OO and a few other apps
3. Turn off power
    

Additional info:

9-27-2004 rawhide

Comment 1 Thomas Woerner 2004-10-04 13:58:49 UTC
This is not an e2fsprogs bug.

Comment 2 John Poelstra 2004-10-04 14:06:59 UTC
Is this a bug in a different component then?  If so, which component?

Comment 3 Thomas Woerner 2004-10-04 14:53:24 UTC
This is a kernel bug. Assigning to kernel.

Comment 4 Dave Jones 2004-11-27 21:40:48 UTC
Is this still a problem with the final FC3 release + updates?


Comment 5 Stephen Tweedie 2004-11-27 22:28:33 UTC
Poweroff is, unfortunately, something we can't always do anything about in software.

When power fails, all sorts of things can start to go wrong.  For example, as
the voltage starts to drop, RAM can start failing even while the rest of the
motherboard is working fine, and you end up writing bogus data to disk as the
system tries to complete existing IOs.  Or bus errors can end up sending data to
the wrong part of the disk.  Or the disk write caching can be ineffective and
data the OS thinks is on disk is actually not yet written through to permanent
storage.

So without more information, it's really not possible to say whether this is a
hardware or a software problem. 

Comment 6 John Poelstra 2004-12-03 16:13:30 UTC
My original concern was that upon rebooting the message I got was:
fsck.ext3: Invalid argument while checking ext3 journal for /

What didn't make sense to me was receiving an "invalid argument" error,
considering that I had not entered anything; I simply booted the machine.  Could
it be that my system was so hosed it couldn't find /, thus resulting in the
invalid argument error?

I tried to reproduce the problem by reloading the OS, etc. on the same version,
but was unable to make it happen again. 

Perhaps cutting power and corrupting / would be another way to go about it? (I
am not quite sure how to do this, but would be willing to, given some instructions.)

Comment 7 Stephen Tweedie 2004-12-03 16:42:33 UTC
The "invalid argument" means that the e2fsck fs checker got an EINVAL error from
the kernel.  It looks like a corrupt journal descriptor.  Without the rest of
the boot logs, I can't tell how much else of the root volume it was complaining
about.  
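
For illustration only, and not part of the original comment: the text "Invalid
argument" is the standard C library message for the errno value EINVAL, which is
why it can appear even though nothing was typed at boot.  A minimal Python
sketch, assuming a Linux system; the helper name describe_errno is hypothetical
and is not part of e2fsprogs:

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno value.

    Hypothetical helper for illustration; it only looks up the
    symbolic name and the C library's message text for the code.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # EINVAL is the error the fs checker reported in this bug.
    print(describe_errno(errno.EINVAL))
```

The lookup shows only where the wording comes from; it says nothing about which
call returned EINVAL or why the journal check failed.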

So all we know is, for reasons unknown, some of the static filesystem metadata
for the root fs was not properly found on reboot.  Either there was corruption
in the LVM metadata or on the partition itself.  We've no idea if it was
hardware or software which caused it, and it doesn't seem to be easily reproducible.

Forcibly corrupting "/" in software won't help much; there's not necessarily
much the OS can do without a root fs other than to try to recover it from a
rescue CD.

Cutting power and looking for corruption might confirm it's a hardware problem,
if you're willing to risk trashing the machine repeatedly until it reproduces.

Otherwise, we probably just need to close this as non-reproducible for now, as
it does look more like power-induced hardware disk corruption than anything else.

Comment 8 John Poelstra 2004-12-03 16:49:49 UTC
Thanks for your comments.  I concur and believe it reasonable to close this
issue for now.