Bug 66308

Summary: Filesystem corruption under ext3
Product: [Retired] Red Hat Linux Reporter: John <jsk29>
Component: kernel Assignee: Stephen Tweedie <sct>
Status: CLOSED NOTABUG QA Contact: Aaron Brown <abrown>
Severity: high Docs Contact:
Priority: medium    
Version: 7.2   
Target Milestone: ---   
Target Release: ---   
Hardware: athlon   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2002-06-10 07:15:59 UTC Type: ---

Description John 2002-06-07 14:03:07 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.0 (X11; Linux i686; U;) Gecko/20020516

Description of problem:
Recently my computer overheated and crashed.  (It's a dual processor Athlon MP
1600 in an unairconditioned room. :)  Not real bright of me.)

I restarted it, and it then _crashed during journal recovery._  I doubt the
crash was due to Linux... I suspect the machine had cooled insufficiently and
crashed.

I let it cool for a long time, and then restarted again.  It did a journal
recovery (I believe) and then moved on without any interaction from me.

Later, though, some filesystem corruption was found.  Running "ls" in one of the
home directories returned:
  ls: kpulse10.f: Input/output error
  ls: x9pt003.dat: Input/output error
  ls: x9pr003.dat: Input/output error
  ls: x9ps003.dat: Input/output error
  ls: x9pp003.dat: Input/output error
      fftw_f77.i  kpuls10.dat  kpulse10.out  test.dat  test.f  x9pk003.dat

Looking back at /var/log/messages, there are six instances of fsck clearing
orphaned inodes on /home:
Jun  5 22:52:56 crossbow fsck: /home: Clearing orphaned inode 2305690 (uid=502,
gid=502, mode=0100664, size=0) 
Jun  5 22:52:56 crossbow fsck: /home: Clearing orphaned inode 2305695 (uid=502,
gid=502, mode=0100664, size=0) 
Jun  5 22:53:39 crossbow kernel: 0x378: FIFO is 16 bytes
Jun  5 22:52:56 crossbow fsck: /home: Clearing orphaned inode 2305694 (uid=502,
gid=502, mode=0100664, size=0) 
Jun  5 22:53:40 crossbow kernel: 0x378: writeIntrThreshold is 9
Jun  5 22:52:57 crossbow fsck: /home: Clearing orphaned inode 2305696 (uid=502,
gid=502, mode=0100664, size=53248) 
Jun  5 22:53:40 crossbow kernel: 0x378: readIntrThreshold is 9
Jun  5 22:52:57 crossbow fsck: /home: Clearing orphaned inode 2305687 (uid=502,
gid=502, mode=0100664, size=0) 
Jun  5 22:52:57 crossbow fsck: /home: Clearing orphaned inode 2305686 (uid=502,
gid=502, mode=0100775, size=378017) 
Jun  5 22:52:57 crossbow fsck: /home: Clearing orphaned inode 1635643 (uid=500,
gid=500, mode=0100700, size=3184) 
Jun  5 22:52:57 crossbow fsck: /home: clean, 89262/4480448 files,
2354637/8960253 blocks 
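
For anyone who wants to tally entries like the ones above, here is a minimal
sketch of a parser for the "Clearing orphaned inode" messages.  It assumes each
entry sits on a single line in the real log file (the excerpt above is wrapped
for display); the names ORPHAN_RE, SAMPLE and orphaned_inodes are hypothetical
and chosen only for this illustration.

```python
import re

# Matches fsck's "Clearing orphaned inode" message in the form shown in the
# excerpt above. Assumes one entry per line, as in /var/log/messages.
ORPHAN_RE = re.compile(
    r"fsck: (?P<mount>\S+): Clearing orphaned inode (?P<inode>\d+) "
    r"\(uid=(?P<uid>\d+), gid=(?P<gid>\d+), "
    r"mode=(?P<mode>\d+), size=(?P<size>\d+)\)"
)

# Two entries copied from the report, joined back into single lines.
SAMPLE = [
    "Jun  5 22:52:56 crossbow fsck: /home: Clearing orphaned inode 2305690 "
    "(uid=502, gid=502, mode=0100664, size=0) ",
    "Jun  5 22:53:39 crossbow kernel: 0x378: FIFO is 16 bytes",
    "Jun  5 22:52:57 crossbow fsck: /home: Clearing orphaned inode 2305686 "
    "(uid=502, gid=502, mode=0100775, size=378017) ",
]


def orphaned_inodes(lines):
    """Yield one dict per 'Clearing orphaned inode' entry found in lines."""
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            yield {
                "mount": match["mount"],
                "inode": int(match["inode"]),
                "uid": int(match["uid"]),
                "gid": int(match["gid"]),
                "mode": match["mode"],  # kept as the octal string from the log
                "size": int(match["size"]),
            }


if __name__ == "__main__":
    for entry in orphaned_inodes(SAMPLE):
        print(entry["mount"], entry["inode"], entry["size"])
```

To run it against a real log, pass an open file object (for example
open("/var/log/messages")) in place of SAMPLE; unrelated lines such as the
kernel messages are skipped.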

(I can't seem to find the startup which crashed while recovering the journal in
the log.   Weird.  I'll have to make another search for it.)

I e2fscked the /home partition, and that seems to have fixed the problem.

But, I thought ext3 was supposed to fix this sort of thing.  It looks like the
journaling system does not gracefully recover if the system crashes during
journal recovery.  (I could be wrong...  that's just what it looks like to me.)

Hopefully this is useful!  Let me know if you need any more information!

John

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try


Additional info:

Comment 1 Stephen Tweedie 2002-06-10 09:48:22 UTC
"I/O error" usually means that the disk has a bad sector and the filesystem
cannot read data from it.

Journaling filesystems protect you from the effects of a crash *assuming the
data is still intact on disk*.  Ext3 does survive a crash during recovery
perfectly well.  However, if you get I/O errors and bad sectors on disk, then
there's nothing that the journaling filesystem can do to correct that.  e2fsck
might fix it simply by removing the unreadable files completely.

I/O errors also occasionally occur because there is corrupt data on disk even if
that data is still readable.  Journaling relies on the data that the filesystem
sent to disk being written correctly.  If the hardware is overheating then it is
quite possible for the data to get corrupted, and in that case again the
filesystem is powerless to protect you: there's no point in writing a journal
carefully to disk if the memory, controller, cpu or disk drive is flipping bits
in the journal on its way.  Again, a full fsck may be required to sort out the
mess afterwards.
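
As a rough illustration of the first case (the kernel reporting that it cannot
read something), the sketch below walks a directory tree and lists files that
fail to stat or read.  It is only a sketch under stated assumptions: the
function name find_unreadable is hypothetical, and it can only catch errors
the kernel actually reports, such as EIO.  Data that is corrupt but still
readable, as described above, passes through it unnoticed.

```python
import errno
import os
import stat


def find_unreadable(root):
    """Return (path, errno name) pairs for entries under root that fail.

    A failure is any OSError raised while listing a directory, calling
    lstat() on a file, or reading a regular file to the end.
    """
    failures = []

    def record(exc, path=None):
        # Translate the numeric errno (e.g. 5) into its name (e.g. "EIO").
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        failures.append((path or exc.filename, name))

    # onerror is called by os.walk when a directory cannot be listed.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=record):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                info = os.lstat(path)
                if not stat.S_ISREG(info.st_mode):
                    continue  # skip symlinks, devices, sockets, etc.
                with open(path, "rb") as handle:
                    # Read in 1 MiB chunks so large files do not fill memory.
                    while handle.read(1 << 20):
                        pass
            except OSError as exc:
                record(exc, path)
    return failures


if __name__ == "__main__":
    import sys

    for path, reason in find_unreadable(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(f"{path}: {reason}")
```

An empty result does not prove the disk is healthy; it only means every
regular file could be read once without the kernel returning an error.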