Bug 126730

Summary: ext3 corrupt file(system)
Product: [Fedora] Fedora
Reporter: Need Real Name <niels>
Component: kernel
Assignee: Stephen Tweedie <sct>
Status: CLOSED WORKSFORME
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 2
CC: terry.bowling
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-04-15 14:17:52 UTC
Attachments:
/var/log/messages.1 with smbd and nmbd entries removed. (Flags: none)

Description Need Real Name 2004-06-25 12:57:05 UTC
Description of problem:

Corrupted filesystem seen on a dual Xeon box with a 3ware controller.
The logs say:
Jun 25 09:10:34 x kernel: attempt to access beyond end of device
Jun 25 09:10:34 x kernel: sda2: rw=0, want=7534144408, limit=614405925

An rsync of the data failed with I/O errors on one file. That file
was corrupt (old size < 1K, new size over 1 TB). When I removed the
file, the ext3 filesystem remounted read-only (which I guess is a very
good thing, as it saves me from more problems).

fsck seems to have fixed the broken file. This problem of getting
corrupt files has now occurred twice in 10 days.
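
For scale, the want/limit figures in the "attempt to access beyond end of device" message above can be converted to sizes. This is only a sketch: it assumes both numbers are counts of 512-byte sectors, which the report does not state, and the function names are made up for illustration.

```python
import re

SECTOR_BYTES = 512  # assumption: want/limit are counted in 512-byte sectors

# Log line copied from the report above.
LINE = "Jun 25 09:10:34 x kernel: sda2: rw=0, want=7534144408, limit=614405925"


def parse_beyond_end(line):
    """Return (device, want, limit) from a want=/limit= kernel line, or None."""
    m = re.search(r"(\w+): rw=\d+, want=(\d+), limit=(\d+)", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


def to_gib(sectors):
    """Convert a sector count to GiB under the SECTOR_BYTES assumption."""
    return sectors * SECTOR_BYTES / 2**30


device, want, limit = parse_beyond_end(LINE)
print(f"{device}: want {to_gib(want):.1f} GiB, limit {to_gib(limit):.1f} GiB")
print(f"requested offset is {want / limit:.1f}x the device size")
```

Under that assumption the requested offset is in the terabyte range while the partition is a few hundred GiB, which is at least consistent with a file whose recorded size jumped to over 1 TB.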

Version-Release number of selected component (if applicable):


How reproducible:
Server system that exports the data over NFS to Linux clients.
I haven't isolated when the corruption starts. 
  
Actual results:
Files become corrupted.

Expected results:


Additional info:

Comment 1 Terry Bowling 2004-07-02 19:32:39 UTC
I think I've experienced a similar problem.  This morning my data
directory was mounted read-only, even for root.  It was not until I
rebooted that I could see the ext3 fs was corrupt.  It said it could
not repair it.  It dropped me to a command line and I had to manually
fsck it and allow it to repair.  It seems fine so far, but I'm nervous.

I was using:
Fedora Core 2 (2.6.6.1-435)
Samba 2.0.3-5
Ext3 filesystem is /dev/md3 (hda6,hdc6)

For further info, see my post on the samba list
http://article.gmane.org/gmane.network.samba.general/46564

Comment 2 Stephen Tweedie 2004-07-05 11:43:40 UTC
I'd really need to see full logs from the kernel, from before you
started noticing the problem, to have any hope of getting further with
this.  

Comment 3 Terry Bowling 2004-07-06 14:09:43 UTC
Created attachment 101655
/var/log/messages.1 with smbd and nmbd entries removed.

Comment 4 Terry Bowling 2004-07-06 14:11:04 UTC
Ok, I've attached the file /var/log/messages.1.  I actually used a
'grep -v' to strip out all of the smbd and nmbd messages.  I hope this
still gives you the info you need.  In my /var/log/messages.1 file, I
found the following kernel errors:

Jul  2 04:02:16 fwinsites logrotate: ALERT exited abnormally with [1]
Jul  2 04:04:21 fwinsites kernel: EXT3-fs error (device md3):
ext3_find_entry: bad entry in directory #5046464: inode out of bounds
- offset=8192, inode=16777216, rec_len=32, name_len=22
Jul  2 04:04:21 fwinsites kernel: Aborting journal on device md3.
Jul  2 04:04:21 fwinsites kernel: ext3_abort called.
Jul  2 04:04:21 fwinsites kernel: EXT3-fs abort (device md3):
ext3_journal_start: Detected aborted journal
Jul  2 04:04:21 fwinsites kernel: Remounting filesystem read-only

Another concern is that it did not email this info to me.  Shouldn't
it have let me know there was a problem?  Every day I receive a status
email, so I know the log notification is working.
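
The 'grep -v' filtering mentioned in this comment can be approximated in Python when preparing a log for attachment. This is a rough sketch, not the command that was actually run: the regular expression is an assumed syslog layout, and the smbd line in the sample is invented purely for illustration.

```python
import re

# Assumed syslog layout: "Mon DD HH:MM:SS host program[pid]: message".
# Matches only lines whose program tag is smbd or nmbd.
SAMBA_TAG = re.compile(r"^\w{3}\s+\d+ [\d:]{8} \S+ (smbd|nmbd)(\[\d+\])?:")


def strip_samba(lines):
    """Return the lines that were not logged by smbd or nmbd."""
    return [line for line in lines if not SAMBA_TAG.match(line)]


sample = [
    "Jul  2 04:02:16 fwinsites logrotate: ALERT exited abnormally with [1]",
    # Invented entry, present only to show a line being dropped.
    "Jul  2 04:03:00 fwinsites smbd[1234]: example entry (made up)",
    "Jul  2 04:04:21 fwinsites kernel: Aborting journal on device md3.",
]

for line in strip_samba(sample):
    print(line)
```

Unlike a plain 'grep -v smbd', which drops any line containing the string, this only drops lines whose program tag is smbd or nmbd, so a kernel message that happens to mention Samba would be kept.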

Comment 5 Fabio Pistillo 2004-12-01 16:22:39 UTC
I have the same problem, and at the same time?
/var/log/messages:

Nov 30 19:00:54 localhost kernel: mtrr: type mismatch for 
f6000000,800000 old: write-back new: write-combining
Nov 30 19:00:54 localhost kernel: mtrr: type mismatch for 
f6000000,800000 old: write-back new: write-combining
Dec  1 04:02:02 localhost logrotate: ALERT exited abnormally with [1]
Dec  1 11:53:52 localhost login(pam_unix)[2678]: authentication 
failure; logname= uid=0 euid=0 tty=pts/1 ruser= rhost=10.10.19.73  
user=oracle10


Comment 6 Stephen Tweedie 2005-01-28 16:09:02 UTC
ext3_find_entry: bad entry in directory #5046464: inode out of bounds
- offset=8192, inode=16777216, rec_len=32, name_len=22

is still just a sign that something went bad.  Is there any other sign of
anything going bad in the logs?  What's the hardware?  
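
As a small aid when reading such messages, the numeric fields can be pulled out and the reported inode number shown in hex. The sketch below is illustrative only (the helper names are hypothetical). It shows that 16777216 is exactly 2**24, i.e. a value with a single bit set. That is an observation about the number, not a diagnosis.

```python
import re

# Message text copied from the log quoted above.
LOG = ("ext3_find_entry: bad entry in directory #5046464: inode out of bounds "
       "- offset=8192, inode=16777216, rec_len=32, name_len=22")


def parse_fields(message):
    """Collect the key=value integer fields from an EXT3-fs error message."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", message)}


def is_power_of_two(n):
    """True when exactly one bit is set in n."""
    return n > 0 and n & (n - 1) == 0


fields = parse_fields(LOG)
inode = fields["inode"]
print(f"inode={inode} (0x{inode:08x}), single bit set: {is_power_of_two(inode)}")
```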

It's basically impossible to diagnose this sort of thing remotely.  95% of the
time or more it's bad hardware that is the root cause.  memtest86 is a useful
start, as is "dt" to test the disks.  But without a reproducible pattern of
failure, it's impossible to say why your systems are failing in ways which we
never see during extensive testing.


Comment 7 Stephen Tweedie 2005-04-15 14:17:52 UTC
Please reopen if this can be reproduced on FC3.