Bug 126730 - ext3 corrupt file(system)

Summary: ext3 corrupt file(system)

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	2
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-06-25 12:57 UTC by Need Real Name
Modified:	2007-11-30 22:10 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-04-15 14:17:52 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
/var/log/messages.1 with smbd and nmbd entries removed. (80.00 KB, text/plain) 2004-07-06 14:09 UTC, Terry Bowling

Description Need Real Name 2004-06-25 12:57:05 UTC

Description of problem:

Corrupted filesystem seen on a dual Xeon box with a 3ware controller.
The logs say:
Jun 25 09:10:34 x kernel: attempt to access beyond end of device
Jun 25 09:10:34 x kernel: sda2: rw=0, want=7534144408, limit=614405925
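
A minimal sketch of how the two numbers in the second log line compare. It assumes that this kernel reports want and limit in 512-byte sectors; the names parse_beyond_end and SECTOR_BYTES are hypothetical and used only for illustration.

```python
import re

# Assumption: want/limit are counted in 512-byte sectors (not stated in the log).
SECTOR_BYTES = 512


def parse_beyond_end(line):
    """Return device, want and limit from a 'rw=..., want=..., limit=...' line."""
    m = re.search(r"(\w+): rw=\d+, want=(\d+), limit=(\d+)", line)
    if m is None:
        raise ValueError("not a want/limit line")
    return {
        "device": m.group(1),
        "want": int(m.group(2)),
        "limit": int(m.group(3)),
    }


line = "Jun 25 09:10:34 x kernel: sda2: rw=0, want=7534144408, limit=614405925"
info = parse_beyond_end(line)

# How far the requested end sector overshoots the size of the device.
ratio = info["want"] / info["limit"]
print(f"{info['device']}: requested end is {ratio:.1f}x the device limit")

# Only meaningful if the 512-byte sector assumption holds.
size_gb = info["limit"] * SECTOR_BYTES / 1e9
print(f"device size is about {size_gb:.0f} GB under that assumption")
```

Whatever the unit, the requested sector lies roughly twelve times beyond the end of sda2, so the block number itself looks like garbage rather than a value that is slightly off.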

An rsync of the data failed on a file with I/O errors; this file
was corrupt (old size < 1K, new size over 1 TB). When I removed this
file, the ext3 filesystem was remounted read-only (which I guess is a
very good thing, as it saves me from more problems).

fsck seems to have fixed the broken file. This problem of getting
corrupt files has now occurred twice in 10 days.

Version-Release number of selected component (if applicable):


How reproducible:
Server system, which nfs exports the data to linux clients.
I haven't isolated when the corruption starts. 
  
Actual results:
Files get corrupted.

Expected results:


Additional info:

Comment 1 Terry Bowling 2004-07-02 19:32:39 UTC

I think I've experienced a similar problem.  This morning my data
directory was mounted read-only, even for root.  It was not until I
rebooted that I could see the ext3 filesystem was corrupt.  It said it
could not repair it.  It dropped me to a command line, and I had to run
fsck manually and allow it to make the repairs.  It seems fine so far,
but I'm nervous.

I was using:
Fedora Core 2 (2.6.6.1-435)
Samba 2.0.3-5
Ext3 filesystem is /dev/md3 (hda6,hdc6)

For further info, see my post on the samba list
http://article.gmane.org/gmane.network.samba.general/46564

Comment 2 Stephen Tweedie 2004-07-05 11:43:40 UTC

I'd really need to see full logs from the kernel, from before you
started noticing the problem, to have any hope of getting further with
this.

Comment 3 Terry Bowling 2004-07-06 14:09:43 UTC

Created attachment 101655
/var/log/messages.1 with smbd and nmbd entries removed.

Comment 4 Terry Bowling 2004-07-06 14:11:04 UTC

Ok, I've attached the file /var/log/messages.1.  I actually used a
'grep -v' to strip out all of the smbd and nmbd messages.  I hope this
still gives you the info you need.  In my /var/log/messages.1 file, I
found the following kernel errors:

Jul  2 04:02:16 fwinsites logrotate: ALERT exited abnormally with [1]
Jul  2 04:04:21 fwinsites kernel: EXT3-fs error (device md3):
ext3_find_entry: bad entry in directory #5046464: inode out of bounds
- offset=8192, inode=16777216, rec_len=32, name_len=22
Jul  2 04:04:21 fwinsites kernel: Aborting journal on device md3.
Jul  2 04:04:21 fwinsites kernel: ext3_abort called.
Jul  2 04:04:21 fwinsites kernel: EXT3-fs abort (device md3):
ext3_journal_start: Detected aborted journal
Jul  2 04:04:21 fwinsites kernel: Remounting filesystem read-only
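
The numeric fields in the ext3_find_entry message above can be pulled out and inspected with a short script. This is only a sketch: parse_ext3_fields is a hypothetical helper, and the wrapped log lines have been joined back into one string by hand.

```python
import re


def parse_ext3_fields(message):
    """Collect the numeric key=value pairs from an EXT3-fs error message."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", message)}


message = (
    "EXT3-fs error (device md3): ext3_find_entry: bad entry in directory "
    "#5046464: inode out of bounds - offset=8192, inode=16777216, "
    "rec_len=32, name_len=22"
)

fields = parse_ext3_fields(message)
inode = fields["inode"]

print(fields)
print(hex(inode))             # 0x1000000
print(bin(inode).count("1"))  # 1, i.e. exactly one bit is set
```

The reported inode number is exactly 2**24, a value with a single bit set. That pattern is compatible with a damaged directory entry, but on its own it does not say whether the cause was memory, disk, or software.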

Another concern is that it did not email this info to me.  Shouldn't
it have let me know there was a problem?  Every day I receive a status
email, so I know the log notification is working.

Comment 5 Fabio Pistillo 2004-12-01 16:22:39 UTC

I have the same problem, and at the same time of day?
/var/log/messages:

Nov 30 19:00:54 localhost kernel: mtrr: type mismatch for 
f6000000,800000 old: write-back new: write-combining
Nov 30 19:00:54 localhost kernel: mtrr: type mismatch for 
f6000000,800000 old: write-back new: write-combining
Dec  1 04:02:02 localhost logrotate: ALERT exited abnormally with [1]
Dec  1 11:53:52 localhost login(pam_unix)[2678]: authentication 
failure; logname= uid=0 euid=0 tty=pts/1 ruser= rhost=10.10.19.73  
user=oracle10

Comment 6 Stephen Tweedie 2005-01-28 16:09:02 UTC

ext3_find_entry: bad entry in directory #5046464: inode out of bounds
- offset=8192, inode=16777216, rec_len=32, name_len=22

is still just a sign that something went bad.  Is there any other sign of
anything going bad in the logs?  What's the hardware?  

It's basically impossible to diagnose this sort of thing remotely.  95% of the
time or more it's bad hardware that is the root cause.  memtest86 is a useful
start, as is "dt" to test the disks.  But without a reproducible pattern of
failure, it's impossible to say why your systems are failing in ways which we
never see during extensive testing.

Comment 7 Stephen Tweedie 2005-04-15 14:17:52 UTC

Please reopen if this can be reproduced on FC3.
