Bug 124855 - EXT3-fs error (device hdXX) in start_transaction: Journal has aborted
Summary: EXT3-fs error (device hdXX) in start_transaction: Journal has aborted
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 2
Hardware: i686
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Stephen Tweedie
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2004-05-31 12:15 UTC by Pawel Orzechowski
Modified: 2007-11-30 22:10 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-10-04 11:37:48 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
FC2 ks.cfg file used to install all my machines (4.07 KB, text/plain)
2004-06-26 04:14 UTC, Ray Pitmon
no flags Details
/var/log/messages file with error on hda3 (11.40 KB, text/plain)
2004-09-09 10:55 UTC, Pawel Orzechowski
no flags Details
/var/log/messages file with error on hda5 (11.08 KB, text/plain)
2004-09-09 10:55 UTC, Pawel Orzechowski
no flags Details
/var/log/messages file with error on hda7 (23.24 KB, text/plain)
2004-09-09 10:56 UTC, Pawel Orzechowski
no flags Details
output of dmesg (13.95 KB, text/plain)
2004-09-11 19:27 UTC, Kent Fuka
no flags Details
/proc/pci configuration for machine of kent@kncl.com (1.98 KB, text/plain)
2004-09-11 19:30 UTC, Kent Fuka
no flags Details
/proc/cpuinfo for machine of kent@kncl.com (438 bytes, text/plain)
2004-09-11 19:44 UTC, Kent Fuka
no flags Details
/proc/iomem for machine of kent@kncl.com (735 bytes, text/plain)
2004-09-11 19:45 UTC, Kent Fuka
no flags Details
/proc/ioports for machine of kent@kncl.com (669 bytes, text/plain)
2004-09-11 19:46 UTC, Kent Fuka
no flags Details
/proc/modules for machine of kent@kncl.com (697 bytes, text/plain)
2004-09-11 19:47 UTC, Kent Fuka
no flags Details
/sbin/lspci output for machine of kent@kncl.com (7.54 KB, text/plain)
2004-09-11 19:48 UTC, Kent Fuka
no flags Details
fsck output of /home (1013 bytes, text/plain)
2004-09-28 00:47 UTC, Kent Fuka
no flags Details

Description Pawel Orzechowski 2004-05-31 12:15:05 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040116

Description of problem:
For time to time my dmesg output fills with following message:
EXT3-fs error (device hda6) in start_transaction: Journal has aborted.
There is no info in system logs because hda6 is a /var and while error
occures it is remounted read-only.
It already happened twice since upgrade from FC1 to FC2, every time
system was left idle over weekend.
The onlu way to "repair" system is to reboot a machine. Then
everything seems OK.
system is standard PC without any special hardware:
Motherboard Microstar with VIA IDE vt82c686a controller and Pentium
III 800 MHz. Disk is divided into following partitions:
/dev/hda1 on / type ext3 (rw)
/dev/hda3 on /usr type ext3 (rw)
/dev/hda5 on /tmp type ext3 (rw)
/dev/hda6 on /var type ext3 (rw)
/dev/hda7 on /home type ext3 (rw)
I got this error only from /dev/hda6. This is probably because system
was idle, so only writes to /var had been occuring.



Version-Release number of selected component (if applicable):
2.6.5-1.358

How reproducible:
Didn't try

Steps to Reproduce:


Additional info:

I found some info regarding this error on kernel traffic, but it
concerns previous kernel versions (2.6.0). Now it occures with
2.6.5-1.358.

Comment 1 Stephen Lawrence Jr. 2004-06-04 14:37:22 UTC
I have this issue as well after upgrading from FC1 to FC2 on my Dell
PowerEdge 1650 with scsi. Each morning when I come in the system is
displaying that error to the screen over and over. Unable to reboot
machine and have to power-cycle to get it back up.

Comment 2 Stephen Tweedie 2004-06-09 13:53:42 UTC
The "journal is aborted" is just ext3's normal reaction to certain
error conditions.  An IO failure on a critical part of the disk (eg.
the journal) can be enough to trigger it.  So the root cause for this
error can just as easily be a hardware or driver fault as anything to
do with ext3.

Without more complete logs, there's just no way to diagnose why it
happened --- I really need to see the first message from ext3 that
mentions the abort, as that will tell us _why_ ext3 decided to abort
the journal in the first place.

Comment 3 Brian Hayward 2004-06-11 11:40:24 UTC
I get the same error scrolling on the console but it never shows up in
the messages file, nor do I ever see a reason for the Journal aborted
in the messages log.

Comment 4 Stephen Tweedie 2004-06-11 13:05:38 UTC
If the messages file is on the same filesystem as the abort occurred
on, you won't be able to capture it that way --- by definition, the
filesystem goes readonly on abort.

A serial console cable is often the best way to trap the exact error
in these cases.

Comment 5 Brian Hayward 2004-06-12 15:33:34 UTC
Mine turned out to be a bad hard drive.

Comment 6 James Turner 2004-06-12 23:04:43 UTC
I've had this intermittently in several occasions since "upgrading"
from FC1 to FC2. At first I thought it must be a hardware fault with
one of the hard discs but as of now that it has happened to me with
ext3 filesystems on four different discs on two different machines.

Machine 1: Compaq DeskPro EN 866MHz (Intel 815e Chipset?)
           IDE hard discs
Machine 2: Intel 440BX Motherboard
           IDE hard discs

Both machines running FC2, kernel 2.6.5-1.358 (Standard i686 build as
shipped with distro).

The fault last occured on machine 2 while moving and deleting some old
files and directories from the /var partition on /dev/hdc1. Logged in
through ssh, my first indicator of a problem was the filesystem going
read-only. dmesg output produced immediately afterwards:

EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339721: rec_len % 4 != 0 - offset=0, inode=572662306, rec_len=8738,
name_len=34
Aborting journal on device hdc1.
ext3_abort called.
EXT3-fs abort (device hdc1): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339723: rec_len % 4 != 0 - offset=0, inode=572662306, rec_len=8738,
name_len=34
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339725: rec_len % 4 != 0 - offset=0, inode=3995472896, rec_len=61439,
name_len=237
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339727: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339729: directory entry across blocks - offset=0, inode=1038630372,
rec_len=15856, name_len=244
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339731: rec_len % 4 != 0 - offset=0, inode=954087615, rec_len=14563,
name_len=3
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339733: rec_len % 4 != 0 - offset=0, inode=1025719571, rec_len=15719,
name_len=137
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339735: rec_len % 4 != 0 - offset=0, inode=839659960, rec_len=12866,
name_len=138
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339739: directory entry across blocks - offset=0, inode=1048788601,
rec_len=16024, name_len=104
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339742: rec_len % 4 != 0 - offset=0, inode=850342558, rec_len=12987,
name_len=199
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339744: rec_len % 4 != 0 - offset=0, inode=2535210785, rec_len=60019,
name_len=48
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339746: rec_len % 4 != 0 - offset=0, inode=16843009, rec_len=257,
name_len=1
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339748: rec_len % 4 != 0 - offset=0, inode=4294967295, rec_len=15621,
name_len=69
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339750: rec_len is smaller than minimal - offset=84, inode=4784204,
rec_len=0, name_len=0
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339752: rec_len % 4 != 0 - offset=0, inode=267112565, rec_len=18059,
name_len=12
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339754: rec_len % 4 != 0 - offset=0, inode=4294791301, rec_len=60415,
name_len=29
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339756: directory entry across blocks - offset=0, inode=15493130,
rec_len=49152, name_len=233
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339758: directory entry across blocks - offset=0, inode=3221225626,
rec_len=18408, name_len=34
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339760: rec_len % 4 != 0 - offset=0, inode=2312715081, rec_len=2125,
name_len=117
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339764: directory entry across blocks - offset=0, inode=1190966301,
rec_len=23296, name_len=195
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339766: rec_len % 4 != 0 - offset=0, inode=1374456661, rec_len=33617,
name_len=101
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339768: rec_len % 4 != 0 - offset=0, inode=3354686857, rec_len=63557,
name_len=158
EXT3-fs error (device hdc1): ext3_readdir: bad entry in directory
#339770: rec_len % 4 != 0 - offset=0, inode=2784166633, rec_len=51339,
name_len=131
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted
EXT3-fs error (device hdc1) in start_transaction: Journal has aborted

Rebooted machine and had to run fsck manually - multiple errors
requring manual confirmation, but machine able to boot normally again
afterwards. Will report details if happens again (limiting to just
first few lines if that's all that's relevant).

Comment 7 Ray Pitmon 2004-06-21 04:22:35 UTC
This happens to me everytime I try to upgrade to the latest kernel rpm
(435-smp) (on 3 different identical machines!), and now it happened to
a machine that's been running for a week (on 427). (see my post in the
mailing list)
All machines that had this problem were:
Dell 2450, dual 866 CPU, 1GB RAM, PERC 3/Si raid, 2 drives configured
hardware mirror.
I created a custom boot cd that does a kickstart network install, so
all installs were the exact same as well.

I had syslog going to a different machine, and here are some log
entries where it broke:

Jun 20 19:06:06 dc17 syslogd: /var/log/messages: Read-only file system
Jun 20 19:06:06 dc17 kernel: EXT3-fs error (device sda2):
ext3_free_blocks: Freeing blocks not in datazone - block = 12978214
49, count = 1
Jun 20 19:06:06 dc17 kernel: Aborting journal on device sda2.
Jun 20 19:06:06 dc17 kernel: EXT3-fs error (device sda2):
ext3_free_blocks: Freeing blocks not in datazone - block = 34942840
90, count = 1
Jun 20 19:06:07 dc17 kernel: ext3_abort called.
Jun 20 19:06:07 dc17 kernel: EXT3-fs abort (device sda2):
ext3_journal_start: Detected aborted journal
Jun 20 19:06:07 dc17 kernel: Remounting filesystem read-only
Jun 20 19:06:07 dc17 kernel: EXT3-fs error (device sda2) in
start_transaction: Journal has aborted
Jun 20 19:06:07 dc17 kernel: EXT3-fs error (device sda2):
ext3_free_blocks: Freeing blocks not in datazone - block = 40156468
81, count = 1
Jun 20 19:06:07 dc17 kernel: EXT3-fs error (device sda2):
ext3_free_blocks: Freeing blocks not in datazone - block = 91000576
8, count = 1
Jun 20 19:06:07 dc17 kernel: EXT3-fs error (device sda2):
ext3_free_blocks: Freeing blocks not in datazone - block = 56902952
4, count = 1
.. several pages of these, then a few these..
Jun 20 19:06:16 dc17 kernel: EXT3-fs error (device sda2) in
start_transaction: Journal has aborted
.. many more of these..
Jun 20 19:07:27 dc17 kernel: EXT3-fs error (device sda2):
ext3_free_blocks: Freeing blocks not in datazone - block = 25129345
86, count = 1
.. then this..
Jun 20 19:07:28 dc17 kernel: ext3_reserve_inode_write: aborting
transaction: Journal has aborted in __ext3_journal_get_write_
access<2>EXT3-fs error (device sda2) in ext3_reserve_inode_write:
Journal has aborted
Jun 20 19:07:28 dc17 kernel: EXT3-fs error (device sda2) in
ext3_truncate: Journal has aborted
Jun 20 19:07:28 dc17 kernel: ext3_reserve_inode_write: aborting
transaction: Journal has aborted in __ext3_journal_get_write_
access<2>EXT3-fs error (device sda2) in ext3_reserve_inode_write:
Journal has aborted
Jun 20 19:07:28 dc17 kernel: EXT3-fs error (device sda2) in
ext3_orphan_del: Journal has aborted
Jun 20 19:07:28 dc17 kernel: ext3_reserve_inode_write: aborting
transaction: Journal has aborted in __ext3_journal_get_write_
access<2>EXT3-fs error (device sda2) in ext3_reserve_inode_write:
Journal has aborted
Jun 20 19:07:29 dc17 kernel: EXT3-fs error (device sda2) in
ext3_delete_inode: Journal has aborted
Jun 20 19:07:29 dc17 kernel: __journal_remove_journal_head: freeing
b_committed_data
Jun 20 19:07:30 dc17 kernel: EXT3-fs error (device sda2) in
start_transaction: Journal has aborted

Comment 8 Stephen Tweedie 2004-06-23 23:51:37 UTC
All I can see from these is that there's some data corruption on disk.
 The problems are not consistent between reports here.  Some of the
reports are UP, some SMP.  I suspect we're going to need more data to
figure out the common issue here, whether it's something stomping on
kernel memory, or driver issues, or filesystem issues, or possibly
several different problems altogether.

Are these cases using ACLs, SELinux, quotas?  Exactly what problems
(if any) does fsck report afterwards?  Are there any other driver
errors in the logs?  Is there any pattern of workload that seems to
trigger the problem?

The "everytime I try to upgrade to the latest kernel rpm" one looks
really weird.  I wonder if there's a PERC driver bug in the previous
kernel version there.  Trying the upgrade with a UP kernel might be
one thing to try.

Comment 9 Stephen Tweedie 2004-06-24 13:03:45 UTC
One other thing --- whenever this occurs, we need to look carefully
for the last kernel messages *before* the journal goes offline. 
Journal aborts are usually due to some form of IO error, either an IO
failure or corrupted data from an IO, and it is important to know
whether or not there are any driver log messages prior to the abort
which could indicate a problem.

Comment 10 Stephen Tweedie 2004-06-24 13:20:52 UTC
Usually the only way to take this sort of bug report forward is either
just to collect enough data that patterns become apparent, or to find
a specific reproducer for the problem.  

The only obviously reproducible error here is the "everytime I try to
upgrade to the latest kernel rpm (435-smp)" one.  From the fedora-list
email:

>>>>
All I'm doing is 'yum update', and it downloads the latest kernel
(single and smp), then does a test transaction, then updates the
single cpu rpm, then it tries to update the SMP kernel rpm, and it
blows up, with a whole bunch of these:

scsi1 (0:0): rejecting I/O to offline device
>>>>

which is pretty clearly not a filesystem problem at root --- if the
filesystem is getting its I/O rejected because the underlying device
is offline, then the journal abort is definitely its expected response!  

Have you been able to capture any more from the logs?  We really need
to know why the device went offline to take this particular case any
further.

Comment 11 Ray Pitmon 2004-06-26 04:14:52 UTC
Created attachment 101432 [details]
FC2 ks.cfg file used to install all my machines

This will tell you exactly what I've got installed on my machine when I then do
a yum upgrade and it causes the FS to go south on a kernel update. Before the
kernel update, I had already applied most other updates available.

Comment 12 Ray Pitmon 2004-06-26 04:16:07 UTC
I meant to add this comment before the above one, so here goes..

Unfortunately I don't think I was remotely logging the machines at the
time that I was having the kernel-upgrade crashes.  I have since put
FC1 on most of the machines, because I need to actually do something
with them, but I do have 1 or 2 that still have FC2 installed, so I'll
try to do the upgrade again, and save the output from a console and
setup remote syslog.  Is there anything special you want me to do
before/when I try to do the upgrade?

Looking thru the logs I do have, the crash on Jun20th (which was
random, not from a kernel upgrade attempt), here are the first
messages to appear in the log for several hours (same as above):

Jun 20 19:06:06 dc17 syslogd: /var/log/messages: Read-only file system
Jun 20 19:06:06 dc17 kernel: EXT3-fs error (device sda2):
ext3_free_blocks: Freeing blocks not in datazone - bloc
k = 1297821449, count = 1
Jun 20 19:06:06 dc17 kernel: Aborting journal on device sda2.

There were no other messages from that host logged before the above
entries.

All of the machines that I had the problem with went thru the exact
same install via the same remote ks file.  Nothing special, pretty
well stripped down except for a webserver and basic X (not window
managers even), gcc, etc..  Actually, why don't I just upload the
ks.cfg file and you can take a look for yourself.

-ray



Comment 13 Stephen Tweedie 2004-06-28 11:44:19 UTC
The "ext3_free_blocks: Freeing blocks not in datazone - block =
1297821449, count = 1" seems to be a different pattern from the "death
on kernel upgrade."  But in both cases I've really got no idea what
the underlying problem is --- really, capturing as much log output as
possible is going to be the best way to narrow things down right now.
 Thanks.

Comment 14 Benjamin Wykes 2004-09-05 17:22:36 UTC
I have this issue - the original "EXT3-fs error (device hda2) in
start_transaction: Journal has aborted." happening regularly after
approx 3 days uptime.
Could you inform me of the best way to troubleshoot this once it has
happend?
I still have access to xterms/console (although other applications
have been killed off - and any remote seessions are unavailable) Some
commands are still available to me whilst the machine is in this error
state without producing an I/O error (commands like 'more' for example
produce I/O errors).
Many thanks.

Comment 15 Pawel Orzechowski 2004-09-09 10:52:48 UTC
I have some additional info that might be helpful: 
- the error occures on diferent filesystems (my disk 
has: /, /usr, /tmp, /var, /home) 
- disk was SMART checked and it did not report any error 
- before "journal aborted" there is a message: 
"EXT3-fs error (device hda7): ext3_free_blocks: bit already cleared 
for block 8939931" 

Comment 16 Pawel Orzechowski 2004-09-09 10:55:20 UTC
Created attachment 103633 [details]
/var/log/messages file with error on hda3

Comment 17 Pawel Orzechowski 2004-09-09 10:55:54 UTC
Created attachment 103634 [details]
/var/log/messages file with error on hda5

Comment 18 Pawel Orzechowski 2004-09-09 10:56:17 UTC
Created attachment 103635 [details]
/var/log/messages file with error on hda7

Comment 19 Stephen Tweedie 2004-09-10 15:28:39 UTC
Do current rawhide kernels show the same problem?  Does a memtest86
show hardware problems?

Right now, all I can see is you're getting memory or IO corruption
somewhere.  There's nothing conclusive showing whether it's a kernel
or hardware fault, or whether it's memory or disk or something in between.

Comment 20 Pawel Orzechowski 2004-09-11 17:03:34 UTC
1. Well, I did not try rawhide kernel. 
2. Memtest86 does not show any errors (but I will run it once more) 

Comment 21 Kent Fuka 2004-09-11 19:19:30 UTC
I am also experiencing these problems on IDE drives on a uniprocessor
under kernels as recent as kernel-2.6.8-1.521.  I can reproduce this
problem within about 24 hours of runtime by doing full amanda backups.
 The error messages are either "freeeing blocks not in datazone",
"directory #NNNNN contains a hole at offset XXXX" or "attempt to
access beyond end of device".  I previously experienced this problem
over the past 9 months on this machine under RH9 as well, so it has
probably been around for a while. This was the problem that lead me to
try upgrading to FC2 in the first place.  Suspecting "bit rot" on the
drive, I have swapped the hard drive and IDE cable which did not fix
the problem.  I do not see file data corruption, only problems with
the filesystem structure metadata being corrupted, which also leads me
to suspect that this is not a hardware problem. Oddly, the files that
are corrupted are rarely ones we actually use, and certainly aren't
modifying (e.g. a Makefile for an M68K software package we have never
run or a file in a graphics package and we don't run X on this machine).  

Comment 22 Kent Fuka 2004-09-11 19:27:02 UTC
Created attachment 103726 [details]
output of dmesg

Comment 23 Kent Fuka 2004-09-11 19:30:25 UTC
Created attachment 103727 [details]
/proc/pci configuration for machine of kent

Comment 24 Kent Fuka 2004-09-11 19:44:40 UTC
Created attachment 103729 [details]
/proc/cpuinfo for machine of kent

Comment 25 Kent Fuka 2004-09-11 19:45:26 UTC
Created attachment 103730 [details]
/proc/iomem for machine of kent

Comment 26 Kent Fuka 2004-09-11 19:46:01 UTC
Created attachment 103731 [details]
/proc/ioports for machine of kent

Comment 27 Kent Fuka 2004-09-11 19:47:16 UTC
Created attachment 103732 [details]
/proc/modules for machine of kent

Comment 28 Kent Fuka 2004-09-11 19:48:04 UTC
Created attachment 103733 [details]
/sbin/lspci output for machine of kent

Comment 29 Stephen Tweedie 2004-09-13 10:46:37 UTC
But are you actually checking for file data corruption?

Also, if there are memory errors and the files concerned are mostly
being accessed for read, it's entirely possible to get errors in
memory which don't get propagated back to disk.  That can easily
result in a system passing a simple data verify, but still failing
with data errors under load.

Re memtest comment above:

running memtest86 once is not a good test of memory.  I usually
recommend at least a full overnight run.

http://www.bit-net.com/~rmiller/dt.html is a useful test program that
verifies the data it writes, when there's a suspicion of bad hardware.
   A problem like this that's reproducible on a wide range of kernels
definitely sounds like a hardware problem.



Comment 30 Stephen Tweedie 2004-09-13 10:48:24 UTC
Of course, before running any disk check program, make double-sure
that the filesystem you're running on is already free from errors by
fscking it (or just run on a newly-mkfs'ed filesystem).  You can
"touch /forcefsck" then reboot to force a full check of all filesystems.

Comment 31 Kent Fuka 2004-09-14 05:12:10 UTC
To check for file corruption, I run tripwire each night, which
performs an MD5 checksum (amoung other tests) over most (but not all)
of the files on my system.  If there were data corruption, the
checksums would differ.  Thus far (i.e in the last 9 months going back
to RH9), the only files that tripwire reports as bad are the ones
where the metadata has gone bad.  One thing I noticed yesterday while
logging messages from an fsck session is that many (but not all) of
the bad files have illegal blocks as blocks 8 and/or 9 (almost never
1-7).  Looking back through my console log, I see that this is
consistent.  Often, fsck winds up also changing the number allocated
blocks in the inode from 8 down to 1 (or 0).

To try to make progress, I've copied all of my ext3 filesystems onto
new ext2 filesystems on another drive, and have booted from the ext2
drive.  This is a drive which had gotten these types of errors when
the filesystems were ext3.  If the problem doesn't re-occur over the
next few days on ext2 fs, I'm going to suspect even more that the
problem is specific to ext3 fs.

As for running fsck, I wind up doing this almost daily, making it seem
to me that I can reproduce this problem in about 24 hours.  Then
amanda kicks off at 1 am. As long as I don't run amanda full backups,
I rarely see any problems. Running amanda full backups, I generally
have errors in the logs in the morning that occur while the backups
were running.  I suspect that the backups generate enough disk traffic
to cause the problem to occur.  Tripwire normally runs at 4am each
day.  I check the logs for ext3 errors and the tripwire report for
errors about 9am each day.

Thanks for the suggestion about dt.  I've looked at it.  I suppose I
ought to run it on a raw disk partition (or drive).  Is /dev/raw/* no
longer included in the kernel by default?  i.e. What's the best way to
name a raw disk partition in FC2?  Specifying (for example) /dev/hdc7
to dt gets (not surprisingly) the block device.  Do I need to re-gen
the kernel to get raw disk dev support?

Comment 32 Stephen Tweedie 2004-09-14 12:03:18 UTC
The raw driver had a ton of limitations, such as an upper limit of 255
raw devices active at once.  The new way to achieve the same thing on
2.6 kernels is to open the device with O_DIRECT instead.  

Is the problem _always_ in blocks 8/9?  Or just usually?  That's a
strange symptom: there's just nothing special about those blocks in
the fs code (the first "special" block in a file is block 12, when we
start using indirect blocks.)  Are there any particular apps you've
got that might be doing something unusual at that offset in a file? 
Is there any pattern to the files which show corruption?

Are you using xattrs?  Quotas?  What other modules are in use -- there
could be one stomping on ext3's memory here.  Running a kernel-debug
kernel would be one way to try to catch that.

Could you capture a log of the fsck output at some point, showing the
exact errors it detects, please?

As for running dt, I think there's value both in running it on a fs
itself _and_ on the raw device.  Each provides different info.

Sorry, but there are lots of questions here.  There's simply no
obvious pattern at present; we have tons of users using ext3 _very_
heavily without these problems being reported, and in our internal
testing I've never seen them reported either.  So it's going to be a
bit of a hunt to find what's special in your case.

Comment 33 Kent Fuka 2004-09-28 00:38:05 UTC
Here's an update: I've had the problem appear on ext2 filesystems as
well.  I can reproduce it using rsync to copy my /home filesystem from
one partition to another.  I'll attach a "typical" fsck output for /home.

I am not running xattrs or quotas.  The list of modules I use is
included in the prior /proc/modules attachment, I believe.

I'll try dt with "flags=direct" on a filesystem, and on a raw drive
and report on what I find.

I've run memtest86 overnight in extended tests mode with no errors.


Comment 34 Kent Fuka 2004-09-28 00:47:02 UTC
Created attachment 104413 [details]
fsck output of /home

Scenario:  In single-user mode, ran fsck cleanly on /home.  ran "rsync -uax
--verbose /home/ /mnt/home_copy/".  Re-ran fsck on /home and got the errors in
this attachment.  In theory only writes to /home should be updating the file
access times of of every inode in /home.  /home is /dev/hda6.  /mnt/home_copy
is /dev/hdc6.

Comment 35 Stephen Tweedie 2004-09-28 10:19:15 UTC
Kent, Some notes:

First, tripwire does not help detect data corruption if the corruption
is occurring on writes (it will only end up checksumming the incorrect
data!)

Does fsck run cleanly if you run again after fixing up the problems?

This particular problem has never felt like an ext3 problem; if you
can reproduce with ext2, that's confirmed.  It might still be
hardware; it might be some other part of the system scribbling on the
inode structure in memory (ext2 and ext3 keep direct block 8/9 at the
same offset into their inode structs, although the structs are rather
different sizes.)

The "dt" program (http://www.bit-net.com/~rmiller/dt.html) is an
excellent way to start hunting for the first possibility.  Building a
debug kernel (especially one with CONFIG_DEBUG_SLAB set) would be a
good first step towards isolating the second.

Comment 36 Kent Fuka 2004-10-01 23:37:24 UTC
Agreed about tripwire.  Wouldn't detect data changing on writes.  My
point was that it would detect "bit rot" in the data -- errors on
re-reading.

Fsck runs clean after fixing up any problems.

I saw two some single-bit errors with dt after a few hundred passes
writing to a raw disk partition.  I'm going to conclude based on this
that it's a hardware defect or else I need to us hdparm to back down
the UDMA performance.  You can close this bug as far as I'm concerned...

Thanks for the help.

Comment 37 Stephen Tweedie 2004-10-04 11:37:48 UTC
OK, will do --- thanks for helping to chase this down.



Note You need to log in before you can comment on or make changes to this bug.