110528 – File system corruption RH9/ext3

Summary: File system corruption RH9/ext3

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	9
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-11-20 19:52 UTC by Curtis Regentin
Modified:	2015-01-04 22:03 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:41:43 UTC
Embargoed:

Description Curtis Regentin 2003-11-20 19:52:11 UTC

Description of problem:
I'm having some file-system corruption problems on three systems.  All
will reliably repeat the problem, though it often takes days of
thrashing.  All are running 3ware SATA RAID controllers (though in
differing RAID configurations), ext3, RH9 with kernel 2.4.20-18. 
There is no indication of hardware trouble - no IO errors, timeouts,
dropped disks, etc.  The systems are all dual Xeon with ECC RAM.

After some time they repeatedly report "Allocating block in system zone - XXXXX",
where XXXXX is a block number that holds file-system metadata
(according to dumpe2fs).  This, of course, renders the file-system wholly
broken, and shortly the system becomes useless.

All three systems have been running overnight on reiserfs, with no
problems.



Version-Release number of selected component (if applicable):
RH9, kernel 2.4.20-18

How reproducible:
Sadly, not on demand.  It takes many hours of running a burn script.
  
Actual results:
Filesystem totally clobbered.

Expected results:
Filesystem not totally clobbered. ;)

Comment 1 Curtis Regentin 2003-11-20 19:52:40 UTC

While investigating, I found some concerns.

1)
While the tuning of ext3's error behavior is available, this may be a
case where "goto error_return" is in order instead of knowingly
smashing the file-system.  Philosophical issue, I guess.

2)
The notes on the web site regarding 2.4.20-18 contain the following
statement:
"
A potential data corruption scenario has been identified. This
scenario can occur under heavy, complex I/O loads. The scenario
only occurs while performing memory mapped file I/O, where the
file is simultaneously unlinked and the corresponding file blocks
reallocated. Furthermore, the memory mapped writes must be to a
partial page at the end of a file on an ext3 file system. As such,
Red Hat considers this an unlikely scenario.
"
This statement is in the list of bugs fixed.  Is it fixed, or is it
identified?  If it's not fixed, what are the symptoms?

3)
I checked out the patches on the latest kernel (2.4.20-20), which
contain a bunch of new checks in the ext3 block allocation and freeing
routine.  This would lead me to believe that I'm not the only one
seeing the problem.  Also in this patch set
(linux-2.4.20-selected-ac-bits.patch) on line 48466 is the following
change:

@@ -336,7 +335,6 @@ do_more:
    wait_on_buffer (bh);
  }
  if (overflow) {
-   block += count;
    count = overflow;
    goto do_more;
  }

Now, I don't know the fs code very well, but this appears to
completely disable the freeing of block ranges spanning group
boundaries, and results in continuously freeing the same blocks at the
end of the first group over and over until "count" runs out.  It seems
to me (in my ignorance) that freeing block ranges spanning group
boundaries may be a bad thing indeed - but I would think it would
indicate an error in the code calling the free routine, and should not
be handled in the free routine by doing bizarre things.  Again, I may
be wholly ignorant in this.

If my assumptions are correct, and freeing blocks spanning group
boundaries is a problem (because metadata is on the boundary?), then
this code would hide the problem - but cause some blocks that should
be freed, to never be freed.
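
The concern above can be checked with a small stand-alone model.  This is
only a sketch, not the kernel code: BLOCKS_PER_GROUP, MODEL_BLOCKS,
free_blocks_model() and print_counts() are hypothetical names, the group
size is tiny on purpose, and the overflow arithmetic is an assumption about
how the loop clips a range at a group boundary.  It also assumes nothing
else in the loop advances "block"; the quoted hunk alone does not show
whether the patched code does that somewhere else.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sizes, tiny on purpose; real groups hold thousands of blocks. */
#define BLOCKS_PER_GROUP 8UL
#define MODEL_BLOCKS     32UL

/*
 * Simplified model of the do_more loop.  freed[] counts how often each
 * block is "freed".  keep_advance != 0 models the loop with
 * "block += count"; keep_advance == 0 models it with that line removed.
 */
static void free_blocks_model(unsigned long block, unsigned long count,
                              int keep_advance, unsigned freed[MODEL_BLOCKS])
{
    unsigned long bit, overflow, i;

    assert(block + count <= MODEL_BLOCKS);
do_more:
    overflow = 0;
    bit = block % BLOCKS_PER_GROUP;          /* offset inside the group */
    if (bit + count > BLOCKS_PER_GROUP) {    /* range crosses the boundary */
        overflow = bit + count - BLOCKS_PER_GROUP;
        count -= overflow;                   /* clip to the current group */
    }
    for (i = 0; i < count; i++)
        freed[block + i]++;
    if (overflow) {
        if (keep_advance)
            block += count;                  /* the line the hunk removes */
        count = overflow;
        goto do_more;
    }
}

/* Print every block that was freed at least once, with its count. */
static void print_counts(const char *label, const unsigned freed[MODEL_BLOCKS])
{
    unsigned long i;

    printf("%s:", label);
    for (i = 0; i < MODEL_BLOCKS; i++)
        if (freed[i])
            printf(" block %lu x%u", i, freed[i]);
    printf("\n");
}
```

Under these assumptions, freeing 5 blocks starting at block 6 with the
advance in place touches blocks 6 through 10 once each.  With the advance
removed, block 6 is freed three times, block 7 twice, and blocks 8 through
10 are never freed, which is the behavior described above.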

So I'd like to know:
Is this a known bug? What is it? Is it fixed?
Is the code in 2.4.20-20 as scary as it looks?

Comment 3 Bugzilla owner 2004-09-30 15:41:43 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
