Bug 110528

Summary: File system corruption RH9/ext3
Product: [Retired] Red Hat Linux
Reporter: Curtis Regentin <cregentin>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 9
CC: cregentin, pfrields, ppokorny, sct
Target Milestone: ---
Target Release: ---
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:41:43 UTC

Description Curtis Regentin 2003-11-20 19:52:11 UTC
Description of problem:
I'm having some file-system corruption problems on three systems.  All
will reliably repeat the problem, though it often takes days of
thrashing.  All are running 3ware SATA RAID controllers (though in
differing RAID configurations), ext3, RH9 with kernel 2.4.20-18. 
There is no indication of hardware trouble - no IO errors, timeouts,
dropped disks, etc.  The systems are all dual Xeon with ECC RAM.

After some time they repeatedly report "Allocating block in system zone -
XXXXX", where XXXXX is a block number holding file-system metadata
(per dumpe2fs).  This, of course, renders the file system wholly
broken, and shortly the system becomes useless.

All three systems have been running overnight on reiserfs, with no
problems.
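For reference, the metadata ("system zone") block numbers can be listed with dumpe2fs and compared against the numbers in the kernel messages. A minimal sketch, with a hypothetical device path:

```shell
# Hypothetical device path.  List where ext3 keeps its metadata blocks
# (superblocks, bitmaps, inode tables) to compare against the block
# numbers printed in the "Allocating block in system zone" messages.
dumpe2fs /dev/sda1 | grep -E 'superblock at|bitmap at|Inode table at'
```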



Version-Release number of selected component (if applicable):
RH9, kernel 2.4.20-18

How reproducible:
Sadly, unreliable.  Many hours running a burn script.
  
Actual results:
Filesystem totally clobbered.

Expected results:
Filesystem not totally clobbered. ;)

Comment 1 Curtis Regentin 2003-11-20 19:52:40 UTC
While investigating, I found some concerns.

1)
While ext3's error behavior is tunable, this may be a case where
"goto error_return" is in order instead of knowingly smashing the
file system.  A philosophical issue, I guess.
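(The tunable error behavior referred to here is the ext3 errors setting, which controls what the file system does when it detects corruption: keep going, remount read-only, or panic. A sketch, with a hypothetical device path:

```shell
# Hypothetical device path.  Set the on-error behavior stored in the
# superblock: continue, remount-ro, or panic.
tune2fs -e remount-ro /dev/sda1

# The same behavior can also be chosen per mount:
mount -o errors=panic /dev/sda1 /mnt
```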

2)
The notes on the web site regarding 2.4.20-18 contain the following
statement:
"
A potential data corruption scenario has been identified. This
scenario can occur under heavy, complex I/O loads. The scenario
only occurs while performing memory mapped file I/O, where the
file is simultaneously unlinked and the corresponding file blocks
reallocated. Furthermore, the memory mapped writes must be to a
partial page at the end of a file on an ext3 file system. As such,
Red Hat considers this an unlikely scenario.
"
This statement is in the list of bugs fixed.  Is it fixed, or is it
identified?  If it's not fixed, what are the symptoms?

3)
I checked out the patches on the latest kernel (2.4.20-20), which
contain a bunch of new checks in the ext3 block allocation and freeing
routine.  This would lead me to believe that I'm not the only one
seeing the problem.  Also in this patch set
(linux-2.4.20-selected-ac-bits.patch) on line 48466 is the following
change:

@@ -336,7 +335,6 @@ do_more:
    wait_on_buffer (bh);
  }
  if (overflow) {
-   block += count;
    count = overflow;
    goto do_more;
  }

Now, I don't know the fs code very well, but this appears to
completely disable the freeing of block ranges spanning group
boundaries, and results in continuously freeing the same blocks at the
end of the first group over and over until "count" runs out.  It seems
to me (in my ignorance) that freeing block ranges spanning group
boundaries may be a bad thing indeed - but I would think it would
indicate an error in the code calling the free routine, and should not
be handled in the free routine by doing bizarre things.  Again, I may
be wholly ignorant in this.

If my assumptions are correct, and freeing blocks spanning group
boundaries is a problem (because metadata sits on the boundary?),
then this code would hide the problem - but cause some blocks that
should be freed to never be freed.

So I'd like to know:
Is this a known bug? What is it? Is it fixed?
Is the code in 2.4.20-20 as scary as it looks?

Comment 3 Bugzilla owner 2004-09-30 15:41:43 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/