Bug 149478 - Bug / data corruption on error handling in Ext3 under I/O failure condition
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All Linux
Priority: medium  Severity: high
Assigned To: Stephen Tweedie
QA Contact: Brian Brock
Depends On:
Blocks: 154907 156322
Reported: 2005-02-23 09:51 EST by Bruce Vessey
Modified: 2007-11-30 17:07 EST

Fixed In Version: RHSA-2005-514
Doc Type: Bug Fix
Last Closed: 2005-10-05 08:46:33 EDT


Attachments: None

Description Bruce Vessey 2005-02-23 09:51:06 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
One of our customers running RHEL 3 reported a data corruption problem.  Analysis by one of our engineers – Chuck Sluder (Charles.Sluder@unisys.com) – determined that this is the same problem that was reported in http://marc.free.net.ph/message/20050104.113409.034bcb1c.en.html  

As that posting notes, this problem exists in 2.4 and 2.6.  Chuck states that Linus has picked up the patch for 2.6.11, but the 2.4 maintainer has not yet picked it up for 2.4.30.

We’re requesting that you include this patch in the next updates of RHEL 3 and RHEL 4.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1.  Customer is able to reproduce the problem with his SCSI subsystem.

Actual Results:  Some I/Os "silently" fail without returning error conditions, thus leading to data corruption.

Expected Results:  Errors should have been reported to the application.

Additional info:

Chuck Sluder's description follows:

In the file filemap.c there are two blocks of code:
        if (status >= 0) {
                written += status;
                count -= status;
                pos += status;
                buf += status;
        }

and
        if ((status >= 0) && (file->f_flags & O_SYNC))
                status = generic_osync_inode(inode, 1); /* 1 means datasync */

This is where the error is returned.
        err = written ? written : status;

In the first block, status is the number of bytes the write operation committed. The second block says that if the write succeeded and O_SYNC is set, then call fsync on the inode (generic_osync_inode) to force the committed data to disk. The status returned by generic_osync_inode is either zero or the -EIO error code.

The problem is that if any data was committed, written is nonzero, so the final expression returns written and the error code from the fsync is ignored. This keeps happening until the write commit itself fails, at which point the commit returns -EIO. To anyone testing it, the code appears to work, but several I/Os silently fail before the error is reported. In the Hitachi case the RAID recovers before the commit fails, so they will never see any errors except in the log.
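
The return-value logic described above can be reproduced outside the kernel. The following is only a minimal user-space sketch, not the patch that was submitted upstream: the function names write_result_buggy and write_result_checked are hypothetical, and the "checked" variant is one assumed way of letting a negative sync status take precedence over the byte count.

```c
#include <errno.h>
#include <sys/types.h>   /* ssize_t */

/*
 * Mirrors the expression quoted from filemap.c:
 *     err = written ? written : status;
 * If any bytes were committed, the byte count is returned and a
 * negative status from the sync step is dropped.
 */
ssize_t write_result_buggy(ssize_t written, ssize_t status)
{
        return written ? written : status;
}

/*
 * Hypothetical alternative (an assumption for illustration, not the
 * upstream fix): a negative status is always reported, even when
 * some bytes were already committed.
 */
ssize_t write_result_checked(ssize_t written, ssize_t status)
{
        if (status < 0)
                return status;
        return written ? written : status;
}
```

With written = 4096 and status = -EIO (the sync failed after the data was committed), write_result_buggy returns 4096, so the caller sees a successful write, while write_result_checked returns -EIO. When nothing was committed, both return the error, which matches the observation that the error only surfaces once the commit itself fails.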

I searched the mailing lists for a deletion of the bad line and found that a patch for this problem was submitted on January 4, 2005.  Linus has picked it up for 2.6.11. The 2.4 maintainer has not yet picked up the patch for 2.4.30.

You can find a copy of the patch submittal here:
	http://marc.free.net.ph/message/20050104.113409.034bcb1c.en.html
Or search the kernel mailing list for the subject:
	BUG on error handlings in Ext3 under I/O failure condition
Comment 8 Bruce Vessey 2005-08-24 10:57:09 EDT
I'm curious about the status of this bug.  Did it make it into U2?  If not, what
are your plans?
Comment 9 Stephen Tweedie 2005-08-24 11:15:42 EDT
This fix is in the U2 tree and is planned for the U2 release, yes.
Comment 13 Red Hat Bugzilla 2005-10-05 08:46:34 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html
