Bug 149478

Summary: Bug / data corruption on error handling in Ext3 under I/O failure condition
Product: Red Hat Enterprise Linux 4
Reporter: Bruce Vessey <bruce.vessey>
Component: kernel
Assignee: Stephen Tweedie <sct>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: charles.sluder, davej, k.georgiou, poelstra, riel, sct
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2005-514
Doc Type: Bug Fix
Last Closed: 2005-10-05 12:46:33 UTC
Bug Blocks: 154907, 156322

Description Bruce Vessey 2005-02-23 14:51:06 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
One of our customers running RHEL 3 reported a data corruption problem. Analysis by one of our engineers, Chuck Sluder (Charles.Sluder), determined that this is the same problem that was reported in http://marc.free.net.ph/message/20050104.113409.034bcb1c.en.html

As that posting notes, this problem exists in 2.4 and 2.6.  Chuck states that Linus has picked up the patch for 2.6.11, but the 2.4 maintainer has not yet picked it up for 2.4.30.

We're requesting that you include this patch in the next updates of RHEL 3 and RHEL 4.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1.  Customer is able to reproduce the problem with his SCSI subsystem.

Actual Results:  Some I/Os "silently" fail without returning error conditions, thus leading to data corruption.

Expected Results:  Errors should have been reported to the application.

Additional info:

Chuck Sluder's description follows:

In the file filemap.c there are two blocks of code:

        if (status >= 0) {
                written += status;
                count -= status;
                pos += status;
                buf += status;
        }

and

        if ((status >= 0) && (file->f_flags & O_SYNC))
                status = generic_osync_inode(inode, 1); /* 1 means datasync */

This is where the error is returned:

        err = written ? written : status;

In the first block, status is the number of bytes the write operation committed. The second block says that if data was committed and O_SYNC is set, generic_osync_inode is called on the inode to force the committed write to disk. The status returned by generic_osync_inode is either zero or the -EIO error code. The problem is that if any data was committed, written is nonzero, so the error code from the sync is ignored. This keeps happening until the write commit itself fails, at which point the commit returns the EIO error. To anyone testing it the code appears to work, but several I/Os silently fail before an error is reported. In the Hitachi case, the RAID recovers before the commit fails, so they never see any errors except in the log.
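
To make the masking concrete, here is a minimal user-space sketch of the return expression quoted above. It is a simplified model, not kernel code and not necessarily the patch that was merged: the function names are hypothetical, and the "checked" variant assumes the status passed in is only the result of the sync step, which it lets take precedence over the byte count.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>   /* ssize_t */

/*
 * Model of the reported expression: err = written ? written : status.
 * Any nonzero byte count hides whatever status the sync step left behind.
 */
static ssize_t write_result_reported(ssize_t written, ssize_t status)
{
    return written ? written : status;
}

/*
 * Hypothetical alternative (an assumption for illustration only):
 * a negative status from the sync step is returned to the caller
 * even when some bytes were already counted as written.
 */
static ssize_t write_result_checked(ssize_t written, ssize_t sync_status)
{
    if (sync_status < 0)
        return sync_status;
    return written ? written : sync_status;
}

/* Self-check covering the cases discussed in the report. */
static void check_write_result_model(void)
{
    /* Data was committed, then the sync failed: the error is lost. */
    assert(write_result_reported(4096, -EIO) == 4096);
    /* Same inputs, checked variant: the caller sees -EIO. */
    assert(write_result_checked(4096, -EIO) == -EIO);
    /* Nothing was committed: both variants report the error. */
    assert(write_result_reported(0, -EIO) == -EIO);
    assert(write_result_checked(0, -EIO) == -EIO);
    /* No failure: both variants report the byte count. */
    assert(write_result_reported(4096, 0) == 4096);
    assert(write_result_checked(4096, 0) == 4096);
}
```

The first assertion is the silent failure described in the report: the application is told that 4096 bytes were written even though the synchronous flush returned -EIO.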

I searched the mailing lists for a deletion of the bad line and found a patch for this problem was submitted on 1/4/2005.  Linus has picked it up for 2.6.11. The 2.4 maintainer has not yet picked up the patch for 2.4.30.

You can find a copy of the patch submission here:
	http://marc.free.net.ph/message/20050104.113409.034bcb1c.en.html
Or search the kernel mailing list for the subject
	BUG on error handlings in Ext3 under I/O failure condition

Comment 8 Bruce Vessey 2005-08-24 14:57:09 UTC
I'm curious about the status of this bug.  Did it make it into U2?  If not, what
are your plans?

Comment 9 Stephen Tweedie 2005-08-24 15:15:42 UTC
This fix is in the U2 tree and is planned for the U2 release, yes.

Comment 13 Red Hat Bugzilla 2005-10-05 12:46:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html