From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
One of our customers running RHEL 3 reported a data corruption problem. Analysis by one of our engineers, Chuck Sluder (Charles.Sluder), determined that this is the same problem that was reported in
http://marc.free.net.ph/message/20050104.113409.034bcb1c.en.html

As that posting notes, this problem exists in both 2.4 and 2.6. Chuck states that Linus has picked up the patch for 2.6.11, but the 2.4 maintainer has not yet picked it up for 2.4.30. We're requesting that you include this patch in the next updates of RHEL 3 and RHEL 4.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Customer is able to reproduce the problem with his SCSI subsystem.
2.
3.

Actual Results:
Some I/Os "silently" fail without returning error conditions, thus leading to data corruption.

Expected Results:
Errors should have been reported to the application.

Additional info:
Chuck Sluder's description follows:

In the file filemap.c there are two blocks of code:

    if (status >= 0) {
            written += status;
            count -= status;
            pos += status;
            buf += status;
    }

and

    if ((status >= 0) && (file->f_flags & O_SYNC))
            status = generic_osync_inode(inode, 1); /* 1 means datasync */

This is where the error is returned:

    err = written ? written : status;

In the first block, status is the number of bytes the write operation committed to. The second block says that if the write did not fail (status >= 0) and O_SYNC is set, then call fsync on the inode to force the committed write to disk. The status returned by the generic_osync_inode call is either zero or the -EIO error return.

The problem is that if any data was committed, then written is always nonzero and the error code from the fsync is ignored. This keeps happening until the write commit itself fails, at which point the EIO error is returned by the commit.
So the code appears to work to anyone testing it, but several I/Os silently fail before an error is finally reported. In the Hitachi case the RAID recovers before the commit fails, so they will never see any errors except in the log.

I searched the mailing lists for a deletion of the bad line and found that a patch for this problem was submitted on 1/4/2005. Linus has picked it up for 2.6.11. The 2.4 maintainer has not yet picked up the patch for 2.4.30.

You can find a copy of the patch submittal here:
http://marc.free.net.ph/message/20050104.113409.034bcb1c.en.html

Or search the kernel mailing list for the subject "BUG on error handlings in Ext3 under I/O failure condition".
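
The masking described above can be illustrated in user space with a minimal sketch. This is not the kernel code and not the patch that was submitted: the function names buggy_result and checked_result and the sync_err parameter (standing in for the return value of generic_osync_inode) are hypothetical, chosen only to show how the final expression hides the sync error and one possible way to avoid that.

```c
#include <errno.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for the tail of the write path in the report.
 *   written  - total bytes committed so far
 *   status   - result of the last commit (byte count, or negative errno)
 *   o_sync   - nonzero if the file was opened with O_SYNC
 *   sync_err - assumed return of generic_osync_inode(): 0 or -EIO
 */
ssize_t buggy_result(ssize_t written, ssize_t status, int o_sync, int sync_err)
{
    if (status >= 0 && o_sync)
        status = sync_err;          /* the sync error lands in status ... */

    /* ... but is discarded whenever any data was committed. */
    return written ? written : status;
}

/*
 * One possible correction (an assumption, not the upstream patch):
 * report a sync failure even when written is nonzero.
 */
ssize_t checked_result(ssize_t written, ssize_t status, int o_sync, int sync_err)
{
    if (status >= 0 && o_sync && sync_err < 0)
        return sync_err;            /* do not let written mask the failure */

    return written ? written : status;
}
```

With written = 4096, status = 4096, O_SYNC set and a sync result of -EIO, buggy_result reports 4096 bytes written, while checked_result reports -EIO to the caller.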
I'm curious about the status of this bug. Did it make it into U2? If not, what are your plans?
This fix is in the U2 tree and is planned for the U2 release, yes.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-514.html