Bug 119033

Summary: Random ext3 filesystem corruption under heavy disk activity load
Product: Red Hat Enterprise Linux 3 Reporter: Benjamin Franz <snowhare>
Component: kernel Assignee: Doug Ledford <dledford>
Status: CLOSED NOTABUG QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 3.0 CC: leonard-rh-bugzilla, petrides, riel
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-09-19 18:44:38 UTC Type: ---

Description Benjamin Franz 2004-03-24 01:02:30 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1)
Gecko/20021003

Description of problem:
We originally installed RHEL3 on a dual-processor, hyperthreaded P4
Xeon system. After about three weeks of uptime, it developed ext3
filesystem corruption (for example, random files would suddenly appear
to have sizes in the multi-terabyte range). The corruption kept
recurring even after the filesystem was fscked, so we replaced the
server with a nearly identical machine running RH9 with a single
processor (a hyperthreaded P4 Xeon). It _also_ developed ext3
filesystem corruption after about 3 weeks of uptime. When I attempted
to delete a corrupted file entry, the entire server crashed and could
not be recovered using fsck.
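
For anyone trying to spot the same symptom, here is a rough sketch of a
read-only scan for files whose reported size is implausibly large. It is
not part of the original report: the function name and the 1 TiB
threshold are illustrative assumptions, and it only catches the
"multi-terabyte size" symptom described above, not other forms of
corruption. It only calls lstat() on entries and never opens or deletes
anything.

```python
import os
import stat

# Assumed threshold (not from the report): sizes at or above 1 TiB are suspect.
SUSPECT_SIZE_BYTES = 1 << 40


def find_suspect_files(root, threshold=SUSPECT_SIZE_BYTES):
    """Yield (path, size) for regular files whose reported size >= threshold."""
    # os.walk skips directories it cannot list; errors are ignored by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks and does not read file data.
                st = os.lstat(path)
            except OSError:
                # An entry that cannot be stat'ed is skipped in this sketch.
                continue
            if stat.S_ISREG(st.st_mode) and st.st_size >= threshold:
                yield path, st.st_size
```

Something like `for path, size in find_suspect_files("/var/qmail"): print(size, path)`
could be run from cron next to the nightly backup; the path here is only
an example. A hit is a hint to investigate, not proof of corruption,
since large sparse files can legitimately report huge sizes.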



Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Install RHEL3/RH9 on a dual or single P4 Xeon system with
hyperthreading enabled, a 3ware SATA RAID5 array, and 1 gigabyte of
RAM. Disable 'atime' for the partitions.
2. Install the qmail mail server.
3. Run under sustained mail traffic load (~40,000 messages per day) for
roughly 3 weeks.
4. Run nightly rsync backups of the entire server.

Actual Results:  Corruption in random places of the ext3 filesystem -
the corruption appears _anywhere_ in the filesystem, even in
directories where nothing has been modified.

Expected Results:  No filesystem corruption

Additional info:

Our RHEL3 server id is 1004130933. The second box is identical, except
it was running RH9 and only had one processor instead of two.

Comment 1 Benjamin Franz 2004-03-25 02:16:21 UTC
I've been doing some Google digging and discovered this may be a
3ware hardware issue. A thread at
http://forums.storagereview.net/index.php?showtopic=14162
indicates that 3ware 66 MHz products have a serious problem on the
Intel 750X chipset and some AMD boards, particularly when used with a
manufacturer riser board.

3ware appears to be trying to keep a low profile on it, but there is a
technical brief on it at
https://www.3ware.com/kbadmin/attachments/TM900-0045-00%20Rev%20A_P.pdf

Comment 3 Doug Ledford 2006-09-19 18:44:38 UTC
As the second comment pointed out, this would appear to be a 3ware issue.  We
didn't get any other reports of ext3 corruption like this.  I'm closing this bug
out as NOTABUG since it appears it was a hardware issue.