Bug 171354

Summary: Data/FS corruption caused by FS activity during RAID6 resync
Product: Red Hat Enterprise Linux 4
Reporter: Joshua Baker-LePain <joshua.bakerlepain>
Component: kernel
Assignee: Alasdair Kergon <agk>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: agk, dwysocha, jbaron, mbroz, milan.kerslager, paulw, phil, rwheeler, sct, sean, trevor
Hardware: i386   
OS: Linux   
Last Closed: 2011-02-10 01:15:20 UTC
Attachments:
Log snippet showing FS errors (hostname removed). (Flags: none)

Description Joshua Baker-LePain 2005-10-21 00:15:58 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050811 CentOS/1.0.6-1.4.1.centos3 Firefox/1.0.6

Description of problem:
I'm testing a 15-drive RAID6 array using the system tester at <http://people.redhat.com/dledford/memtest.html>. Running this test on a clean array worked fine. I then failed a drive in the array while the script was running. The resync started as it should. Soon, though, the script started returning errors, and then I saw several EXT3-fs errors in the logs (I'll put examples in an attachment).

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Create and format RAID6 array:
mdadm -C /dev/md0 -c 128 -l 6 -n 15 -x 1 /dev/sd[a-e]2 /dev/sd[f-p]1
mke2fs -b 4096 -j -m 0 -R stride=32 -T largefile /dev/md0
2. Mount array and run memtest.sh found at the web address above.
3. Fail a drive:
mdadm /dev/md0 -f /dev/sdi1
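
Editor's note: the parameters in the commands above are mutually consistent, which can be checked with a little arithmetic. The sketch below only recomputes values that appear in the steps (chunk size, block size, device counts); it does not touch any disks. The variable names are illustrative and not part of the original report.

```shell
#!/bin/sh
# Values copied from the mdadm and mke2fs commands in the steps above.
chunk_kib=128       # mdadm -c 128 (chunk size in KiB)
block_bytes=4096    # mke2fs -b 4096 (filesystem block size in bytes)
raid_devices=15     # mdadm -n 15 (active devices)
spares=1            # mdadm -x 1 (hot spares)

# The ext3 stride is the RAID chunk size expressed in filesystem blocks.
stride=$(( chunk_kib * 1024 / block_bytes ))

# /dev/sd[a-e]2 covers 5 drive letters and /dev/sd[f-p]1 covers 11.
members=0
for letter in a b c d e f g h i j k l m n o p; do
    members=$(( members + 1 ))
done

# RAID6 spends two devices' worth of capacity per stripe on parity.
data_disks=$(( raid_devices - 2 ))

echo "stride=$stride members=$members data_disks=$data_disks"
```

The computed stride matches the `-R stride=32` passed to mke2fs, and the 16 listed partitions account for the 15 active devices plus one spare.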
  

Actual Results:  Data, and eventually FS, corruption.

Expected Results:  The array should rebuild without corrupting anything.
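
Editor's note: data corruption of the kind described here can also be detected independently of memtest.sh by recording checksums before the drive is failed and verifying them during the resync. The following is a minimal sketch under stated assumptions, not the reporter's method: `TESTDIR`, the file names, and the file sizes are hypothetical, and by default the script runs in a throwaway temporary directory rather than on the array.

```shell
#!/bin/sh
# Hypothetical target directory: set TESTDIR to a directory on the mounted
# array to use this for real; the default is a scratch directory.
TESTDIR=${TESTDIR:-$(mktemp -d)}
cd "$TESTDIR" || exit 1

# Write a few files of pseudo-random data.
for i in 1 2 3 4; do
    dd if=/dev/urandom of="file$i.bin" bs=1024 count=256 2>/dev/null
done

# Record reference checksums before the drive is failed.
md5sum file*.bin > MD5SUMS

# (Fail the drive at this point, as in step 3, then verify during the resync.)

# Re-read every file and compare against the recorded checksums.
if md5sum -c MD5SUMS >/dev/null 2>&1; then
    result=ok
else
    result=corrupt
fi
echo "verification: $result"
```

Note that a verification run immediately after writing may be served from the page cache rather than from the disks, so a check like this is more meaningful with a data set larger than RAM or after remounting the filesystem.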

Additional info:

I've tried this on two similar servers. Both have two 3ware 7500-8 controllers in JBOD mode. One server has a Supermicro P4DPE-G2 motherboard, 4GB RAM, dual 2.2GHz Xeons, and 16 Maxtor 160GB drives. The other has a Supermicro X5DPE-G2 board, 2GB RAM, dual 2.4GHz Xeons, and 16 IBM 180GB drives.

Comment 1 Joshua Baker-LePain 2005-10-21 00:20:03 UTC
Created attachment 120227
Log snippet showing FS errors (hostname removed).

Comment 2 Philippe Troin 2007-05-24 01:10:52 UTC
I'm seeing the same problem with RHEL4.4 as well.
Actually, the RAID fs is corrupted right out of install.

This patch might fix the problem:
http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.1938.340.65

Phil.

Comment 3 Philippe Troin 2007-05-26 01:26:40 UTC
Fixed in 4.5.
Thanks.
Phil.


Comment 4 Philippe Troin 2007-05-30 20:49:11 UTC
Actually, no, I am still witnessing corruption with 2.6.9-55.ELsmp.
RAID-6 is a no-go.
Phil.

Comment 5 Alasdair Kergon 2011-02-10 01:15:20 UTC
This got routed to the wrong place, I'm afraid - it's not my area.  If you still need support please use https://www.redhat.com/support/ referencing this bugzilla to obtain more personal attention.