Bug 171354 - Data/FS corruption caused by FS activity during RAID6 resync
Summary: Data/FS corruption caused by FS activity during RAID6 resync
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Alasdair Kergon
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-10-21 00:15 UTC by Joshua Baker-LePain
Modified: 2011-02-10 01:15 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-10 01:15:20 UTC
Target Upstream Version:
Embargoed:


Attachments
Log snippet showing FS errors (hostname removed). (3.76 KB, text/plain)
2005-10-21 00:20 UTC, Joshua Baker-LePain

Description Joshua Baker-LePain 2005-10-21 00:15:58 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050811 CentOS/1.0.6-1.4.1.centos3 Firefox/1.0.6

Description of problem:
I'm testing a 15-drive RAID6 array using the system tester at <http://people.redhat.com/dledford/memtest.html>. Running this test on a clean array worked fine. I then failed a drive in the array while the script was running. The resync started as it should. Soon, though, the script started returning errors, and then I saw several EXT3-fs errors in the logs (I'll put examples in an attachment).

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Create and format RAID6 array:
mdadm -C /dev/md0 -c 128 -l 6 -n 15 -x 1 /dev/sd[a-e]2 /dev/sd[f-p]1
mke2fs -b 4096 -j -m 0 -R stride=32 -T largefile /dev/md0
2. Mount array and run memtest.sh found at the web address above.
3. Fail a drive:
mdadm /dev/md0 -f /dev/sdi1
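
For reference, the geometry implied by the commands in step 1 can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and not part of mdadm or e2fsprogs, and it assumes the usual convention that the ext3 stride is the md chunk size divided by the filesystem block size. The inputs are the values from the report (-c 128, -b 4096, -n 15). The two globs name 5 + 11 = 16 partitions, which matches 15 active devices plus the one spare requested with -x 1.

```shell
#!/bin/sh
# Sketch: derive the ext3 stride and the RAID6 data-disk count from the
# parameters used in step 1. Function names are hypothetical.

# stride (in FS blocks) = chunk size in KiB * 1024 / FS block size in bytes
ext3_stride() {
    chunk_kib=$1
    block_bytes=$2
    echo $(( chunk_kib * 1024 / block_bytes ))
}

# RAID6 keeps two parity blocks per stripe, so data disks = active devices - 2
raid6_data_disks() {
    active=$1
    echo $(( active - 2 ))
}

ext3_stride 128 4096     # chunk from "-c 128", block size from "-b 4096"
raid6_data_disks 15      # active devices from "-n 15"
```

With these inputs the stride comes out to 32, which agrees with the -R stride=32 passed to mke2fs, and each stripe spans 13 data chunks.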
  

Actual Results:  Data, and eventually FS, corruption.

Expected Results:  The array should rebuild without corrupting anything.
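
One way to tell data corruption apart from filesystem-level errors, independent of memtest.sh, is to record checksums of a test tree before failing the drive and compare them again once the resync is under way or finished. The following is a minimal sketch under stated assumptions: the function names and the example paths in the comments are hypothetical, and it uses the POSIX cksum utility (a CRC, which is enough to catch accidental corruption but is not cryptographic).

```shell
#!/bin/sh
# Sketch: checksum manifest for a directory tree. Names and paths are
# hypothetical; this is not how memtest.sh itself works.

# Write a manifest (checksum, size, path), sorted by path, for every
# regular file under directory $1 into file $2. Keep $2 outside of $1.
make_manifest() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort -k 3 ) > "$2"
}

# Return 0 if the tree under $1 still matches manifest $2, non-zero if not.
verify_manifest() {
    tmp=$(mktemp) || return 2
    make_manifest "$1" "$tmp"
    cmp -s "$tmp" "$2"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example use (paths are assumptions):
#   make_manifest /mnt/md0/testdata /root/before.cksum
#   mdadm /dev/md0 -f /dev/sdi1
#   verify_manifest /mnt/md0/testdata /root/before.cksum || echo "data changed"
```

A mismatch here with no EXT3-fs errors in the logs would point at file contents changing underneath an otherwise consistent filesystem.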

Additional info:

I've tried this on two similar servers. Both have two 3ware 7500-8 controllers in JBOD mode. One server has a Supermicro P4DPE-G2 motherboard, 4GB RAM, dual 2.2GHz Xeons, and 16 Maxtor 160GB drives. The other has a Supermicro X5DPE-G2 board, 2GB RAM, dual 2.4GHz Xeons, and 16 IBM 180GB drives.

Comment 1 Joshua Baker-LePain 2005-10-21 00:20:03 UTC
Created attachment 120227
Log snippet showing FS errors (hostname removed).

Comment 2 Philippe Troin 2007-05-24 01:10:52 UTC
I'm seeing the same problem with RHEL4.4 as well.
Actually, the RAID fs is corrupted right out of install.

This patch might fix the problem:
http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.1938.340.65

Phil.

Comment 3 Philippe Troin 2007-05-26 01:26:40 UTC
Fixed in 4.5.
Thanks.
Phil.


Comment 4 Philippe Troin 2007-05-30 20:49:11 UTC
Actually, no, I am still witnessing corruption with 2.6.9-55.ELsmp.
RAID-6 is a no-go.
Phil.

Comment 5 Alasdair Kergon 2011-02-10 01:15:20 UTC
This got routed to the wrong place, I'm afraid - it's not my area. If you still need support, please use https://www.redhat.com/support/ and reference this bugzilla to obtain more personal attention.

