171354 – Data/FS corruption caused by FS activity during RAID6 resync

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Alasdair Kergon
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-10-21 00:15 UTC by Joshua Baker-LePain
Modified:	2011-02-10 01:15 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-02-10 01:15:20 UTC
Target Upstream Version:
Embargoed:

Attachments
Log snippet showing FS errors (hostname removed). (3.76 KB, text/plain) 2005-10-21 00:20 UTC, Joshua Baker-LePain

Description Joshua Baker-LePain 2005-10-21 00:15:58 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050811 CentOS/1.0.6-1.4.1.centos3 Firefox/1.0.6

Description of problem:
I'm testing a 15 drive RAID6 using the system tester at <http://people.redhat.com/dledford/memtest.html>.  Running this test on a clean array worked fine.  I then failed a drive in the array with the script running.  The resync started as it should.  Soon, though, the script started returning errors, and then I saw several EXT3-fs errors in the logs (I'll put examples in an attachment).

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Create and format RAID6 array:
mdadm -C /dev/md0 -c 128 -l 6 -n 15 -x 1 /dev/sd[a-e]2 /dev/sd[f-p]1
mke2fs -b 4096 -j -m 0 -R stride=32 -T largefile /dev/md0
2. Mount array and run memtest.sh found at the web address above.
3. Fail a drive:
mdadm /dev/md0 -f /dev/sdi1
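
Editorial note on the parameters in step 1: the stride passed to mke2fs appears consistent with the chunk size passed to mdadm. A minimal sketch of the arithmetic follows. It assumes the usual definitions: mdadm's -c value is in KiB, stride is the chunk size divided by the filesystem block size, and RAID6 uses two devices' worth of each stripe for parity. The variable names are illustrative and not from the report.

```shell
#!/bin/sh
# Values taken from the commands in step 1.
chunk_kib=128        # mdadm -c 128 (chunk size, KiB)
block_bytes=4096     # mke2fs -b 4096 (filesystem block size, bytes)
active=15            # mdadm -n 15 (active devices)
spares=1             # mdadm -x 1 (spare devices)

# stride = filesystem blocks per RAID chunk
stride=$(( chunk_kib * 1024 / block_bytes ))

# RAID6 keeps two parity blocks per stripe, so 13 of 15 chunks hold data.
data_disks=$(( active - 2 ))
full_stripe_kib=$(( data_disks * chunk_kib ))

# Total devices the mdadm command line must list (active + spare).
total_devices=$(( active + spares ))

echo "stride=$stride data_disks=$data_disks full_stripe_kib=$full_stripe_kib total_devices=$total_devices"
```

The 16 total devices match the command line (five partitions from sd[a-e]2 plus eleven from sd[f-p]1), so the reported corruption is unlikely to stem from a mismatch in these values.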
  

Actual Results:  Data, and eventually FS, corruption.

Expected Results:  The array should rebuild without corrupting anything.

Additional info:

I've tried this on two similar servers.  Both have two 3ware 7500-8 controllers in JBOD mode.  One server has a Supermicro P4DPE-G2 motherboard, 4GB RAM, dual 2.2GHz Xeons, and 16 Maxtor 160GB drives.  The other has a Supermicro X5DPE-G2 board, 2GB RAM, dual 2.4GHz Xeons, and 16 IBM 180GB drives.

Comment 1 Joshua Baker-LePain 2005-10-21 00:20:03 UTC

Created attachment 120227
Log snippet showing FS errors (hostname removed).

Comment 2 Philippe Troin 2007-05-24 01:10:52 UTC

I'm seeing the same problem with RHEL4.4 as well.
Actually, the RAID fs is corrupted right out of install.

This patch might fix the problem:
http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.1938.340.65

Phil.

Comment 3 Philippe Troin 2007-05-26 01:26:40 UTC

Fixed in 4.5.
Thanks.
Phil.

Comment 4 Philippe Troin 2007-05-30 20:49:11 UTC

Actually, no, I am still witnessing corruption with 2.6.9-55.ELsmp.
RAID-6 is a no-go.
Phil.

Comment 5 Alasdair Kergon 2011-02-10 01:15:20 UTC

This got routed to the wrong place, I'm afraid - it's not my area.  If you still need support please use https://www.redhat.com/support/ referencing this bugzilla to obtain more personal attention.
