Bug 171354

Summary:

Data/FS corruption caused by FS activity during RAID6 resync

Product:

Red Hat Enterprise Linux 4

Reporter:

Joshua Baker-LePain <joshua.bakerlepain>

Component:

kernel

Assignee:

Alasdair Kergon <agk>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Brian Brock <bbrock>

Severity:

high

Priority:

medium

Version:

4.0

CC:

agk, dwysocha, jbaron, mbroz, milan.kerslager, paulw, phil, rwheeler, sct, sean, trevor

Hardware:

i386

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2011-02-10 01:15:20 UTC

Attachments:

Description	Flags
Log snippet showing FS errors (hostname removed).	none

Description Joshua Baker-LePain 2005-10-21 00:15:58 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050811 CentOS/1.0.6-1.4.1.centos3 Firefox/1.0.6

Description of problem:
I'm testing a 15-drive RAID6 array using the system tester at <http://people.redhat.com/dledford/memtest.html>. Running this test on a clean array worked fine. I then failed a drive in the array while the script was running. The resync started as it should. Soon, though, the script started returning errors, and then I saw several EXT3-fs errors in the logs (I'll put examples in an attachment).

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Create and format RAID6 array:
mdadm -C /dev/md0 -c 128 -l 6 -n 15 -x 1 /dev/sd[a-e]2 /dev/sd[f-p]1
mke2fs -b 4096 -j -m 0 -R stride=32 -T largefile /dev/md0
2. Mount array and run memtest.sh found at the web address above.
3. Fail a drive:
mdadm /dev/md0 -f /dev/sdi1
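
For reference, the stride value in step 1 follows from the chunk and block sizes: mdadm's -c takes the chunk size in KiB and mke2fs's stride is counted in filesystem blocks, so 128 KiB / 4096 bytes = 32. The sketch below just reproduces that arithmetic. The helper names ext3_stride and raid6_data_disks are made up for illustration and are not part of mdadm or e2fsprogs.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking the parameters in step 1.

# ext3_stride CHUNK_KIB BLOCK_BYTES
# Prints the number of filesystem blocks that fit in one RAID chunk.
ext3_stride() {
    chunk_bytes=$(( $1 * 1024 ))
    echo $(( chunk_bytes / $2 ))
}

# raid6_data_disks ACTIVE_DEVICES
# RAID6 keeps two parity blocks per stripe, so data disks = active - 2.
raid6_data_disks() {
    echo $(( $1 - 2 ))
}

ext3_stride 128 4096    # prints 32, matching -R stride=32
raid6_data_disks 15     # prints 13
```

The device globs also add up: /dev/sd[a-e]2 is 5 partitions and /dev/sd[f-p]1 is 11, giving the 15 active devices plus 1 spare requested with -n 15 -x 1.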
  

Actual Results:  Data, and eventually FS, corruption.

Expected Results:  The array should rebuild without corrupting anything.
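
One way to confirm data corruption independently of the EXT3-fs log messages is to checksum a set of files before failing the drive and re-verify them during and after the resync. This is only a minimal sketch and is not what memtest.sh does; the function names are hypothetical, and it assumes GNU md5sum and find are available. The manifest path should be absolute and should live outside the directory being checked (ideally on a different device than the array).

```shell
#!/bin/sh
# Hypothetical helpers; not part of memtest.sh.

# make_manifest DIR MANIFEST
# Records an MD5 checksum for every regular file under DIR.
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + ) > "$2"
}

# verify_manifest DIR MANIFEST
# Re-reads the files and compares them with the recorded checksums.
# Returns 0 if every file still matches, non-zero otherwise.
verify_manifest() {
    ( cd "$1" && md5sum -c "$2" > /dev/null 2>&1 )
}
```

A possible sequence would be to run make_manifest on the mounted array before step 3, then call verify_manifest repeatedly while the resync is in progress; a non-zero return indicates that at least one file no longer matches.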

Additional info:

I've tried this on two similar servers. Both have two 3ware 7500-8 controllers in JBOD mode. One server has a Supermicro P4DPE-G2 motherboard, 4GB RAM, dual 2.2GHz Xeons, and 16 Maxtor 160GB drives. The other has a Supermicro X5DPE-G2 board, 2GB RAM, dual 2.4GHz Xeons, and 16 IBM 180GB drives.

Comment 1 Joshua Baker-LePain 2005-10-21 00:20:03 UTC

Created attachment 120227
Log snippet showing FS errors (hostname removed).

Comment 2 Philippe Troin 2007-05-24 01:10:52 UTC

I'm seeing the same problem with RHEL4.4 as well.
Actually, the RAID fs is corrupted right out of install.

This patch might fix the problem:
http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.1938.340.65

Phil.

Comment 3 Philippe Troin 2007-05-26 01:26:40 UTC

Fixed in 4.5.
Thanks.
Phil.

Comment 4 Philippe Troin 2007-05-30 20:49:11 UTC

Actually, no, I am still witnessing corruption with 2.6.9-55.ELsmp.
RAID-6 is a no-go.
Phil.

Comment 5 Alasdair Kergon 2011-02-10 01:15:20 UTC

This got routed to the wrong place, I'm afraid - it's not my area. If you still need support, please use https://www.redhat.com/support/ and reference this bugzilla to obtain more personal attention.