Bug 108613 - raid5 corruption whenever drive is lost
Status: CLOSED DUPLICATE of bug 109251
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 9
Platform: i686 Linux
Priority: medium
Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brian Brock
Reported: 2003-10-30 13:09 EST by Hrunting Johnson
Modified: 2007-04-18 12:58 EDT

Doc Type: Bug Fix
Last Closed: 2006-02-21 13:59:34 EST
Attachments: None

Description Hrunting Johnson 2003-10-30 13:09:23 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5) Gecko/20031007 Firebird/0.7

Description of problem:
We have several dual-Xeon, hyperthreaded systems with 12- and 16-port PATA drive
configurations, set up with all drives partitioned and the partitions then included
in four raid5 arrays.  Occasionally, drives in these systems will fail.  Sometimes
(not all the time, probably about 75% of the time), when a drive fails and gets
failed out of the array, data on the array becomes corrupt, as evidenced by numerous
EXT3-fs file corruption errors.  The data is lost.  Almost everything is blown away.

In all cases, we're using 3ware cards, but some are 3ware SATA cards (with PATA
to SATA converters) and others are straight 3ware PATA cards.  Drives are all
Western Digital, but they vary in size and model.

These arrays are under heavy read/write I/O (iostat reports drive util at 70-80%
constantly).
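
For reference, the layout described above would correspond to a raidtab stanza
roughly like the one below for each of the four arrays.  This is only an
illustrative sketch in raidtools syntax; the device names, disk count, chunk size,
and parity algorithm are assumptions, not values taken from the affected systems:

raiddev /dev/md0
    raid-level              5
    nr-raid-disks           16
    nr-spare-disks          0
    persistent-superblock   1
    parity-algorithm        left-symmetric
    chunk-size              64
    # one partition from each physical drive; the 3ware controllers
    # present the drives to the kernel as SCSI disks (/dev/sdX)
    device                  /dev/sda1
    raid-disk               0
    device                  /dev/sdb1
    raid-disk               1
    # ... remaining partitions continue in the same pattern up to raid-disk 15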

Version-Release number of selected component (if applicable):
kernel-2.4.20-20.9

How reproducible:
Sometimes

Steps to Reproduce:
1. build RAID5 array
2. wait for drive to fail with Medium Sense error
3. witness corruption
    

Actual Results:  Filesystem atop affected RAID5 array corrupts

Expected Results:  Filesystem atop affected RAID5 array does not corrupt, as
array is redundant and can afford to lose one drive.

Additional info:

I have been in contact with both 3ware and Western Digital.  We have gone very
low-level on this.  The 3ware looks like it is passing back the errors from the
drives correctly.  Western Digital confirms that the drives are seeing the
errors and reporting them, and they coincide with the log messages being sent to
syslog from the 3ware.  The drives typically fail out only after the read
request (and it's always a read request that fails, usually with an ECC error
from the drive) fails 5 times.  Corruption is only evident AFTER the drive is
removed.
Comment 1 Hrunting Johnson 2003-11-03 10:15:20 EST
I now have systems, again with 3ware cards (7810s, JBOD mode, PATA 
cards), that produce this corruption, this time with Seagate drives.
Comment 2 Hrunting Johnson 2003-11-04 09:20:05 EST
This is definitely a problem with the raid5 code.  I have a system 
with 16 drives.  I partition each drive into 4 partitions, then 
create 4 RAID5 arrays with 1 partition from each drive.  I then put 
an array under intensive I/O (both read and write) and then 
do 'raidsetfaulty <array> <partition>'.  Immediately I see filesystem 
corruption on the filesystem on that array.  Here are a few of the 
messages:

Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1773623376, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1847329996, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=2080538872, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1963576516, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1813626640, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1945409144, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1781930152, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1795790536, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1872381408, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=2020518828, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1796667100, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=2006427588, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800
Comment 3 Hrunting Johnson 2003-11-04 09:54:16 EST
Also, with the direct raidsetfaulty method, this is ALWAYS 
reproducible.
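
Spelled out, the deterministic reproduction from comments 2 and 3 amounts to
roughly the following sequence.  This is a hedged sketch using raidtools-era
commands; the md device, component partition, mount point, and the background
I/O load are illustrative assumptions:

    # start one of the raid5 arrays and put an ext3 filesystem on it
    raidstart /dev/md0
    mke2fs -j /dev/md0
    mount /dev/md0 /mnt/test

    # keep the array under heavy mixed read/write I/O in the background
    (while true; do cp -a /usr /mnt/test/copy && rm -rf /mnt/test/copy; done) &

    # manually fail one component partition out of the running array
    raidsetfaulty /dev/md0 /dev/sda1

    # corruption shows up immediately: EXT3-fs errors and
    # "attempt to access beyond end of device" messages in the kernel log
    dmesg | tail -n 30
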
Comment 4 Hrunting Johnson 2003-11-06 17:55:22 EST
The bug report in 109251 gives a much more detailed explanation of why I believe
this is happening.

*** This bug has been marked as a duplicate of 109251 ***
Comment 5 Red Hat Bugzilla 2006-02-21 13:59:34 EST
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
