Bug 108613

Summary: raid5 corruption whenever drive is lost
Product: [Retired] Red Hat Linux
Component: kernel
Version: 9
Hardware: i686
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Reporter: Hrunting Johnson <hrunting>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
CC: riel
Doc Type: Bug Fix
Last Closed: 2006-02-21 18:59:34 UTC

Description Hrunting Johnson 2003-10-30 18:09:23 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5)
Gecko/20031007 Firebird/0.7

Description of problem:
We have several dual-Xeon, hyperthreaded systems with 12- and 16-port PATA cards,
set up with every drive partitioned and the partitions included in four RAID5
arrays.  Occasionally, drives in these systems fail.  Sometimes (not all the
time, probably about 75% of the time), when a drive fails and gets failed out of
the array, data on the array becomes corrupt, as evidenced by numerous EXT3-fs
file corruption errors.  The data is lost; almost everything is blown away.

In all cases we're using 3ware cards, but some are 3ware SATA cards (with
PATA-to-SATA converters) and others are straight 3ware PATA cards.  The drives
are all Western Digital, but they vary in size and model.

These arrays are under heavy read/write I/O (iostat reports drive util at 70-80%
constantly).

Version-Release number of selected component (if applicable):
kernel-2.4.20-20.9

How reproducible:
Sometimes

Steps to Reproduce:
1. build a RAID5 array (see the raidtools sketch after this list)
2. wait for drive to fail with Medium Sense error
3. witness corruption
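
For step 1, a minimal sketch of how one such array might be built with the
raidtools shipped with this kernel series (the 2.4 md driver and the
raidsetfaulty command used later suggest raidtools rather than mdadm).  The
device names, member count, chunk size, and mount point below are illustrative
assumptions, not values taken from this report.

# /etc/raidtab entry for one hypothetical array (one partition per drive)
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           4
    nr-spare-disks          0
    persistent-superblock   1
    chunk-size              64
    device                  /dev/sda1
    raid-disk               0
    device                  /dev/sdb1
    raid-disk               1
    device                  /dev/sdc1
    raid-disk               2
    device                  /dev/sdd1
    raid-disk               3

# Build the array from /etc/raidtab, create the ext3 filesystem used in
# the report, and mount it
mkraid /dev/md0
mke2fs -j /dev/md0
mount /dev/md0 /mnt/test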
    

Actual Results:  Filesystem atop affected RAID5 array corrupts

Expected Results:  Filesystem atop affected RAID5 array does not corrupt, as
array is redundant and can afford to lose one drive.

Additional info:

I have been in contact with both 3ware and Western Digital, and we have gone
very low-level on this.  The 3ware card appears to be passing the errors back
from the drives correctly.  Western Digital confirms that the drives are seeing
and reporting the errors, and they coincide with the log messages the 3ware
driver sends to syslog.  A drive is typically failed out of the array only
after a read request fails 5 times (it is always a read request that fails,
usually with an ECC error from the drive).  Corruption is only evident AFTER
the drive is removed.

Comment 1 Hrunting Johnson 2003-11-03 15:15:20 UTC
I now have systems, again with 3ware cards (7810s, JBOD mode, PATA 
cards), that produce this corruption, this time with Seagate drives.

Comment 2 Hrunting Johnson 2003-11-04 14:20:05 UTC
This is definitely a problem with the raid5 code.  I have a system 
with 16 drives.  I partition each drive into 4 partitions, then 
create 4 RAID5 arrays with 1 partition from each drive.  I then put 
an array under intensive I/O (both read and write) and run 
'raidsetfaulty <array> <partition>'.  I immediately see corruption on 
that array's filesystem.  Here are a few of the messages:

Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1773623376, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1847329996, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=2080538872, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1963576516, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1813626640, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1945409144, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1781930152, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1795790536, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1872381408, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=2020518828, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1796667100, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=2006427588, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800
Nov  4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov  4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800

Comment 3 Hrunting Johnson 2003-11-04 14:54:16 UTC
Also, with the direct raidsetfaulty method, this is ALWAYS 
reproducible.
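
For reference, a hedged sketch of the fault-injection reproduction described in
Comments 2 and 3.  The md device, member partition, workload, and mount point
are assumptions for illustration only, not taken from this report.

# Keep the array busy with mixed read/write I/O (illustrative workload)
dd if=/dev/zero of=/mnt/test/writer bs=1M count=8192 &
dd if=/dev/md0 of=/dev/null bs=1M &

# Mark one member of the array as faulty, as in Comment 2
raidsetfaulty /dev/md0 /dev/sda1

# Corruption then shows up immediately in the kernel log and from ext3
dmesg | grep -E 'attempt to access beyond end of device|EXT3-fs'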

Comment 4 Hrunting Johnson 2003-11-06 22:55:22 UTC
Bug 109251 gives a much more detailed explanation of why I believe 
this is happening.

*** This bug has been marked as a duplicate of 109251 ***

Comment 5 Red Hat Bugzilla 2006-02-21 18:59:34 UTC
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.