Bug 102984

Summary: Software RAID does not mark bad drive failed; corruption ensues
Product: [Retired] Red Hat Linux
Component: kernel
Version: 9
Hardware: i686
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Reporter: Hrunting Johnson <hrunting>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
CC: riel
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:41:28 UTC

Description Hrunting Johnson 2003-08-24 06:24:08 UTC

Description of problem:
We have a 12-port 3ware 8500-12 card with 12 WD 250GB drives using software
RAID5 across all drives.  Some of those drives occasionally fail, and we see
errors like this in our logs:

3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #11.

Port 11 is most likely a bad drive, maybe a bad cable, but regardless, there are
errors writing to and/or reading from disk.  However, the software RAID5 never
marks that drive as failed.  Those errors keep spewing to the logs (as if the
system is still attempting to write to the bad port) and the machine becomes
unresponsive.  The only sign that it is alive is the constant repeating log
message.  A manual power-cycle is required.  On reboot, the errors appear again
as the drive is re-added and a resync begins, and the machine fails to boot.  If
by some miracle the drive is marked bad across all RAID arrays and an fsck is
performed on the ext3 filesystem(s), there is almost always severe data corruption.
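Since the only warning is the repeating log line, the failing port can at least be spotted from the logs before the machine wedges. A minimal sketch (not part of the original report; the sample lines are copied from the errors above, and in practice the input would be /var/log/messages or dmesg output):

```shell
# Sketch: tally repeated 3w-xxxx command failures per unit in a log excerpt,
# so the failing port can be identified before the machine becomes unresponsive.
log='3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #11.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x40, unit #11.
3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x57, unit #11.'

# Extract the unit number from each failure line and count occurrences.
echo "$log" | sed -n 's/.*unit #\([0-9]*\).*/\1/p' | sort | uniq -c
```

A steadily growing count against a single unit is the pattern described in this report.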

Version-Release number of selected component (if applicable):
kernel-2.4.20-19.9

How reproducible:
Always

Steps to Reproduce:
1. Run software RAID5 on a 3ware 8500-12
2. Have a bad drive
3. Watch the error logs
    

Actual Results:  This error (the "flags =" value varies among occurrences, e.g.
0xd0, 0x40, and 0x57):

3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #11.

is printed repeatedly to either the error log or console and the machine
requires a reboot.

Expected Results:  The system stops trying to write to unit #11 completely and
just marks the drive failed until it is either manually resynced or replaced.
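Until the md layer does this on its own, the bad member has to be failed and removed by hand. A minimal sketch (not from the report; the /proc/mdstat snapshot and device names below are hypothetical, and the layout assumed is the 2.4-era md format):

```shell
# Sketch: detect a degraded md array from a /proc/mdstat snapshot.
# The snapshot and device names are hypothetical examples.
mdstat='md0 : active raid5 sdl1[11] sdk1[10] sdj1[9] sdi1[8] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      2441900544 blocks level 5, 64k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]'

# The bracketed status shows one "U" per healthy member; an underscore
# marks a failed slot.
if echo "$mdstat" | grep -q '\[U*_U*\]'; then
    echo "array degraded"
else
    echo "array healthy"
fi

# Once the bad member is known, it can be failed and removed manually, e.g.:
#   mdadm --manage /dev/md0 --fail /dev/sdl1
#   mdadm --manage /dev/md0 --remove /dev/sdl1
```

On a raidtools-based Red Hat 9 setup, `raidsetfaulty` and `raidhotremove` serve the same purpose as the mdadm commands above.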

Additional info:

We have tried 3ware driver versions ranging from the default shipped with the
2.4.20-19.9 kernel (1.02.00.032) to the latest driver matching the latest
firmware on the card (1.02.00.036).  I have contacted 3ware regarding this issue
and they were mostly unhelpful.  As far as they are concerned, the driver is
reporting the error to the kernel, and it is up to the operating system and/or
software RAID layer to mark the drive failed.

Comment 1 Bugzilla owner 2004-09-30 15:41:28 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/