Bug 102984 - Software RAID does not mark bad drive failed; corruption ensues
Status: CLOSED WONTFIX
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 9
Hardware: i686 Linux
Priority: medium  Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brian Brock
Reported: 2003-08-24 02:24 EDT by Hrunting Johnson
Modified: 2007-04-18 12:57 EDT
Doc Type: Bug Fix
Last Closed: 2004-09-30 11:41:28 EDT


Description Hrunting Johnson 2003-08-24 02:24:08 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030811
Mozilla Firebird/0.6.1

Description of problem:
We have a 12-port 3ware 8500-12 card with 12 WD 250GB drives using software
RAID5 across all drives.  Some of those drives occasionally fail, and we see
errors like this in our logs:

3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #11.

Port 11 is most likely a bad drive, or possibly a bad cable; either way, there
are errors writing to and/or reading from the disk.  However, the software RAID5
never marks that drive as failed.  The errors keep spewing to the logs (as if the
system is still attempting to write to the bad port) and the machine becomes
unresponsive.  The only sign that it is alive is the constantly repeating log
message.  A manual power-cycle is required.  On reboot, the errors appear again
as the drive is remounted and a resync begins, and the machine fails to boot.  If
by some miracle the drive is marked bad across all RAIDs and an fsck is
performed on the ext3 filesystem(s), there is almost always severe data corruption.

Version-Release number of selected component (if applicable):
kernel-2.4.20-19.9

How reproducible:
Always

Steps to Reproduce:
1. Run software RAID5 on 3ware 8500-12
2. Have bad drive
3. Watch error logs
    

Actual Results:  This error (the "flags =" value varies; 0xd0, 0x40, and 0x57
have all been seen):

3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #11.

is printed repeatedly to either the error log or console and the machine
requires a reboot.
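
For anyone triaging a similar log flood, here is a minimal sketch (not part of the original report) that tallies these driver messages per unit. It assumes the log lines match the message quoted above exactly; the function name count_failures is made up for illustration.

```python
import re
from collections import Counter

# Matches the 3w-xxxx driver message quoted in this report.
# The status and flags values are captured but only the unit is tallied.
LINE_RE = re.compile(
    r"3w-xxxx: scsi(?P<host>\d+): Command failed: "
    r"status = (?P<status>0x[0-9a-fA-F]+), "
    r"flags = (?P<flags>0x[0-9a-fA-F]+), "
    r"unit #(?P<unit>\d+)\."
)


def count_failures(lines):
    """Return a Counter mapping unit number -> count of failed commands."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[int(match.group("unit"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_failures.py < /var/log/messages
    for unit, n in sorted(count_failures(sys.stdin).items()):
        print(f"unit #{unit}: {n} failed commands")
```

A unit whose count keeps climbing while the array still lists it as active would be consistent with the behavior described here.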

Expected Results:  The system stops trying to write to unit #11 completely and
just marks the drive failed until it is either manually resynced or replaced.
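
Whether md has actually marked a member failed can be checked from /proc/mdstat, where failed members carry an "(F)" suffix. The sketch below (not part of the original report) extracts those members from mdstat text; the function name failed_members and the device names in the usage comment are hypothetical, and the exact mdstat layout is assumed rather than taken from this system.

```python
import re


def failed_members(mdstat_text):
    """Map each md array name to the member devices flagged (F)."""
    failed = {}
    for line in mdstat_text.splitlines():
        match = re.match(r"^(md\d+) : (.*)$", line)
        if not match:
            continue
        name, rest = match.groups()
        # Members look like "sdl1[11]"; failed ones like "sdl1[11](F)".
        failed[name] = re.findall(r"([^\s\[]+)\[\d+\]\(F\)", rest)
    return failed


if __name__ == "__main__":
    # Hypothetical example line: "md0 : active raid5 sdl1[11](F) sdk1[10] sda1[0]"
    with open("/proc/mdstat") as f:
        for array, members in failed_members(f.read()).items():
            print(array, "failed members:", members or "none")
```

In the situation reported here, the expectation is that the device behind unit #11 would show up in this list; the bug is that it never does.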

Additional info:

The 3ware driver versions tried range from the default version shipped with the
2.4.20-19.9 kernel (1.02.00.032) to the latest driver matching the latest
firmware on the card (1.02.00.036).  I have contacted 3ware regarding this issue
and they were mostly unhelpful.  As far as they're concerned, the driver is
reporting the error to the kernel and it's up to the operating system and/or
software RAID application to mark the drive failed.
Comment 1 Bugzilla owner 2004-09-30 11:41:28 EDT
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
