Bug 102984 - Software RAID does not mark bad drive failed; corruption ensues
Summary: Software RAID does not mark bad drive failed; corruption ensues
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 9
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2003-08-24 06:24 UTC by Hrunting Johnson
Modified: 2007-04-18 16:57 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-09-30 15:41:28 UTC
Embargoed:



Description Hrunting Johnson 2003-08-24 06:24:08 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030811
Mozilla Firebird/0.6.1

Description of problem:
We have a 12-port 3ware 8500-12 card with 12 WD 250GB drives using software
RAID5 across all drives.  Some of those drives occasionally fail, and we see
errors like this in our logs:

3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #11.

Port 11 is most likely a bad drive, or possibly a bad cable, but either way there
are errors writing to and/or reading from that disk.  However, the software RAID5
never marks the drive as failed.  The errors keep spewing to the logs (as if the
system is still attempting to write to the bad port) and the machine becomes
unresponsive.  The only sign that it is alive is the constantly repeating log
message.  A manual power-cycle is required.  On reboot, the errors reappear as
the drive is remounted, a resync begins, and the machine fails to boot.  If by
some miracle the drive is marked bad across all RAIDs and an fsck is performed
on the ext3 filesystem(s), there is almost always severe data corruption.

Version-Release number of selected component (if applicable):
kernel-2.4.20-19.9

How reproducible:
Always

Steps to Reproduce:
1. Run software RAID5 on a 3ware 8500-12
2. Have a bad drive
3. Watch the error logs

Actual Results:  This error (the "flags =" value varies; we have seen 0xd0,
0x40, and 0x57):

3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #11.

is printed repeatedly to either the error log or console and the machine
requires a reboot.
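Since the kernel never fails the drive on its own, one possible stopgap is to watch the logs for this message and intervene manually once a unit starts repeating. The following is a minimal sketch of such a log-watcher (a hypothetical helper, not part of the 3ware driver or the md layer) that tallies 3w-xxxx command failures per controller and unit:

```python
import re

# Pattern for the 3w-xxxx driver's failure message quoted above;
# status/flags values vary, the unit number identifies the failing port.
LINE_RE = re.compile(
    r"3w-xxxx: scsi(\d+): Command failed: "
    r"status = (0x[0-9a-fA-F]+), flags = (0x[0-9a-fA-F]+), unit #(\d+)\."
)

def count_failures(log_lines):
    """Tally 3w-xxxx command failures per (scsi host, unit)."""
    counts = {}
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            key = (int(m.group(1)), int(m.group(4)))
            counts[key] = counts.get(key, 0) + 1
    return counts

log = [
    "3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #11.",
    "3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0x40, unit #11.",
    "kernel: unrelated message",
]
print(count_failures(log))  # {(0, 11): 2}
```

A watcher like this could be pointed at the syslog and used to decide when to fail the member out of the array by hand, before the machine wedges.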

Expected Results:  The system stops trying to write to unit #11 completely and
just marks the drive failed until it is either manually resynced or replaced.

Additional info:

We have tried 3ware driver versions ranging from the default shipped with the
2.4.20-19.9 kernel (1.02.00.032) to the latest driver matching the latest
firmware on the card (1.02.00.036).  I have contacted 3ware regarding this issue
and they have been mostly unhelpful.  As far as they are concerned, the driver is
reporting the error to the kernel, and it is up to the operating system and/or
software RAID layer to mark the drive failed.

Comment 1 Bugzilla owner 2004-09-30 15:41:28 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/


