Bug 113863

Summary: raid 5 devices should offline themselves on multi-drive failure
Product: Red Hat Enterprise Linux 2.1
Reporter: Phil D'Amore <damorep>
Component: kernel
Assignee: Jim Paradis <jparadis>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 2.1
CC: peterm, riel
Hardware: i686   
OS: Linux   
Last Closed: 2005-12-02 22:50:50 UTC

Description Phil D'Amore 2004-01-19 16:32:44 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030314

Description of problem:
This weekend we experienced a multi-drive failure on a raid 5 array
that was likely due to temperature issues.  The /proc/mdstat output
for the array is:

md8 : active raid5 sdl1[8] sdk1[10] sdj1[9] sdi1[3] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdc1[2] sdb1[1] sda1[0]
      104903040 blocks level 5, 64k chunk, algorithm 0 [11/11] [UUUUUUUUUUU]

Of course, that is the current view, after the issues were resolved.

What I saw in /var/log/messages was a gradual degradation of this
array. I grepped the failures out, and you can see them here:

Jan 17 05:57:16 redline kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 10 devices
Jan 17 06:03:45 redline kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 9 devices
Jan 17 06:05:17 redline kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 8 devices
Jan 17 06:06:49 redline kernel: raid5: Disk failure on sdf1, disabling device. Operation continuing on 7 devices
Jan 17 06:11:25 redline kernel: raid5: Disk failure on sdg1, disabling device. Operation continuing on 6 devices
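
To show the pattern, here is a minimal user-space C model of the
behavior I observed; raid5_disk_failure and the counters are
hypothetical simplifications, not the actual kernel code:

#include <stdio.h>

#define RAID_DISKS 11               /* member disks in md8 */
#define MIN_DISKS  (RAID_DISKS - 1) /* RAID 5 survives only one failure */

static int working_disks = RAID_DISKS;

/* Models the observed behavior: mark the disk failed and keep going,
   with no check that enough members remain for consistency. */
static void raid5_disk_failure(const char *dev)
{
    working_disks--;
    printf("raid5: Disk failure on %s, disabling device. "
           "Operation continuing on %d devices\n", dev, working_disks);
}

int main(void)
{
    const char *failed[] = { "sda1", "sdc1", "sdb1", "sdf1", "sdg1" };
    for (int i = 0; i < 5; i++)
        raid5_disk_failure(failed[i]);  /* keeps "continuing" below MIN_DISKS */
    return 0;
}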

With anything fewer than 10 working drives this array cannot function,
but I still get these "Operation continuing" messages.  When this
machine came back up, mkraid brought the array back, but the
underlying filesystem was severely corrupted, and I wound up doing a
restore.

I should note here that md9 on this host is of the same configuration,
and failed in the same way, but no corruption resulted after I brought
the array back and checked the fs.

It seems that if a RAID 5 array falls below the minimum number of
working disks it requires, the device should deactivate itself to
protect the data on it in a consistent state.
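
To illustrate, here is a minimal user-space C sketch of the check I
have in mind; the names and structure are hypothetical
simplifications, not the actual md driver code:

#include <stdio.h>

#define RAID_DISKS 11
#define MIN_DISKS  (RAID_DISKS - 1)

static int working_disks = RAID_DISKS;
static int array_active  = 1;

/* Proposed behavior: once the array can no longer be consistent,
   take it offline instead of continuing to accept I/O. */
static void raid5_disk_failure(const char *dev)
{
    working_disks--;
    if (working_disks < MIN_DISKS) {
        array_active = 0;  /* stop servicing reads and writes */
        printf("raid5: Disk failure on %s: %d of %d devices left, "
               "taking array offline\n", dev, working_disks, RAID_DISKS);
        return;
    }
    printf("raid5: Disk failure on %s, disabling device. "
           "Operation continuing on %d devices\n", dev, working_disks);
}

int main(void)
{
    const char *failed[] = { "sda1", "sdc1", "sdb1" };
    for (int i = 0; i < 3 && array_active; i++)
        raid5_disk_failure(failed[i]);  /* the second failure trips the guard */
    return 0;
}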

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try

Steps to Reproduce:
Have a multi-drive failure on a RAID 5 device.

Observe that "operation continues", and that failures like this don't
seem to yield consistent results.

Expected Results:  It seems to me that the array should be taken
offline, preventing further corruption of the underlying data. An
option to panic the box in this case would also be a nice way to
prevent further data loss, since IMHO, data loss from multi-drive
failures can often be avoided if the system simply stops when they
happen.
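
The panic option could work the same way, gated by a hypothetical
tunable (something in the spirit of a sysctl or module parameter;
panic_on_multi_failure below is a made-up name):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical knob (made-up name): when nonzero, halt on a
   multi-drive failure instead of continuing, so nothing else can
   write to the array. */
static int panic_on_multi_failure = 1;

static void handle_multi_drive_failure(int working, int minimum)
{
    if (working >= minimum)
        return;  /* single failure: normal degraded operation */
    if (panic_on_multi_failure) {
        fprintf(stderr, "md: multi-drive failure, halting to limit damage\n");
        abort();  /* stands in for panic() in this user-space model */
    }
    fprintf(stderr, "md: multi-drive failure, array taken offline\n");
}

int main(void)
{
    handle_multi_drive_failure(8, 10);  /* 8 working, 10 required */
    return 0;
}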

Additional info:

Comment 1 Jim Paradis 2005-12-02 22:50:50 UTC
This issue is beyond the scope of the current support status of RHEL 2.1.  No
fix is planned.