Bug 113863 - raid 5 devices should offline themselves on multi-drive failure
Summary: raid 5 devices should offline themselves on multi-drive failure
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jim Paradis
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-01-19 16:32 UTC by Phil D'Amore
Modified: 2013-08-06 01:03 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-12-02 22:50:50 UTC
Target Upstream Version:
Embargoed:



Description Phil D'Amore 2004-01-19 16:32:44 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030314

Description of problem:
This weekend we experienced a multi-drive failure on a raid 5 array
that was likely due to temperature issues.  The /proc/mdstat output
for the array is:

md8 : active raid5 sdl1[8] sdk1[10] sdj1[9] sdi1[3] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdc1[2] sdb1[1] sda1[0]
      104903040 blocks level 5, 64k chunk, algorithm 0 [11/11] [UUUUUUUUUUU]

Of course, that is the current view, after the issues were resolved.
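
For reference, the member counter and the per-device map at the end of
that output can be read mechanically. The following is only a sketch in
Python, assuming the status fields look like the ones quoted above; the
names parse_md_status and raid5_has_enough_members are hypothetical and
not part of any md tool.

```python
import re


def parse_md_status(text):
    """Return (configured, active, flags) from an mdstat status fragment.

    Expects the "[n/m]" counter followed by the "[UUU_]" device map, as
    in the output quoted above. Whitespace (including a line break)
    between the two fields is tolerated.
    """
    match = re.search(r"\[(\d+)/(\d+)\]\s*\[([U_]+)\]", text)
    if match is None:
        raise ValueError("no md status fields found")
    configured = int(match.group(1))
    active = int(match.group(2))
    return configured, active, match.group(3)


def raid5_has_enough_members(configured, active):
    # RAID 5 stores one device's worth of parity, so the array can run
    # with at most one member missing.
    return active >= configured - 1
```

For the 11-device array above, this check treats 10 active members as
the minimum.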

What I saw in /var/log/messages was a gradual degradation of this
array.  I grep'd the failures out and you can see them here:

Jan 17 05:57:16 redline kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 10 devices
Jan 17 06:03:45 redline kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 9 devices
Jan 17 06:05:17 redline kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 8 devices
Jan 17 06:06:49 redline kernel: raid5: Disk failure on sdf1, disabling device. Operation continuing on 7 devices
Jan 17 06:11:25 redline kernel: raid5: Disk failure on sdg1, disabling device. Operation continuing on 6 devices

With fewer than 10 drives this array should not work, but I still get
these "Operation continuing" messages.  When this machine came back
up, mkraid brought the array back, but the underlying filesystem was
severely corrupted, and I wound up doing a restore.

I should note here that md9 on this host is of the same configuration,
and failed in the same way, but no corruption resulted after I brought
the array back and checked the fs.

It seems that if a RAID 5 array falls to fewer disks than the minimum
required, the device should deactivate itself so that the data on it
stays in a consistent state.
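
To make the condition concrete, here is a rough userspace sketch in
Python (not kernel code, and not how the md driver is implemented) that
scans messages in the format quoted above and reports the first failure
that leaves a RAID 5 array below its minimum.  The function name
first_fatal_failure and its parameters are hypothetical.

```python
import re

# Matches the raid5 failure message format quoted in this report. \s+
# tolerates the message being wrapped across two lines.
FAILURE_RE = re.compile(
    r"raid5: Disk failure on (\S+), disabling\s+device\."
    r"\s+Operation continuing on (\d+) devices"
)


def first_fatal_failure(log_text, configured):
    """Return (device, remaining) for the first failure that leaves
    fewer than configured - 1 members, or None if there is none.

    configured is the number of devices the array was built with.
    """
    minimum = configured - 1  # RAID 5 tolerates one missing member
    for match in FAILURE_RE.finditer(log_text):
        device, remaining = match.group(1), int(match.group(2))
        if remaining < minimum:
            return device, remaining
    return None
```

Run against the messages above with configured=11, this points at the
sdc1 failure (9 devices remaining) as the moment the array should have
stopped accepting I/O.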

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try

Steps to Reproduce:
Have a multi-drive failure on a RAID 5 device.

Observe that "operation continues", and that failures like this don't
seem to yield consistent results.

Expected Results:  It seems to me that the array should be taken
offline, preventing further corruption of the underlying data.  An
option to panic the box in this case would also be nice, since IMHO
data loss from multi-drive failures can often be avoided if the system
just stops when they happen.

Additional info:

Comment 1 Jim Paradis 2005-12-02 22:50:50 UTC
This issue is beyond the scope of the current support status of RHEL 2.1.  No
fix is planned.


