Bug 113863

Summary: raid 5 devices should offline themselves on multi-drive failure
Product: Red Hat Enterprise Linux 2.1
Reporter: Phil D'Amore <damorep>
Component: kernel
Assignee: Jim Paradis <jparadis>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 2.1
CC: peterm, riel
Hardware: i686   
OS: Linux   
Last Closed: 2005-12-02 22:50:50 UTC

Description Phil D'Amore 2004-01-19 16:32:44 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030314

Description of problem:
This weekend we experienced a multi-drive failure on a raid 5 array
that was likely due to temperature issues.  The /proc/mdstat output
for the array is:

md8 : active raid5 sdl1[8] sdk1[10] sdj1[9] sdi1[3] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdc1[2] sdb1[1] sda1[0]
      104903040 blocks level 5, 64k chunk, algorithm 0 [11/11] [UUUUUUUUUUU]

Of course, that is the current view, after the issues were resolved.

What I saw in /var/log/messages was a gradual degradation of this
array. I grepped the failures out, and you can see them here:

Jan 17 05:57:16 redline kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 10 devices
Jan 17 06:03:45 redline kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 9 devices
Jan 17 06:05:17 redline kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 8 devices
Jan 17 06:06:49 redline kernel: raid5: Disk failure on sdf1, disabling device. Operation continuing on 7 devices
Jan 17 06:11:25 redline kernel: raid5: Disk failure on sdg1, disabling device. Operation continuing on 6 devices
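
To show the pattern, here is a minimal user-space C model of the
behavior I observed; raid5_disk_failure and the counters are
hypothetical simplifications, not the actual kernel code:

#include <stdio.h>

#define RAID_DISKS 11               /* member disks in md8 */
#define MIN_DISKS  (RAID_DISKS - 1) /* RAID 5 survives only one failure */

static int working_disks = RAID_DISKS;

/* Models the observed behavior: mark the disk failed and keep going,
   with no check that enough members remain for consistency. */
static void raid5_disk_failure(const char *dev)
{
    working_disks--;
    printf("raid5: Disk failure on %s, disabling device. "
           "Operation continuing on %d devices\n", dev, working_disks);
}

int main(void)
{
    const char *failed[] = { "sda1", "sdc1", "sdb1", "sdf1", "sdg1" };
    for (int i = 0; i < 5; i++)
        raid5_disk_failure(failed[i]);  /* keeps "continuing" below MIN_DISKS */
    return 0;
}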

With anything fewer than 10 working drives this array cannot function,
but I still get these "Operation continuing" messages.  When this
machine came back up, mkraid brought the array back, but the
underlying filesystem was severely corrupted, and I wound up doing a
restore.

I should note here that md9 on this host is of the same configuration,
and failed in the same way, but no corruption resulted after I brought
the array back and checked the fs.

It seems that if a RAID 5 array falls below the minimum number of
working disks it requires, the device should deactivate itself to
protect the data on it in a consistent state.
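
To illustrate, here is a minimal user-space C sketch of the check I
have in mind; the names and structure are hypothetical
simplifications, not the actual md driver code:

#include <stdio.h>

#define RAID_DISKS 11
#define MIN_DISKS  (RAID_DISKS - 1)

static int working_disks = RAID_DISKS;
static int array_active  = 1;

/* Proposed behavior: once the array can no longer be consistent,
   take it offline instead of continuing to accept I/O. */
static void raid5_disk_failure(const char *dev)
{
    working_disks--;
    if (working_disks < MIN_DISKS) {
        array_active = 0;  /* stop servicing reads and writes */
        printf("raid5: Disk failure on %s: %d of %d devices left, "
               "taking array offline\n", dev, working_disks, RAID_DISKS);
        return;
    }
    printf("raid5: Disk failure on %s, disabling device. "
           "Operation continuing on %d devices\n", dev, working_disks);
}

int main(void)
{
    const char *failed[] = { "sda1", "sdc1", "sdb1" };
    for (int i = 0; i < 3 && array_active; i++)
        raid5_disk_failure(failed[i]);  /* the second failure trips the guard */
    return 0;
}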

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try

Steps to Reproduce:
Have a multi-drive failure on a RAID 5 device.

Observe that "operation continues", and that failures like this don't
seem to yield consistent results.

Expected Results:  It seems to me that the array should be taken
offline, preventing further corruption of the underlying data. An
option to panic the box in this case would also be a nice way to
prevent further data loss, since IMHO, data loss from multi-drive
failures can often be avoided if the system simply stops when they
happen.
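
The panic option could work the same way, gated by a hypothetical
tunable (something in the spirit of a sysctl or module parameter;
panic_on_multi_failure below is a made-up name):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical knob (made-up name): when nonzero, halt on a
   multi-drive failure instead of continuing, so nothing else can
   write to the array. */
static int panic_on_multi_failure = 1;

static void handle_multi_drive_failure(int working, int minimum)
{
    if (working >= minimum)
        return;  /* single failure: normal degraded operation */
    if (panic_on_multi_failure) {
        fprintf(stderr, "md: multi-drive failure, halting to limit damage\n");
        abort();  /* stands in for panic() in this user-space model */
    }
    fprintf(stderr, "md: multi-drive failure, array taken offline\n");
}

int main(void)
{
    handle_multi_drive_failure(8, 10);  /* 8 working, 10 required */
    return 0;
}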

Additional info:

Comment 1 Jim Paradis 2005-12-02 22:50:50 UTC
This issue is beyond the scope of the current support status of RHEL 2.1.  No
fix is planned.