From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030314
Description of problem:
This weekend we experienced a multi-drive failure on a RAID 5 array,
likely due to temperature issues. The /proc/mdstat output for the
array is:
md8 : active raid5 sdl1 sdk1 sdj1 sdi1 sdh1 sdg1 sdf1 sde1 sdc1 sdb1 sda1
      104903040 blocks level 5, 64k chunk, algorithm 0 [11/11]
Of course, that is the current view, after the issues were resolved.
What I saw in /var/log/messages was a gradual degradation of this
array. I grep'd the failures out and you can see them here:
Jan 17 05:57:16 redline kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 10 devices
Jan 17 06:03:45 redline kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 9 devices
Jan 17 06:05:17 redline kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 8 devices
Jan 17 06:06:49 redline kernel: raid5: Disk failure on sdf1, disabling device. Operation continuing on 7 devices
Jan 17 06:11:25 redline kernel: raid5: Disk failure on sdg1, disabling device. Operation continuing on 6 devices
With fewer than 10 working drives this 11-drive array should not work,
yet I still get these "Operation continuing" messages. When the machine
came back up, mkraid brought the array back, but the underlying
filesystem was severely corrupted, and I wound up doing a restore.
I should note here that md9 on this host is of the same configuration,
and failed in the same way, but no corruption resulted after I brought
the array back and checked the fs.
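As a rough illustration of the point above, the failure messages can be checked mechanically against the RAID 5 minimum of one fewer than the total member count. This is only a sketch: the function name and the regular expression are my own, written against the messages quoted in this report, and the total of 11 is taken from the [11/11] field in the mdstat output.

```python
import re

# Matches the raid5 failure messages quoted in this report (assumed format).
FAILURE_RE = re.compile(
    r"raid5: Disk failure on (?P<dev>\w+), disabling\s+device\.\s+"
    r"Operation continuing on (?P<left>\d+) devices"
)

def first_fatal_failure(log_text, total_devices):
    """Return (device, remaining) for the first failure that leaves a
    RAID 5 array with fewer than total_devices - 1 members, else None."""
    minimum = total_devices - 1  # RAID 5 survives the loss of one member only
    for match in FAILURE_RE.finditer(log_text):
        remaining = int(match.group("left"))
        if remaining < minimum:
            return match.group("dev"), remaining
    return None

LOG = (
    "Jan 17 05:57:16 redline kernel: raid5: Disk failure on sda1, "
    "disabling device. Operation continuing on 10 devices\n"
    "Jan 17 06:03:45 redline kernel: raid5: Disk failure on sdc1, "
    "disabling device. Operation continuing on 9 devices\n"
)

print(first_fatal_failure(LOG, total_devices=11))  # -> ('sdc1', 9)
```

For md8 the first message is survivable (10 of 11 members), while the second one, at 06:03:45, is already past the point where the array can reconstruct data.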
It seems that if a RAID 5 array drops below the minimum number of
disks required, the device should deactivate itself so that the data
on it stays in a consistent state.
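To make the requested behavior concrete, here is a toy model of it. It is not the md driver and does not reflect how the kernel is structured; the class and method names are hypothetical and exist only to show the rule "stop accepting I/O once more than one member is gone".

```python
class Raid5Array:
    """Toy model of the behavior requested here; not the md driver."""

    def __init__(self, devices):
        self.working = set(devices)
        self.total = len(devices)
        self.active = True

    def fail_device(self, dev):
        """Drop a member; deactivate the array if data can no longer be rebuilt."""
        self.working.discard(dev)
        # One device's worth of parity means total - 1 members are required.
        if len(self.working) < self.total - 1:
            self.active = False
        return self.active

    def write(self, block, data):
        """Refuse writes once the array has been deactivated."""
        if not self.active:
            raise OSError("array deactivated: too few working members")
        # A real driver would stripe data and parity across members here.
        return len(data)


array = Raid5Array(["sda1", "sdb1", "sdc1"])
array.fail_device("sda1")   # degraded, still usable
array.fail_device("sdc1")   # second failure: array goes inactive
print(array.active)         # -> False
```

The point of the sketch is that the second failure flips the array to inactive and every later write is rejected, instead of "continuing" on a set of disks that can no longer hold consistent data.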
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Have a multi-drive failure on a RAID 5 device.
Observe that "operation continues", and that failures like this do not
yield consistent results.
Expected Results: It seems to me that the array should be taken
offline, preventing further corruption of the underlying data. An
option to panic the box in this case would also be nice to prevent
further data loss, since IMHO data loss from multi-drive failures
can often be avoided if the system just stops when it
This issue is beyond the scope of the current support status of RHEL 2.1. No
fix is planned.