From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030314

Description of problem:
This weekend we experienced a multi-drive failure on a RAID 5 array that was likely due to temperature issues. The /proc/mdstat output for the array is:

md8 : active raid5 sdl1[8] sdk1[10] sdj1[9] sdi1[3] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdc1[2] sdb1[1] sda1[0]
      104903040 blocks level 5, 64k chunk, algorithm 0 [11/11] [UUUUUUUUUUU]

Of course, that is the current view, after the issues were resolved. What I saw in /var/log/messages was a gradual degradation of this array. I grepped the failures out and you can see them here:

Jan 17 05:57:16 redline kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 10 devices
Jan 17 06:03:45 redline kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 9 devices
Jan 17 06:05:17 redline kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 8 devices
Jan 17 06:06:49 redline kernel: raid5: Disk failure on sdf1, disabling device. Operation continuing on 7 devices
Jan 17 06:11:25 redline kernel: raid5: Disk failure on sdg1, disabling device. Operation continuing on 6 devices

With fewer than 10 working drives this array should not function at all, yet the kernel kept logging these "operation continuing" messages. When the machine came back up, running mkraid brought the array back, but the underlying filesystem was severely corrupted, and I wound up doing a restore. I should note here that md9 on this host has the same configuration and failed in the same way, but no corruption resulted after I brought that array back and checked the filesystem. If a RAID 5 array drops below the minimum number of working disks, the device should deactivate itself to preserve the data on it in a consistent state.
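The `[11/11]` field in the /proc/mdstat output above is the configured device count versus the number of working devices; RAID 5 tolerates only a single missing member. As a rough illustration of the check the reporter wants the kernel to perform, here is a minimal, hypothetical Python sketch (not part of the md driver or any existing tool) that parses /proc/mdstat text and flags RAID 5 arrays that have lost more than one device:

```python
import re

def degraded_raid5_arrays(mdstat_text):
    """Return (name, configured, working) for each RAID 5 array in the
    given /proc/mdstat text that has lost more members than RAID 5 can
    tolerate (i.e. more than one).  Hypothetical helper for illustration."""
    failed = []
    current = None
    for line in mdstat_text.splitlines():
        # Header line, e.g. "md8 : active raid5 sdl1[8] ..."
        m = re.match(r'^(md\d+)\s*:\s*active\s+raid5\b', line)
        if m:
            current = m.group(1)
            continue
        if current:
            # Status line carries "[configured/working]", e.g. "[11/6]"
            m = re.search(r'\[(\d+)/(\d+)\]', line)
            if m:
                configured, working = int(m.group(1)), int(m.group(2))
                if configured - working > 1:  # RAID 5 survives one failure
                    failed.append((current, configured, working))
                current = None
    return failed
```

Run against a snapshot taken mid-failure (e.g. `[11/6]` after the five disk failures logged above), such a check would report md8 as beyond recovery, which is the point at which the reporter argues the array should be deactivated rather than allowed to continue.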
Version-Release number of selected component (if applicable):

How reproducible:
Didn't try

Steps to Reproduce:
1. Have a multi-drive failure on a RAID 5 device.
2. Observe that "operation continues", and that failures like this do not yield consistent results.

Expected Results:
The array should be taken offline, preventing further corruption of the underlying data. An option to panic the box in this case would also be a nice way to prevent further data loss, since IMHO data loss from multi-drive failures can often be avoided if the system simply stops when the failure occurs.

Additional info:
This issue is beyond the scope of the current support status of RHEL 2.1. No fix is planned.