Description of problem:
A software RAID partition simultaneously lost two discs in a RAID5 3+1 configuration. All four discs are identical Western Digital 500GB Parallel ATA (IDE) discs, each connected as the single device (master) on its own channel of a Promise Ultra ATA PCI dual-channel 100TX or 133TX controller. There are two such controllers in the system (which also has an on-board 278 SATA controller, a 278 SATA PCI controller and an Adaptec AHA-2940 SCSI controller).

The RAID set was built from sdb (PCI PATA controller 1, channel 0), sdc (PCI PATA controller 1, channel 1), sdd (PCI PATA controller 2, channel 0) and sde (PCI PATA controller 2, channel 1). The failure was caused by the simultaneous loss of both sdc and sde: two discs on completely different controllers, but in both cases on channel 1 (the second channel) of the controller. This seems significant to me, hence this bug report.

Version-Release number of selected component (if applicable):
2.6.24.4-64.fc8

How reproducible:
Only identifiably happened once. However, the server has become very unstable of late (probably since this kernel was installed), with a number of spontaneous crashes (system freeze, no crash data recovered) when attempting heavy IO on the RAID device affected by this failure. The last such crash was at approx. 20:30 last night, prior to this logged failure at 04:16 this morning.

Steps to Reproduce:
Unknown.

Actual results:
System loses access to the software RAID partition or locks up solid.

Expected results:
System operates as normal.

Additional info:
MDADM notification of 04:16:44 this morning:
--------------------------------------------
A Fail event had been detected on md device /dev/md2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdb1[0] sde1[4](F) sdd1[2] sdc1[5](F)
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/2] [U_U_]

md0 : active raid5 sdf1[0] sdi1[3] sdh1[2] sdg1[1]
      879100608 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
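For anyone wanting to cross-check the pattern described above, here is a minimal Python sketch that reads the mdstat text from the notification, picks out the members flagged (F), and looks up where each one sits. The names `parse_mdstat`, `failed_locations` and `LAYOUT` are hypothetical helpers written for this illustration, the controller/channel table is copied by hand from the description in this report, and the regular expressions are only known to handle the mdstat layout shown here.

```python
import re

# mdstat text exactly as quoted in the mdadm notification above.
MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdb1[0] sde1[4](F) sdd1[2] sdc1[5](F)
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/2] [U_U_]

md0 : active raid5 sdf1[0] sdi1[3] sdh1[2] sdg1[1]
      879100608 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
"""

# Hand-copied from the report: disc -> (PCI PATA controller, channel).
LAYOUT = {
    "sdb": (1, 0),
    "sdc": (1, 1),
    "sdd": (2, 0),
    "sde": (2, 1),
}


def parse_mdstat(text):
    """Return {array name: info dict} for the arrays found in mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+) : (\S+) (\S+) (.*)$", line)
        if header:
            name, state, level, members = header.groups()
            # Members marked faulty look like "sde1[4](F)".
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", members)
            current = {"state": state, "level": level, "failed": sorted(failed)}
            arrays[name] = current
            continue
        # Status line ends with e.g. "[4/2] [U_U_]": expected/working members.
        counts = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if counts and current is not None:
            current["expected"] = int(counts.group(1))
            current["working"] = int(counts.group(2))
            current["slots"] = counts.group(3)
    return arrays


def failed_locations(array, layout):
    """Map each failed member (e.g. 'sdc1') to its (controller, channel)."""
    # Strip the trailing partition number to get the disc name.
    return {m: layout[m.rstrip("0123456789")] for m in array["failed"]}


arrays = parse_mdstat(MDSTAT)
md2 = arrays["md2"]
locations = failed_locations(md2, LAYOUT)
missing = md2["expected"] - md2["working"]

for member, (controller, channel) in locations.items():
    print(f"{member}: controller {controller}, channel {channel}")
# RAID5 survives the loss of one member only.
print(f"md2 is missing {missing} of {md2['expected']} members")
```

On a live system the text would come from reading /proc/mdstat rather than from a pasted string; the pasted copy is used here so the sketch runs anywhere.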
Created attachment 304444: /var/log/messages from the period of the problem
Upgrading Kernel to 2.6.24.5-85.fc8 before attempting to recover software RAID.
The trace shows the drive going busy and never coming back. At that point it jams the entire channel, so both disks on the channel will be lost. Linux is correctly trying to recover by resetting the device, but to no effect. At first glance that looks like a failing drive.
Understood, except that this happened almost simultaneously on two different controllers, and in both cases it was the second channel that died. This is why I reported it as a bug rather than just assuming it was a hardware fault.

Update: since going to the 2.6.24.5-85.fc8 kernel, disabling unused devices in the BIOS (USB controller, serial ports, AC97 audio, IEEE1394 controller) to free IRQs, and reseating all the PCI boards, I've had the system under fairly heavy load for 48 hours without a single reported disc error or problem. The area that previously failed under heavy load was put under similar stress as before, with no adverse effects. No components have been changed.

When the array was brought back on-line, no disc errors were reported by e2fsck -f. No corruption was found in the file system or in the files (several hundred in the affected areas were checked).

Regards,
Bevis.
OK, I'll close this bug for now. If it does it again, please re-open the bug and we can dig deeper.