Red Hat Bugzilla – Bug 445073
Kernel loses secondary channels of Promise PATA Controller
Last modified: 2008-05-06 07:20:19 EDT
Description of problem:
Software RAID partition simultaneously lost two discs in a RAID5 3+1
configuration. All four discs are identical Western Digital 500GB Parallel
ATA (aka IDE) discs, each connected as the single (master) device on one
channel of a Promise Ultra ATA PCI dual-channel 100TX or 133TX controller. There are two
such controllers in the system (which also has an on-board 278 SATA controller,
a 278 SATA PCI controller and an Adaptec AHA-2940 SCSI controller).
The RAID set was built from sdb (PCI PATA controller 1, channel 0), sdc (PCI
PATA controller 1, channel 1), sdd (PCI PATA controller 2, channel 0) and sde
(PCI PATA controller 2, channel 1). The failure was caused by the simultaneous
loss of both sdc and sde: two discs on completely different controllers, but in
both cases on channel 1 (the second channel) of its controller. This seems
significant to me, hence this bug report.
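To make the pattern easier to see, here is a minimal Python sketch that tabulates the layout described above and reports which controllers and channels the failed discs sit on. The `LAYOUT` table and the `failure_pattern` helper are hypothetical names introduced only for illustration; the device-to-channel mapping is taken from the description, not read from the hardware.

```python
# Layout as described in the report: device -> (PCI PATA controller, channel).
# This table is typed in by hand for illustration, not probed from the system.
LAYOUT = {
    "sdb": (1, 0),
    "sdc": (1, 1),
    "sdd": (2, 0),
    "sde": (2, 1),
}


def failure_pattern(failed, layout=LAYOUT):
    """Return (controllers hit, channels hit) for the failed devices."""
    controllers = sorted({layout[dev][0] for dev in failed})
    channels = sorted({layout[dev][1] for dev in failed})
    return controllers, channels


controllers, channels = failure_pattern(["sdc", "sde"])
print("controllers:", controllers)  # controllers: [1, 2]
print("channels:", channels)        # channels: [1]
```

Both controllers are affected, but only channel 1 on each, which is the coincidence this report is about.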
Version-Release number of selected component (if applicable):
How reproducible:
Has identifiably happened only once. However, the server has become very
unstable of late (probably since this kernel was installed), with a number of
spontaneous crashes (system freeze, no crash data recovered) when attempting
heavy IO on the RAID device affected by this failure. The last such crash was
at approx 20:30 last night, prior to this logged failure at 04:16 this morning.
Steps to Reproduce:
Actual results:
System loses access to Software RAID partition or locks up solid.
Expected results:
System operates as normal.
Additional info:
MDADM notification of 04:16:44 this morning:
A Fail event had been detected on md device /dev/md2.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdb1 sde1(F) sdd1 sdc1(F)
1465151808 blocks level 5, 64k chunk, algorithm 2 [4/2] [U_U_]
md0 : active raid5 sdf1 sdi1 sdh1 sdg1
879100608 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
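As a rough illustration of how to read the status above, the following Python sketch parses mdstat-style text and flags arrays that have lost more members than their RAID level tolerates. It assumes the simplified line format pasted in this report (it also accepts the `[n]` role suffix that live /proc/mdstat output normally carries); `parse_mdstat`, `beyond_redundancy` and `TOLERATED` are hypothetical names, and the sample text is copied from this report rather than read from a live system.

```python
import re

# Sample copied from the report; on a live system read /proc/mdstat instead.
MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdb1 sde1(F) sdd1 sdc1(F)
1465151808 blocks level 5, 64k chunk, algorithm 2 [4/2] [U_U_]
md0 : active raid5 sdf1 sdi1 sdh1 sdg1
879100608 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.*)$")
MEMBER_RE = re.compile(r"(\w+)(?:\[\d+\])?(\(F\))?")
COUNT_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")

# Number of missing members each level can survive.
TOLERATED = {"raid4": 1, "raid5": 1, "raid6": 2}


def parse_mdstat(text):
    """Return {array name: info dict} for every md array found in text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line.strip())
        if header:
            name, state, level, members = header.groups()
            parsed = MEMBER_RE.findall(members)
            current = {
                "state": state,
                "level": level,
                "members": [dev for dev, _ in parsed],
                "failed": sorted(dev for dev, flag in parsed if flag),
            }
            arrays[name] = current
            continue
        counts = COUNT_RE.search(line)
        if counts and current is not None:
            current["expected"] = int(counts.group(1))
            current["working"] = int(counts.group(2))
            current["map"] = counts.group(3)
    return arrays


def beyond_redundancy(info):
    """True if more members are missing than the RAID level tolerates."""
    missing = info["expected"] - info["working"]
    return missing > TOLERATED.get(info["level"], 0)


for name, info in parse_mdstat(MDSTAT).items():
    print(name, info["failed"], info["map"], beyond_redundancy(info))
```

For md2 this reports sdc1 and sde1 as failed with only 2 of 4 members working, one more loss than RAID5 can absorb, while md0 is intact.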
Created attachment 304444
/var/log/messages from the period of the problem
Upgrading Kernel to 220.127.116.11-85.fc8 before attempting to recover software RAID.
The trace shows the drive going busy and never coming back. At that point it
jams the entire channel so both disks on the channel will be lost.
Linux is correctly trying to recover by resetting the device but to no effect.
At first glance that looks like a failing drive.
Understood - except this happens almost simultaneously with two different
controllers and in both cases the second channel dies. This is why I reported
it as a bug rather than just figuring it was a hardware fault.
Update: since going to the 18.104.22.168-85.fc8 kernel, plus disabling non-used
devices in the BIOS (USB controller, Serial Ports, AC97 Audio card, IEEE1394
controller) to free IRQs, and reseating all the PCI boards, I've had the system
under fairly heavy load for 48 hours with not a single reported disc error or
problem. The area of previous heavy load was under similar stress as before,
but with no adverse effects. No components have been changed. When brought
back on-line, e2fsck -f reported no disc errors. No corruption was found in the
file system or files (several hundred in the affected areas
OK, I'll close this bug for now. If it happens again, please re-open the bug and
we can dig deeper.