Bug 445073 - Kernel loses secondary channels of Promise PATA Controller
Product: Fedora
Classification: Fedora
Component: kernel
Hardware: x86_64 Linux
Priority: low, Severity: high
Assigned To: Alan Cox
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-05-03 03:54 EDT by Bevis King
Modified: 2008-05-06 07:20 EDT
Last Closed: 2008-05-06 07:20:19 EDT

Attachments
/var/log/messages from the period of the problem (12.00 KB, text/plain)
2008-05-03 04:08 EDT, Bevis King

Description Bevis King 2008-05-03 03:54:39 EDT
Description of problem:
A software RAID partition simultaneously lost two discs in a RAID5 3+1
configuration.  All four discs are identical Western Digital 500GB Parallel
ATA (aka IDE) discs, each the single (master) device on one channel of a
Promise Ultra ATA PCI dual-channel 100TX or 133TX controller.  There are two
such controllers in the system (which also has an on-board 278 SATA controller,
a 278 SATA PCI controller and an Adaptec AHA-2940 SCSI controller).

The RAID set was built from sdb (PCI PATA controller 1, channel 0), sdc (PCI
PATA controller 1, channel 1), sdd (PCI PATA controller 2, channel 0) and sde
(PCI PATA controller 2, channel 1).  The failure was caused by the simultaneous
loss of both sdc and sde: two discs on completely different controllers, but in
both cases on channel 1 (the second channel) of the controller.  This seems
significant to me, hence this bug report.
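
The layout described above can be written down as a small table, which makes
the pattern easy to check.  This is only a minimal Python sketch, assuming the
device-to-channel mapping is exactly as stated in this report; the names
LAYOUT and failed_by_channel are illustrative and not part of any tool.

```python
from collections import defaultdict

# Device -> (controller, channel), as described in this report (assumed exact).
LAYOUT = {
    "sdb": (1, 0),
    "sdc": (1, 1),
    "sdd": (2, 0),
    "sde": (2, 1),
}

def failed_by_channel(failed):
    """Group failed devices by channel number, recording their controller."""
    groups = defaultdict(list)
    for dev in failed:
        controller, channel = LAYOUT[dev]
        groups[channel].append((dev, controller))
    return dict(groups)

# Both failed discs land in the channel 1 group, on different controllers.
print(failed_by_channel(["sdc", "sde"]))
# prints {1: [('sdc', 1), ('sde', 2)]}
```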

Version-Release number of selected component (if applicable):

How reproducible:
Identifiably happened only once.  However, the server has become very unstable
of late (probably since this kernel was installed), with a number of spontaneous
crashes (system freeze, no crash data recovered) when attempting heavy IO on the
RAID device affected by this failure.  The last such crash was at approx 20:30
last night, prior to this logged failure at 04:16 this morning.

Steps to Reproduce:
Actual results:
System loses access to the software RAID partition or locks up solid.

Expected results:
System operates as normal.

Additional info:
Comment 1 Bevis King 2008-05-03 03:57:32 EDT
MDADM notification of 04:16:44 this morning:
A Fail event had been detected on md device /dev/md2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4] 
md2 : active raid5 sdb1[0] sde1[4](F) sdd1[2] sdc1[5](F)
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/2] [U_U_]
md0 : active raid5 sdf1[0] sdi1[3] sdh1[2] sdg1[1]
      879100608 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
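
For anyone reading the md2 entry above: members marked (F) are faulty, and
[4/2] means 4 devices expected with only 2 active.  The following is a rough
Python sketch that pulls those facts out of the two md2 lines; it assumes the
mdstat layout is exactly as pasted here and is not a general /proc/mdstat
parser.  The names MDSTAT and parse_array are made up for illustration.

```python
import re

# The md2 entry copied from the /proc/mdstat output in this comment.
MDSTAT = """\
md2 : active raid5 sdb1[0] sde1[4](F) sdd1[2] sdc1[5](F)
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/2] [U_U_]
"""

def parse_array(text):
    """Return name, failed members, and expected/active device counts."""
    header, detail = text.strip().splitlines()[:2]
    name = header.split()[0]
    # Members look like "sde1[4](F)"; a trailing "(F)" marks a faulty one.
    failed = re.findall(r"(\w+)\[\d+\]\(F\)", header)
    # "[4/2]" means 4 devices expected, 2 currently active.
    expected, active = map(int, re.search(r"\[(\d+)/(\d+)\]", detail).groups())
    return name, failed, expected, active

name, failed, expected, active = parse_array(MDSTAT)
# RAID5 tolerates the loss of one member, so it needs expected - 1 active.
usable = active >= expected - 1
print(name, failed, expected, active, usable)
```

With two of four members faulty, the RAID5 set is below the three active
devices it needs, which matches the loss of access reported here.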
Comment 2 Bevis King 2008-05-03 04:08:32 EDT
Created attachment 304444
/var/log/messages from the period of the problem
Comment 3 Bevis King 2008-05-03 04:28:38 EDT
Upgrading Kernel to before attempting to recover software RAID.
Comment 4 Alan Cox 2008-05-03 10:24:39 EDT
The trace shows the drive going busy and never coming back. At that point it
jams the entire channel so both disks on the channel will be lost.

Linux is correctly trying to recover by resetting the device but to no effect.

At first glance that looks like a failing drive.
Comment 5 Bevis King 2008-05-06 07:08:42 EDT
Understood - except this happens almost simultaneously with two different
controllers and in both cases the second channel dies.  This is why I reported
it as a bug rather than just figuring it was a hardware fault.

Update:  since going to the kernel, plus disabling unused
devices in the BIOS (USB controller, serial ports, AC97 audio card, IEEE1394
controller) to free IRQs, and reseating all the PCI boards, I've had the system
under fairly heavy load for 48 hours with not a single reported disc error or
problem.  The area that was previously under heavy load has been put under
similar stress, with no adverse effects.  No components have been changed.  When
brought back on-line, e2fsck -f reported no disc errors, and no corruption was
found in the file system or files (several hundred in the affected areas were
checked).

Regards, Bevis.
Comment 6 Alan Cox 2008-05-06 07:20:19 EDT
OK, I'll close this bug for now; if it does it again, please re-open the bug and
we can dig deeper.
