445073 – Kernel loses secondary channels of Promise PATA Controller

Summary: Kernel loses secondary channels of Promise PATA Controller

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	8
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Alan Cox
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-05-03 07:54 UTC by Bevis King
Modified:	2008-05-06 11:20 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-05-06 11:20:19 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
/var/log/messages from the period of the problem (12.00 KB, text/plain) 2008-05-03 08:08 UTC, Bevis King

Description Bevis King 2008-05-03 07:54:39 UTC

Description of problem:
A software RAID partition simultaneously lost two discs in a RAID5 3+1
configuration.  All four discs are identical Western Digital 500GB Parallel
ATA (aka IDE) discs, each connected as the single (master) device on one
channel of a Promise Ultra ATA PCI dual-channel 100TX or 133TX controller.
There are two such controllers in the system (which also has an on-board 278
SATA controller, a 278 SATA PCI controller and an Adaptec AHA-2940 SCSI
controller).

The RAID set was built from sdb (PCI PATA controller 1, channel 0), sdc (PCI
PATA controller 1, channel 1), sdd (PCI PATA controller 2, channel 0) and sde
(PCI PATA controller 2, channel 1).  The failure was caused by the simultaneous
loss of both sdc and sde: two discs on completely different controllers, but in
both cases on channel 1 (the second channel) of its controller.  This seems
significant to me, hence this bug report.
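
The pattern described above can be checked mechanically. What follows is a minimal sketch in Python that uses only the device-to-controller mapping stated in this report. The names `LAYOUT` and `shared_attributes` are illustrative assumptions and are not part of any existing tool.

```python
# Device layout as described in this report: device -> (controller, channel).
LAYOUT = {
    "sdb": (1, 0),
    "sdc": (1, 1),
    "sdd": (2, 0),
    "sde": (2, 1),
}


def shared_attributes(failed):
    """Return the sets of controllers and channels the failed devices sit on."""
    controllers = {LAYOUT[dev][0] for dev in failed}
    channels = {LAYOUT[dev][1] for dev in failed}
    return controllers, channels


# The two devices that dropped out of md2 in this report.
controllers, channels = shared_attributes(["sdc", "sde"])
print("controllers involved:", sorted(controllers))
print("channels involved:", sorted(channels))
```

For sdc and sde this reports two controllers but only one channel number, which is the coincidence the report is about. It does not by itself distinguish a driver problem from two unrelated hardware faults.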

Version-Release number of selected component (if applicable):
2.6.24.4-64.fc8

How reproducible:
Identified only once.  However, the server has become very unstable of late
(probably since this kernel was installed), with a number of spontaneous
crashes (system freeze, no crash data recovered) when attempting heavy IO on the
RAID device affected by this failure.  The last such crash was at approx. 20:30
last night, prior to this logged failure at 04:16 this morning.

Steps to Reproduce:
Unknown.
  
Actual results:
System loses access to the software RAID partition or locks up solid.

Expected results:
System operates as normal.

Additional info:

Comment 1 Bevis King 2008-05-03 07:57:32 UTC

MDADM notification of 04:16:44 this morning:
--------------------------------------------
A Fail event had been detected on md device /dev/md2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4] 
md2 : active raid5 sdb1[0] sde1[4](F) sdd1[2] sdc1[5](F)
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/2] [U_U_]
      
md0 : active raid5 sdf1[0] sdi1[3] sdh1[2] sdg1[1]
      879100608 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>
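
The failed members can be pulled out of this output programmatically. The following is a minimal sketch in Python, assuming the layout shown above (an `mdN : ...` line listing members, with failed ones flagged `(F)`). The sample text is copied from this comment and `failed_members` is a hypothetical helper name. On a live system the text would instead be read from /proc/mdstat.

```python
import re

# Sample copied from the /proc/mdstat output quoted in this comment.
MDSTAT_SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdb1[0] sde1[4](F) sdd1[2] sdc1[5](F)
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/2] [U_U_]

md0 : active raid5 sdf1[0] sdi1[3] sdh1[2] sdg1[1]
      879100608 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
"""


def failed_members(mdstat_text):
    """Map each md array name to the member devices flagged (F)."""
    result = {}
    for line in mdstat_text.splitlines():
        # Array lines look like "md2 : active raid5 sdb1[0] ...".
        match = re.match(r"^(md\d+) : (.*)$", line)
        if not match:
            continue
        name, rest = match.groups()
        # A failed member appears as "<device>[<slot>](F)".
        result[name] = re.findall(r"(\w+)\[\d+\]\(F\)", rest)
    return result


print(failed_members(MDSTAT_SAMPLE))
```

Only the `(F)` flag is inspected here. The `[4/2] [U_U_]` status field on the following line carries the same information per slot and is ignored by this sketch.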

Comment 2 Bevis King 2008-05-03 08:08:32 UTC

Created attachment 304444
/var/log/messages from the period of the problem

Comment 3 Bevis King 2008-05-03 08:28:38 UTC

Upgrading the kernel to 2.6.24.5-85.fc8 before attempting to recover the software RAID.

Comment 4 Alan Cox 2008-05-03 14:24:39 UTC

The trace shows the drive going busy and never coming back. At that point it
jams the entire channel so both disks on the channel will be lost.

Linux is correctly trying to recover by resetting the device but to no effect.

At first glance that looks like a failing drive.

Comment 5 Bevis King 2008-05-06 11:08:42 UTC

Understood - except this happens almost simultaneously with two different
controllers and in both cases the second channel dies.  This is why I reported
it as a bug rather than just figuring it was a hardware fault.

Update:  since going to the 2.6.24.5-85.fc8 kernel, disabling unused devices
in the BIOS (USB controller, serial ports, AC97 audio card, IEEE1394
controller) to free IRQs, and reseating all the PCI boards, I've had the system
under fairly heavy load for 48 hours with not a single reported disc error or
problem.  The area of previous heavy load was under similar stress as before,
but with no adverse effects.  No components have been changed.  When brought
back on-line, no disc errors were reported by e2fsck -f.  There was no
corruption found to file system or files (several hundred in the affected areas
were checked).

Regards, Bevis.

Comment 6 Alan Cox 2008-05-06 11:20:19 UTC

OK, I'll close this bug for now. If it happens again, please re-open the bug
and we can dig deeper.
