54021 – aacraid module generates errors/won't read partition table

Summary: aacraid module generates errors/won't read partition table

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Raw Hide
Classification:	Retired
Component:	kernel
Sub Component:
Version:	1.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-09-25 16:33 UTC by kevin_myer
Modified:	2007-04-18 16:37 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2001-09-25 16:33:30 UTC
Embargoed:

Description kevin_myer 2001-09-25 16:33:26 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20010913

Description of problem:
I attempted to boot kernel-enterprise-2.4.9-0.5 on a Dell PowerEdge 4400
server.  This server has a PERC3/Di embedded RAID controller with two
containers:  a 9 GB RAID1 and a 100 GB RAID5.

When I booted the kernel, it correctly detected the PERC RAID controller. 
However, when it attempted to do a partition check, I got:

sda: <1> AAC:   NMI ISR: NMI_DMA_0_ERROR

After a few seconds, I'm presented with a seemingly never-ending cycle of
SCSI timeouts.  The kernel doesn't panic, however, as a Ctrl-Alt-Del halts
the system and cleanly reboots it.

I also tried the 2.4.9-0.5smp kernel and had the same results.  I noticed
there seemed to be some new NMI code in 2.4.9-ac10; I don't know if
that has anything to do with it or not.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Power on
2. Boot 2.4.9-0.5 kernel

Actual Results:  Hangs when trying to read the partition table of the first
PERC device (/dev/sda)

Expected Results:  System should boot.

Additional info:

Comment 1 Arjan van de Ven 2001-09-26 08:54:50 UTC

I've seen this on another driver as well, and the cause was found recently.
The change that lets SCSI drivers (only the ones that have the ability and are
tested) use "high memory" directly instead of needing bounce buffers had a small
(but for 3 drivers significant) side effect on how some normal requests were
handled; this exposed bugs in code paths that were never executed before.

This change of behavior has been corrected in kernel 2.4.9-0.12 (and later),
which hopefully will appear in rawhide soon.

Thank you for the report; this means that I'll have to check this driver for the
bug (even though that code path isn't executed).

Comment 2 Matt Domsch 2001-09-26 19:14:04 UTC

Is this the "use sg only for >1 chunk" bug?

Comment 3 kevin_myer 2001-10-11 11:33:26 UTC

This bug should be reopened.  It is _NOT_ fixed in kernel 2.4.9-0.18.  I am able
to get further along in the boot process than before, in that the partition table
is now readable, but shortly thereafter I get the same message as before:

AAC:   NMI ISR: NMI_DMA_0_ERROR

followed by a pause of 20-30 seconds, then endless SCSI errors.  The
three-fingered salute brings the machine down cleanly.
