101337 – 16-way x440 still has broken IRQ routing in the installer

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Doug Ledford
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-07-30 22:03 UTC by James Cleverdon
Modified:	2007-11-30 22:06 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-01-21 11:25:38 UTC
Target Upstream Version:
Embargoed:

Description James Cleverdon 2003-07-30 22:03:02 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
Update of bug #99362 (reported against alpha4) for beta1.

Both the installer and installed kernels have an "Unknown interrupt" message and
a stream of SCSI timeout messages when installed on a 16-way x440.

One error message sequence is produced for each SCSI device _not_ connected to
the Adaptec 7899 controllers.  On a 16-way x440, this is 4 SCSI Wide busses, so
the timeout/error process takes about 15 minutes to complete.

Version-Release number of selected component (if applicable):
kernel-2.4.21-1.1931.2.349.2.2.ent

How reproducible:
Always

Steps to Reproduce:
1. Install on 16-way x440 or reboot after such an install.
2. Get a cup of tea and watch the errors slowly march across the console.
3. Zzzzzz ... eh?  What?  Is it done yet?
    

Actual Results:  Amazingly, after the above sequence of errors, the system
booted normally.

Expected Results:  It should have booted without errors, and the SCSI probe time
should have been around 30 seconds.

Additional info:

Here's a small excerpt of the errors, which overflowed log_buf; I must recompile
with a larger buffer to capture the rest:


scsi2:0:10:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi2:0:10:0: Command already completed
aic7xxx_abort returns 0x2002
scsi2:0:10:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi2:0:10:0: Command already completed
aic7xxx_abort returns 0x2002
scsi: device set offline - not ready or command retry failed after bus reset:
host 2 channel 0 id 10 lun 0
scsi2:0:11:0: Attempting to queue an ABORT message
CDB: 0x12 0x0 0x0 0x0 0xff 0x0
scsi2:0:11:0: Command already completed
aic7xxx_abort returns 0x2002
scsi2:0:11:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi2:0:11:0: Command already completed
aic7xxx_abort returns 0x2002
scsi2:0:11:0: Attempting to queue a TARGET RESET message
CDB: 0x12 0x0 0x0 0x0 0xff 0x0
scsi2:0:11:0: Is not an active device
aic7xxx_dev_reset returns 0x2002
scsi2:0:11:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi2:0:11:0: Command already completed
aic7xxx_abort returns 0x2002
scsi2:0:11:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi2:0:11:0: Command already completed
aic7xxx_abort returns 0x2002
scsi: device set offline - not ready or command retry failed after bus reset:
host 2 channel 0 id 11 lun 0
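
For readers decoding the excerpt above, here is a minimal sketch (not part of the original report) that names the opcodes in the CDB lines and lists the targets that were set offline. The names OPCODES, decode_cdb and offline_targets are hypothetical helpers. Opcode 0x12 is INQUIRY and 0x00 is TEST UNIT READY, although an all-zero CDB in this log may just be a cleared command buffer rather than a real TEST UNIT READY.

```python
import re

# Opcode names for the two 6-byte commands that appear in the excerpt.
# Hypothetical lookup table; extend it if other opcodes show up in a fuller log.
OPCODES = {0x00: "TEST UNIT READY", 0x12: "INQUIRY"}


def decode_cdb(line):
    """Turn a 'CDB: 0x12 0x0 ...' log line into (opcode name, list of byte values)."""
    raw = [int(tok, 16) for tok in line.split(":", 1)[1].split()]
    name = OPCODES.get(raw[0], "opcode 0x%02x" % raw[0])
    return name, raw


def offline_targets(log_text):
    """Return (host, channel, id, lun) tuples found in 'device set offline' messages."""
    pattern = re.compile(r"host (\d+) channel (\d+) id (\d+) lun (\d+)")
    return [tuple(int(g) for g in m.groups()) for m in pattern.finditer(log_text)]
```

In the INQUIRY CDB shown in the log, byte 4 (0xff) is the allocation length, i.e. the probe asked for up to 255 bytes of inquiry data from a target that never answered.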

Comment 1 Arjan van de Ven 2003-07-30 22:05:14 UTC

This bug is a duplicate of another bug about this same issue.

Comment 2 Arjan van de Ven 2003-07-30 22:09:37 UTC

One question for the IBM folks is whether they are sure the $PIR table in the
BIOS is correct; an incorrect table seems to be the common cause of problems like this.

Comment 3 Michael K. Johnson 2003-08-06 14:05:01 UTC

We need confirmation regarding the $PIR table.
We have precisely one boot kernel.  It has to install on a wide
range of hardware, thus we rely on the $PIR table being intact
and precisely correct.

For the 440GX chipset, this was actually not the case, but after
a few years, a workaround was finally discovered, encoded as the
broken_pirq quirk.  Hack it into arch/i386/kernel/dmi_scan.c and
see if it works -- if it works, we're home free.

However, we can't really modify that quirk because it is really
too late to test the effect on other hardware, so if it doesn't
work, you'll need to code up a new quirk that fixes the problem.
Well, that or do a BIOS update that presents a working $PIR table
for a uniprocessor kernel to use.
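
As a way to check the "intact" half of that requirement from user space, here is a minimal sketch, assuming a saved copy of the BIOS region is available as bytes. It applies the usual structural checks for a $PIR table: signature, version 1.0, a size that is a multiple of 16 and at least the 32-byte header, and a byte sum of zero modulo 256. The function and constant names are hypothetical, and passing these checks says nothing about whether the routing entries themselves match the hardware.

```python
import struct

PIRQ_SIGNATURE = b"$PIR"
PIRQ_VERSION = 0x0100   # version 1.0
HEADER_SIZE = 32        # fixed header; the checksum is its last byte
SLOT_SIZE = 16          # one routing entry per PCI slot/device


def check_pir_table(buf):
    """Return the slot count if buf starts with a structurally valid $PIR table, else None."""
    if len(buf) < HEADER_SIZE or buf[:4] != PIRQ_SIGNATURE:
        return None
    version, size = struct.unpack_from("<HH", buf, 4)
    if version != PIRQ_VERSION or size < HEADER_SIZE or size % 16 or size > len(buf):
        return None
    # All bytes of the table, checksum included, must sum to 0 modulo 256.
    if sum(buf[:size]) & 0xFF:
        return None
    return (size - HEADER_SIZE) // SLOT_SIZE


def find_pir_table(image, step=16):
    """Scan a memory image on 16-byte boundaries; return (offset, slot count) or None."""
    for off in range(0, len(image) - HEADER_SIZE + 1, step):
        slots = check_pir_table(image[off:])
        if slots is not None:
            return off, slots
    return None
```

On a live machine the image would presumably be the 64 KiB BIOS area starting at physical address 0xF0000; how that region gets dumped is left out here because it depends on the system.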

Comment 4 James Cleverdon 2003-08-06 20:51:24 UTC

Hmmm...  The same problem happens with the installed kernel and it's not even
hitting the printks I put into the pirq_find_routing_table function.

It looks like the $PIR table isn't getting parsed at all before the system dies
with the sibling table bug.

Comment 5 Michael K. Johnson 2003-08-07 01:16:13 UTC

Which kernel were you working with?  I thought Sushi has a kernel with
the sibling table bug fixed...

The sibling table bug won't affect the BOOT kernel, since that's uniprocessor.

We could *easily* have more than one bug here; anything that breaks interrupt
routing in any way can do the same thing.  So start with the BOOT kernel and
investigate the $PIR table there.

Then, for anything that the sibling table bug is blocking, make sure you have
a recent kernel with it fixed. The changelog will have this entry:
- correct HT cpu pair detection (Arjan van de Ven)

Comment 6 Doug Ledford 2005-01-21 11:25:38 UTC

Closing this bug out as we are tracking this issue on a different bug.