89593 – sym53c8xx hang during installer module load

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Doug Ledford
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-04-24 18:49 UTC by Chris Adams
Modified:	2007-11-30 22:06 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-11-22 00:30:30 UTC
Target Upstream Version:
Embargoed:

Attachments
dmidecode output from hanging system (11.89 KB, text/plain) 2004-05-21 20:18 UTC, Chris Adams
DMI blacklist entry for this machine (1.22 KB, patch) 2004-05-24 17:14 UTC, Doug Ledford

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2004:433	0	normal	SHIPPED_LIVE	Updated kernel packages available for Red Hat Enterprise Linux 3 Update 3	2004-09-02 04:00:00 UTC

Description Chris Adams 2003-04-24 18:49:39 UTC

I've got a server that has been running 6.1 for years.  I started upgrading it
to 8.0 when 9 came out, so now I'm trying 9.

It has built-in SCSI on the motherboard and two added Tekram SCSI cards, all
using Symbios chips.  This worked fine in 6.1 and 8.0, but with 9, the installer
hangs when loading the sym53c8xx module.

Here is what it prints to the kernel message console (hand copied, since when I
do a serial console during install the kernel messages are not available):

scsi : aborting command due to timeout : pid 59, scsi2, channel 0, id 0, lun 0,
0x12 00 00 00 ff 00
sym53c8xx_abort: pid 59 serial_number=60 serial_number_at_timeout=60
SCSI host 2 abort (pid 59) timed out - resetting
SCSI bus is being reset for host 2 channel 0
sym53c8xx_reset: pid=59 reset_flags=2 serial_number=60 serial_number_at_timeout=60

It keeps doing this with an increasing pid and serial number until I reset the
system.

Host 2 is the first Tekram card I believe.

What has changed since 6.1 and 8.0 WRT sym53c8xx that would cause this?

Comment 1 Chris Adams 2003-04-25 21:38:53 UTC

Okay, I tried sym53c8xx_2, and it failed as well:

sym2:0:0: ABORT operation started.
sym2:0:0: ABORT operation timed-out.
sym2:0:0: DEVICE RESET operation started.
sym2:0:0: DEVICE RESET operation timed-out.
sym2:0:0: BUS RESET operation started.
sym2:0:0: BUS RESET operation timed-out.
sym2:0:0: HOST RESET operation started.
sym2: SCSI BUS has been reset.

and that is it.  No repeated messages or anything; but at that point nothing is
happening on the system.

Suggestions for the next step?

Comment 2 Chris Adams 2003-04-28 14:31:47 UTC

I rebuilt the boot floppy with kernel-BOOT-2.4.20-9.i386.rpm kernel and modules
and tried both sym53c8xx and sym53c8xx_2; I got the same results as with the
2.4.20-8 kernel/modules.

Comment 3 Chris Adams 2003-04-30 19:04:27 UTC

I've also tried the old ncr53c8xx with the same results.  However, if I build a
new bootdisk with the kernel-BOOT from RHL 8.0 updates, it boots (but I get a
traceback in anaconda when setting up LVM - I guess there is a mismatch there).

I looked at the source to the 8.0 and 9 update kernels, and the ncr53c8xx driver
is identical, and the sym53c8xx driver has a one-line change that I don't think
is affecting this.  Is it possible this is a compiler bug, or is there some
other part of the kernel that could cause a permanent SCSI bus timeout?

Comment 4 Chris Adams 2003-05-01 18:07:14 UTC

I've created a boot floppy with the kernel/modules from the RHL 8.0 errata
kernel-BOOT-2.4.18-27.8.0.i386.rpm that has all the necessary modules on the
floppy (so no changes to the second stage image, although I had to make my
kickstart %pre section manually load raid1, lvm-mod, jbd, and ext3).  With that,
I have a successful install of 9 (although it didn't reboot at the end even
though I have "reboot" in my ks.cfg).

After rebooting, it works fine.  The kernel-smp-2.4.20-8.i686.rpm that was
installed works fine with no SCSI hang while scanning the bus.  There is
definitely something odd with kernel-BOOT-2.4.20-[89].i386.rpm that will
reliably cause a hang during the SCSI bus scan on this system.

Comment 5 Chris Adams 2003-10-21 00:08:11 UTC

This is assigned, but I don't see any action.  This is a bigger problem now, as
we're looking at moving to RHEL, and the taroon beta does the same thing.  If we
can't even boot the RHEL3 installer on three of our main servers, I'll have a
harder time convincing others to buy RHEL.

I've built a taroon install image with the "regular" kernel RPM instead of the
-BOOT kernel RPM; I'll give that a try tomorrow.

Comment 6 Chris Adams 2004-01-07 19:13:46 UTC

I've done some more testing and discovered that only the SMP kernel
works right.  If (after futzing with the install image to use the BOOT
kernel from RHL 8.0) after install I try to boot the UP (but still i686)
kernel, I get the same problem when the sym53c8xx module is loaded.

Could this be interrupt related?  Do the interrupts get routed or shared
differently between UP and SMP kernels?

Comment 7 Chris Adams 2004-01-27 21:56:43 UTC

I tried to boot the latest (as of 2004-01-27 morning) Fedora
development tree installer on one of these boxes with the same result
(hang during scan of SCSI bus from PCI card).

Is anyone interested in this at all, or am I wasting my time?  In 9
months, the only response I've had is email from others with the same
problem; nobody from Red Hat has even commented.  At least resolve it
with WONTFIX if that is what is going to happen (and when we need to
load something new on these boxes we'll look for something other than
RHEL or FC).

Comment 8 Göran Uddeborg 2004-01-27 22:09:48 UTC

Umm, why do you ask Red Hat to close it with "WONTFIX if that is what
is going to happen", but at the same time close it yourself with
CURRENTRELEASE?  (I'm one of those interested in the problem, but not
knowing enough to help.)

Comment 9 Chris Adams 2004-01-27 22:18:06 UTC

Gaah, I was just trying to look at the drop-down list to get the
status names; I didn't notice there was JavaScript to auto-select the
close radio button.

Comment 10 Ernie Petrides 2004-04-08 22:30:29 UTC

Hello, Chris.  I apologize that this bugzilla had dropped through
the cracks.  I'm reassigning this to Doug Ledford for initial
investigation.  -ernie

Comment 11 Chris Adams 2004-04-08 23:04:45 UTC

I do still have one of these systems under my desk (i.e. out of
production) that I will be happy to run any tests on (it has a test
setup of RHEL ES 3 Update 1 on the drives at the moment, but I can
blow that away too if needed).

Comment 12 Doug Ledford 2004-05-21 17:20:38 UTC

This isn't a scsi driver bug, this is an interrupt routing issue. 
What is the actual machine this is in?

Comment 13 Chris Adams 2004-05-21 17:30:19 UTC

It is an Intel N440BX motherboard (boxed retail board), with dual
Intel PIII 500MHz CPUs and 1G RAM.  The SCSI cards are Tekram 390U2B.
 I updated the BIOS on the mboard to the latest (it didn't make any
difference).  Two of these systems are in RHN if you want DMI info or
anything; see gnat2 and gnat3.hiwaay.net.

The odd thing to me is that the SMP kernel works fine but the UP
kernel always fails (no matter the SCSI driver, ncr53c8xx, sym53c8xx,
or sym53c8xx_2).

Comment 14 Doug Ledford 2004-05-21 20:06:31 UTC

Nope, not odd at all :-(  See bz #29555 to see why this is happening,
and why we have been around and around with Intel trying to get docs
on these things and they won't give it to us.  The basic gist of the
issue is that your motherboard has a PCI BIOS with a fake $PIRQ table
that the linux kernel thinks it can use to do PCI IRQ mapping.  It
can't.  The PCI IRQ mapping is only controllable via another chip, and
if we mess with the $PIRQ interrupt routing registers, it has no
effect.  The smp kernel includes IOAPIC support for interrupt routing
and that works.  So, smp kernels with IOAPIC IRQ routing: OK, up
kernel with only $PIRQ interrupt routing support looks to the kernel
like it should work but doesn't.  We have to blacklist every system we
run across with this chipset problem using the dmidecode data so that
they will work with up kernels.

Now, it seems like we changed that blacklist between RHEL3 release
and the latest update, but I could be wrong.  So, a RHEL3 U2 based CD
install set *might* work on your machine.  If it doesn't, then we need
the dmidecode data so we can blacklist your BIOS just like the ones in
bz #29555.  (And although the information is in the RHN database, I
don't have access to that, so I can't dig it out for myself).
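
The blacklist idea above can be illustrated with a rough sketch. This is not the kernel code and not the contents of any actual patch: it is a small Python illustration that reads dmidecode-style text and checks it against a table of known-bad boards. The function names, the section and field labels, the substring matching rule, and the "Example Vendor" entry are all assumptions made for the example, and the layout assumed is the one where section names are unindented and fields are tab-indented.

```python
# Hypothetical sketch of DMI-based blacklisting. All names, field labels and
# the blacklist entry are illustrative assumptions, not real kernel data.

def parse_dmi(text):
    """Collect 'Key: Value' fields from dmidecode-style text, keyed by section."""
    fields = {}
    section = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if not line[0].isspace():
            # Unindented lines start a new block; the last one seen before
            # the indented fields is treated as the section name.
            section = line.strip()
        elif ":" in line and section is not None:
            key, _, value = line.strip().partition(":")
            fields[(section, key.strip())] = value.strip()
    return fields


def is_blacklisted(fields, blacklist):
    """True if every substring of at least one blacklist entry is present."""
    for entry in blacklist:
        if all(want in fields.get(key, "") for key, want in entry.items()):
            return True
    return False


# Made-up entry: a board is flagged when both strings match.
BLACKLIST = [
    {
        ("Base Board Information", "Manufacturer"): "Example Vendor",
        ("Base Board Information", "Product Name"): "EXAMPLE-BOARD",
    },
]

# Made-up dmidecode-style input, not taken from any real machine.
SAMPLE = """Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
\tManufacturer: Example Vendor Corp.
\tProduct Name: EXAMPLE-BOARD rev 2
"""

if __name__ == "__main__":
    if is_blacklisted(parse_dmi(SAMPLE), BLACKLIST):
        print("board matches blacklist: do not trust $PIRQ routing")
    else:
        print("board not in blacklist")
```

Substring matching is used here so that one entry can cover several BIOS revisions of the same board; an exact-match rule would need a separate entry for each revision.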

Comment 15 Chris Adams 2004-05-21 20:17:46 UTC

I just PXE booted the RHEL3 U2 kernel and got the same result, so I
guess mine isn't in the blacklist yet.  I'll attach dmidecode output
to this ticket.

Comment 16 Chris Adams 2004-05-21 20:18:24 UTC

Created attachment 100434
dmidecode output from hanging system

Comment 17 Doug Ledford 2004-05-21 21:44:52 UTC

Can you try booting the RHEL3 U2 kernel with the command line option
pci=biosirq and see if the install kernel works then?

Comment 18 Chris Adams 2004-05-21 21:53:04 UTC

No change - still get SCSI timeouts.

Comment 19 Doug Ledford 2004-05-24 17:14:57 UTC

Created attachment 100512
DMI blacklist entry for this machine

This has been tested and shown to resolve the problem on this machine. 
Nominating for RHEL3 U3 inclusion.

Comment 20 Chris Adams 2004-05-25 13:20:43 UTC

Will this patch also be passed to the standard kernel (and Fedora)? 
I'm running RHEL on the affected systems, but that could change down
the road.

Comment 21 Ernie Petrides 2004-06-09 04:24:43 UTC

The patch in comment #19 has just been committed to the RHEL3 U3
patch pool this evening (in kernel version 2.4.21-15.8.EL).

Comment 22 John Flanagan 2004-09-02 04:30:37 UTC

An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html

Comment 23 Chris Adams 2005-11-21 19:43:12 UTC

Well, now I'm trying to install RHEL 4 on this system (RHEL 4 ES update 2
specifically), and it appears I'm hitting the same problem.  The installer stops
as soon as it loads the sym53c8xx module; I get:

<6>PCI: Assigned IRQ 11 for device 0000:00:0b.0
<6>sym0: <895> rev 0x1 at pci 0000:00:0b.0 irq 11
<4>sym0: Tekram NVRAM, ID 7, Fast-40, LVD, parity checking
<5>sym0: SCSI BUS has been reset.
<6>scsi0 : sym-2.1.18j
<4>sym0:0:0: ABORT operation started.
<4>sym0:0:0: ABORT operation timed-out.
<4>sym0:0:0: DEVICE RESET operation started.
<4>sym0:0:0: DEVICE RESET operation timed-out.
<4>sym0:0:0: BUS RESET operation started.
<4>sym0:0:0: BUS RESET operation timed-out.
<4>sym0:0:0: HOST RESET operation started.
<5>sym0: SCSI BUS has been reset.

Comment 24 Ernie Petrides 2005-11-22 00:30:30 UTC

Reclosing RHEL3 bug.  Please open a different bug report for RHEL4.
