Red Hat Bugzilla – Bug 89593
sym53c8xx hang during installer module load
Last modified: 2007-11-30 17:06:55 EST
I've got a server that has been running 6.1 for years. I started upgrading it
to 8.0 when 9 came out, so now I'm trying 9.
It has built-in SCSI on the motherboard and two added Tekram SCSI cards, all
using Symbios chips. This worked fine in 6.1 and 8.0, but with 9, the installer
hangs when loading the sym53c8xx module.
Here is what it prints to the kernel message console (hand copied, since when I
do a serial console during install the kernel messages are not available):
scsi : aborting command due to timeout : pid 59, scsi2, channel 0, id 0, lun 0,
0x12 00 00 00 ff 00
sym53c8xx_abort: pid 59 serial_number=60 serial_number_at_timeout=60
SCSI host 2 abort (pid 59) timed out - resetting
SCSI bus is being reset for host 2 channel 0
sym53c8xx_reset: pid=59 reset_flags=2 serial_number=60 serial_number_at_timeout=60
It keeps doing this with an increasing pid and serial number until I reset the
Host 2 is the first Tekram card, I believe.
What has changed since 6.1 and 8.0 WRT sym53c8xx that would cause this?
Okay, I tried sym53c8xx_2, and it failed as well:
sym2:0:0: ABORT operation started.
sym2:0:0: ABORT operation timed-out.
sym2:0:0: DEVICE RESET operation started.
sym2:0:0: DEVICE RESET operation timed-out.
sym2:0:0: BUS RESET operation started.
sym2:0:0: BUS RESET operation timed-out.
sym2:0:0: HOST RESET operation started.
sym2: SCSI BUS has been reset.
and that is it. No repeated messages or anything; but at that point nothing is
happening on the system.
Suggestions for the next step?
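The sym53c8xx_2 messages above show the SCSI error handler escalating from ABORT to DEVICE RESET to BUS RESET to HOST RESET. As a reading aid, here is a minimal Python sketch that reduces a pasted console log to the last event seen for each recovery step. The function name `summarize_recovery` and the regular expression are assumptions made for this illustration; they are not part of the driver or of any Red Hat tool, and the pattern only covers the message shapes quoted in this report.

```python
import re

# Matches lines such as "sym2:0:0: ABORT operation timed-out." with an
# optional "<4>"-style kernel log level prefix. Assumption: only the
# message shapes quoted in this bug report need to be recognised.
LINE_RE = re.compile(
    r"^(?:<\d>)?(sym\d+):\d+:\d+: (.+?) operation (started|timed-out)\.$"
)


def summarize_recovery(log_text):
    """Return [(host, operation, last_event), ...] in first-seen order."""
    last_event = {}
    order = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # e.g. "sym2: SCSI BUS has been reset."
        host, operation, event = match.groups()
        key = (host, operation)
        if key not in last_event:
            order.append(key)
        last_event[key] = event
    return [(host, op, last_event[(host, op)]) for host, op in order]


if __name__ == "__main__":
    sample = """\
sym2:0:0: ABORT operation started.
sym2:0:0: ABORT operation timed-out.
sym2:0:0: DEVICE RESET operation started.
sym2:0:0: DEVICE RESET operation timed-out.
sym2:0:0: BUS RESET operation started.
sym2:0:0: BUS RESET operation timed-out.
sym2:0:0: HOST RESET operation started.
sym2: SCSI BUS has been reset.
"""
    for host, operation, event in summarize_recovery(sample):
        print(f"{host}: {operation} -> {event}")
```

A step whose last event is "started" never reported a timeout in the log, which is how the HOST RESET line stands out in the output above.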
I rebuilt the boot floppy with kernel-BOOT-2.4.20-9.i386.rpm kernel and modules
and tried both sym53c8xx and sym53c8xx_2; I got the same results as with the
I've also tried the old ncr53c8xx with the same results. However, if I build a
new bootdisk with the kernel-BOOT from RHL 8.0 updates, it boots (but I get a
traceback in anaconda when setting up LVM - I guess there is a mismatch there).
I looked at the source to the 8.0 and 9 update kernels, and the ncr53c8xx driver
is identical, and the sym53c8xx driver has a one-line change that I don't think
is affecting this. Is it possible this is a compiler bug, or is there some
other part of the kernel that could cause a permanent SCSI bus timeout?
I've created a boot floppy with the kernel/modules from the RHL 8.0 errata
kernel-BOOT-2.4.18-27.8.0.i386.rpm that has all the necessary modules on the
floppy (so no changes to the second stage image, although I had to make my
kickstart %pre section manually load raid1, lvm-mod, jbd, and ext3). With that,
I have a successful install of 9 (although it didn't reboot at the end even
though I have "reboot" in my ks.cfg).
After rebooting, it works fine. The kernel-smp-2.4.20-8.i686.rpm that was
installed works fine with no SCSI hang while scanning the bus. There is
definitely something odd with kernel-BOOT-2.4.20-.i386.rpm that will
reliably cause a hang during the SCSI bus scan on this system.
This is assigned, but I don't see any action. This is a bigger problem now, as
we're looking at moving to RHEL, and the taroon beta does the same thing. If we
can't even boot the RHEL3 installer on three of our main servers, I'll have a
harder time convincing others to buy RHEL.
I've built a taroon install image with the "regular" kernel RPM instead of the
-BOOT kernel RPM; I'll give that a try tomorrow.
I've done some more testing and discovered that only the SMP kernel
works right. If I install (after futzing with the install image to use
the BOOT kernel from RHL 8.0) and then try to boot the UP (but still
i686) kernel, I get the same problem when the sym53c8xx module is loaded.
Could this be interrupt related? Do the interrupts get routed or shared
differently between UP and SMP kernels?
I tried to boot the latest (as of 2004-01-27 morning) Fedora
development tree installer on one of these boxes with the same result
(hang during scan of SCSI bus from PCI card).
Is anyone interested in this at all, or am I wasting my time? In 9
months, the only response I've had is email from others with the same
problem; nobody from Red Hat has even commented. At least resolve it
with WONTFIX if that is what is going to happen (and when we need to
load something new on these boxes we'll look for something other than
RHEL or FC).
Umm, why do you ask Red Hat to close it with "WONTFIX if that is what
is going to happen", but at the same time close it yourself with
CURRENTRELEASE? (I'm one of those interested in the problem, but I
don't know enough to help.)
Gaah, I was just trying to look at the drop-down list to get the
close radio button.
Hello, Chris. I apologize that this bugzilla had dropped through
the cracks. I'm reassigning this to Doug Ledford for initial
I do still have one of these systems under my desk (i.e. out of
production) that I will be happy to run any tests on (it has a test
setup of RHEL ES 3 Update 1 on the drives at the moment, but I can
blow that away too if needed).
This isn't a SCSI driver bug; it's an interrupt routing issue.
What is the actual machine this is in?
It is an Intel N440BX motherboard (boxed retail board), with dual
Intel PIII 500MHz CPUs and 1G RAM. The SCSI cards are Tekram 390U2B.
I updated the BIOS on the mboard to the latest (it didn't make any
difference). Two of these systems are in RHN if you want DMI info or
anything; see gnat2 and gnat3.hiwaay.net.
The odd thing to me is that the SMP kernel works fine but the UP
kernel always fails (no matter the SCSI driver, ncr53c8xx, sym53c8xx,
Nope, not odd at all :-( See bz #29555 to see why this is happening,
and why we have been around and around with Intel trying to get docs
on these things, which they won't give us. The basic gist of the
issue is that your motherboard has a PCI BIOS with a fake $PIRQ table
that the Linux kernel thinks it can use to do PCI IRQ mapping. It
can't. The PCI IRQ mapping is only controllable via another chip, and
if we mess with the $PIRQ interrupt routing registers, it has no
effect. The SMP kernel includes IOAPIC support for interrupt routing,
and that works. So, SMP kernels with IOAPIC IRQ routing: OK. The UP
kernel, with only $PIRQ interrupt routing support, looks to the kernel
like it should work but doesn't. We have to blacklist every system we
run across with this chipset problem using the dmidecode data so that
they will work with UP kernels.
Now, it seems like we changed that blacklist between the RHEL3 release
and the latest update, but I could be wrong. So, a RHEL3 U2 based CD
install set *might* work on your machine. If it doesn't, then we need
the dmidecode data so we can blacklist your BIOS just like the ones in
bz #29555. (And although the information is in the RHN database, I
don't have access to that, so I can't dig it out for myself.)
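The blacklisting described above can be sketched roughly as follows. This is an illustration only, not the kernel's code: the names `DmiQuirk`, `find_quirk`, and `irq_routing_for`, the DMI field names, and the "Example Vendor" / "EXAMPLE-BX" strings are all hypothetical, and matching by substring is an assumption of this sketch. The real entry for this machine is the one built from its dmidecode output.

```python
from dataclasses import dataclass, field


@dataclass
class DmiQuirk:
    """One hypothetical blacklist rule: every listed field must match."""
    description: str
    matches: dict = field(default_factory=dict)  # DMI field -> substring


def find_quirk(dmi_fields, blacklist):
    """Return the first quirk whose substrings all appear in dmi_fields."""
    for quirk in blacklist:
        if quirk.matches and all(
            needle in dmi_fields.get(name, "")
            for name, needle in quirk.matches.items()
        ):
            return quirk
    return None


# Placeholder entry; the vendor and board strings are invented for the example.
BLACKLIST = [
    DmiQuirk(
        "board whose $PIRQ table cannot be used for IRQ routing",
        {"board_vendor": "Example Vendor", "board_name": "EXAMPLE-BX"},
    ),
]


def irq_routing_for(dmi_fields, kernel_has_ioapic):
    """Pick a routing strategy label for this sketch.

    "ioapic"      -> SMP-style kernel, IOAPIC routing (works per the thread)
    "board-quirk" -> UP kernel on a blacklisted board: do not trust $PIRQ
    "pirq"        -> UP kernel on a board whose $PIRQ table is believed good
    """
    if kernel_has_ioapic:
        return "ioapic"
    if find_quirk(dmi_fields, BLACKLIST) is not None:
        return "board-quirk"
    return "pirq"
```

For example, `irq_routing_for({"board_vendor": "Example Vendor Corp", "board_name": "EXAMPLE-BX rev 2"}, kernel_has_ioapic=False)` selects the quirk path, while the same board with `kernel_has_ioapic=True` never consults the blacklist. That mirrors why the SMP kernel works on an unlisted board and the UP kernel does not.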
I just PXE booted the RHEL3 U2 kernel and got the same result, so I
guess mine isn't in the blacklist yet. I'll attach dmidecode output
to this ticket.
Created attachment 100434
dmidecode output from hanging system
Can you try booting the RHEL3 U2 kernel with the command line option
pci=biosirq and see if the install kernel works then?
No change - still get SCSI timeouts.
Created attachment 100512
DMI blacklist entry for this machine
This has been tested and shown to resolve the problem on this machine.
Nominating for RHEL3 U3 inclusion.
Will this patch also be passed to the standard kernel (and Fedora)?
I'm running RHEL on the affected systems, but that could change down
The patch in comment #19 has just been committed to the RHEL3 U3
patch pool this evening (in kernel version 2.4.21-15.8.EL).
An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
Well, now I'm trying to install RHEL 4 on this system (RHEL 4 ES update 2
specifically), and it appears I'm hitting the same problem. The installer stops
as soon as it loads the sym53c8xx module; I get:
<6>PCI: Assigned IRQ 11 for device 0000:00:0b.0
<6>sym0: <895> rev 0x1 at pci 0000:00:0b.0 irq 11
<4>sym0: Tekram NVRAM, ID 7, Fast-40, LVD, parity checking
<5>sym0: SCSI BUS has been reset.
<6>scsi0 : sym-2.1.18j
<4>sym0:0:0: ABORT operation started.
<4>sym0:0:0: ABORT operation timed-out.
<4>sym0:0:0: DEVICE RESET operation started.
<4>sym0:0:0: DEVICE RESET operation timed-out.
<4>sym0:0:0: BUS RESET operation started.
<4>sym0:0:0: BUS RESET operation timed-out.
<4>sym0:0:0: HOST RESET operation started.
<5>sym0: SCSI BUS has been reset.
Reclosing RHEL3 bug. Please open a different bug report for RHEL4.