I've got a server that has been running 6.1 for years. I started upgrading it to 8.0 when 9 came out, so now I'm trying 9. It has built-in SCSI on the motherboard and two added Tekram SCSI cards, all using Symbios chips. This worked fine in 6.1 and 8.0, but with 9, the installer hangs when loading the sym53c8xx module. Here is what it prints to the kernel message console (hand copied, since when I do a serial console during install the kernel messages are not available):

    scsi : aborting command due to timeout : pid 59, scsi2, channel 0, id 0, lun 0, 0x12 00 00 00 ff 00
    sym53c8xx_abort: pid 59 serial_number=60 serial_number_at_timeout=60
    SCSI host 2 abort (pid 59) timed out - resetting
    SCSI bus is being reset for host 2 channel 0
    sym53c8xx_reset: pid=59 reset_flags=2 serial_number=60 serial_number_at_timeout=60

It keeps doing this with an increasing pid and serial number until I reset the system. Host 2 is the first Tekram card I believe. What has changed since 6.1 and 8.0 WRT sym53c8xx that would cause this?
Okay, I tried sym53c8xx_2, and it failed as well:

    sym2:0:0: ABORT operation started.
    sym2:0:0: ABORT operation timed-out.
    sym2:0:0: DEVICE RESET operation started.
    sym2:0:0: DEVICE RESET operation timed-out.
    sym2:0:0: BUS RESET operation started.
    sym2:0:0: BUS RESET operation timed-out.
    sym2:0:0: HOST RESET operation started.
    sym2: SCSI BUS has been reset.

and that is it. No repeated messages or anything; but at that point nothing is happening on the system. Suggestions for the next step?
I rebuilt the boot floppy with kernel-BOOT-2.4.20-9.i386.rpm kernel and modules and tried both sym53c8xx and sym53c8xx_2; I got the same results as with the 2.4.20-8 kernel/modules.
I've also tried the old ncr53c8xx with the same results. However, if I build a new bootdisk with the kernel-BOOT from RHL 8.0 updates, it boots (but I get a traceback in anaconda when setting up LVM - I guess there is a mismatch there). I looked at the source to the 8.0 and 9 update kernels, and the ncr53c8xx driver is identical, and the sym53c8xx driver has a one-line change that I don't think is affecting this. Is it possible this is a compiler bug, or is there some other part of the kernel that could cause a permanent SCSI bus timeout?
I've created a boot floppy with the kernel/modules from the RHL 8.0 errata kernel-BOOT-2.4.18-27.8.0.i386.rpm that has all the necessary modules on the floppy (so no changes to the second stage image, although I had to make my kickstart %pre section manually load raid1, lvm-mod, jbd, and ext3). With that, I have a successful install of 9 (although it didn't reboot at the end even though I have "reboot" in my ks.cfg). After rebooting, it works fine. The kernel-smp-2.4.20-8.i686.rpm that was installed works fine with no SCSI hang while scanning the bus. There is definitely something odd with kernel-BOOT-2.4.20-[89].i386.rpm that will reliably cause a hang during the SCSI bus scan on this system.
This is assigned, but I don't see any action. This is a bigger problem now, as we're looking at moving to RHEL, and the taroon beta does the same thing. If we can't even boot the RHEL3 installer on three of our main servers, I'll have a harder time convincing others to buy RHEL. I've built a taroon install image with the "regular" kernel RPM instead of the -BOOT kernel RPM; I'll give that a try tomorrow.
I've done some more testing and discovered that only the SMP kernel works right. If I install (after futzing with the install image to use the BOOT kernel from RHL 8.0) and then try to boot the UP (but still i686) kernel, I get the same problem when the sym53c8xx module is loaded. Could this be interrupt related? Are interrupts routed or shared differently between UP and SMP kernels?
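One way to check the interrupt theory, on a system where a shell is still reachable while the driver is timing out, is to snapshot /proc/interrupts twice and see whether the counter for the SCSI card's IRQ moves at all. A minimal Python sketch, assuming the usual /proc/interrupts column layout; the sample text and the function names parse_interrupts and stalled_irqs are made up for illustration and are not output from this machine:

```python
def parse_interrupts(text):
    """Return {irq_label: (total_count, description)} from /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:ncpus]
        if len(counts) < ncpus or not all(c.isdigit() for c in counts):
            continue  # skip rows that do not carry one counter per CPU
        total = sum(int(c) for c in counts)
        table[label.strip()] = (total, " ".join(fields[ncpus:]))
    return table


def stalled_irqs(before, after, name):
    """IRQs whose description mentions `name` and whose count did not grow."""
    return [irq for irq, (count, desc) in after.items()
            if name in desc and irq in before and count == before[irq][0]]


# Hypothetical snapshots taken a few seconds apart (illustrative values only).
SAMPLE_BEFORE = """\
           CPU0
  0:      1000          XT-PIC  timer
 11:         0          XT-PIC  sym53c8xx
"""
SAMPLE_AFTER = """\
           CPU0
  0:      2000          XT-PIC  timer
 11:         0          XT-PIC  sym53c8xx
"""

if __name__ == "__main__":
    before = parse_interrupts(SAMPLE_BEFORE)
    after = parse_interrupts(SAMPLE_AFTER)
    print(stalled_irqs(before, after, "sym53c8xx"))
```

A counter that stays flat while the driver is waiting on commands would be consistent with interrupts from the card never reaching the kernel, though it does not prove it.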
I tried to boot the latest (as of 2004-01-27 morning) Fedora development tree installer on one of these boxes with the same result (hang during scan of SCSI bus from PCI card). Is anyone interested in this at all, or am I wasting my time? In 9 months, the only response I've had is email from others with the same problem; nobody from Red Hat has even commented. At least resolve it with WONTFIX if that is what is going to happen (and when we need to load something new on these boxes we'll look for something other than RHEL or FC).
Umm, why do you ask Red Hat to close it with "WONTFIX if that is what is going to happen", but at the same time close it yourself with CURRENTRELEASE? (I'm one of those interested in the problem, but not knowing enough to help.)
Gaah, I was just trying to look at the drop-down list to get the status names; I didn't notice there was JavaScript to auto-select the close radio button.
Hello, Chris. I apologize that this bugzilla had dropped through the cracks. I'm reassigning this to Doug Ledford for initial investigation. -ernie
I do still have one of these systems under my desk (i.e. out of production) that I will be happy to run any tests on (it has a test setup of RHEL ES 3 Update 1 on the drives at the moment, but I can blow that away too if needed).
This isn't a scsi driver bug, this is an interrupt routing issue. What is the actual machine this is in?
It is an Intel N440BX motherboard (boxed retail board), with dual Intel PIII 500MHz CPUs and 1G RAM. The SCSI cards are Tekram 390U2B. I updated the BIOS on the mboard to the latest (it didn't make any difference). Two of these systems are in RHN if you want DMI info or anything; see gnat2 and gnat3.hiwaay.net. The odd thing to me is that the SMP kernel works fine but the UP kernel always fails (no matter the SCSI driver, ncr53c8xx, sym53c8xx, or sym53c8xx_2).
Nope, not odd at all :-( See bz #29555 to see why this is happening, and why we have been around and around with Intel trying to get docs on these things and they won't give them to us. The basic gist of the issue is that your motherboard has a PCI BIOS with a fake $PIRQ table that the Linux kernel thinks it can use to do PCI IRQ mapping. It can't. The PCI IRQ mapping is only controllable via another chip, and if we mess with the $PIRQ interrupt routing registers, it has no effect. The SMP kernel includes IOAPIC support for interrupt routing, and that works. So: SMP kernels with IOAPIC IRQ routing are OK, while the UP kernel, which has only $PIRQ interrupt routing support, looks to the kernel like it should work but doesn't. We have to blacklist every system we run across with this chipset problem using the dmidecode data so that they will work with UP kernels. Now, it seems like we changed that blacklist between the RHEL3 release and the latest update, but I could be wrong. So, a RHEL3 U2 based CD install set *might* work on your machine. If it doesn't, then we need the dmidecode data so we can blacklist your BIOS just like the ones in bz #29555. (And although the information is in the RHN database, I don't have access to that, so I can't dig it out for myself.)
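To illustrate the idea of matching a machine against a blacklist by its dmidecode data, here is a rough Python sketch. It is not the kernel's actual blacklist code or table: the section and field names follow common dmidecode output, and the vendor and board strings, HYPOTHETICAL_BLACKLIST, parse_dmi and needs_workaround are all invented for the example.

```python
def parse_dmi(text):
    """Collect 'Key: Value' pairs per section from dmidecode-style text."""
    sections, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line starts a new section (e.g. "BIOS Information").
            current = sections.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return sections


def needs_workaround(dmi, blacklist):
    """True if every (section, key) substring of any one entry matches."""
    for entry in blacklist:
        if all(needle in dmi.get(section, {}).get(key, "")
               for (section, key), needle in entry.items()):
            return True
    return False


# Hypothetical entry: placeholder strings, not real DMI data from any board.
HYPOTHETICAL_BLACKLIST = [
    {("BIOS Information", "Vendor"): "Example BIOS Vendor",
     ("Base Board Information", "Product Name"): "EXAMPLE-BOARD"},
]

# Hypothetical dmidecode-style text (illustrative only).
SAMPLE_DMI = """\
BIOS Information
\tVendor: Example BIOS Vendor Inc.
\tVersion: 1.0
Base Board Information
\tManufacturer: Example
\tProduct Name: EXAMPLE-BOARD rev A
"""

if __name__ == "__main__":
    print(needs_workaround(parse_dmi(SAMPLE_DMI), HYPOTHETICAL_BLACKLIST))
```

Substring matching is used here so that minor suffixes in a vendor or product string do not defeat the match; whether the real blacklist matches that loosely is an assumption.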
I just PXE booted the RHEL3 U2 kernel and got the same result, so I guess mine isn't in the blacklist yet. I'll attach dmidecode output to this ticket.
Created attachment 100434
dmidecode output from hanging system
Can you try booting the RHEL3 U2 kernel with the command line option pci=biosirq and see if the install kernel works then?
No change - still get SCSI timeouts.
Created attachment 100512
DMI blacklist entry for this machine

This has been tested and shown to resolve the problem on this machine. Nominating for RHEL3 U3 inclusion.
Will this patch also be passed to the standard kernel (and Fedora)? I'm running RHEL on the affected systems, but that could change down the road.
The patch in comment #19 has just been committed to the RHEL3 U3 patch pool this evening (in kernel version 2.4.21-15.8.EL).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-433.html
Well, now I'm trying to install RHEL 4 on this system (RHEL 4 ES update 2 specifically), and it appears I'm hitting the same problem. The installer stops as soon as it loads the sym53c8xx module; I get:

    <6>PCI: Assigned IRQ 11 for device 0000:00:0b.0
    <6>sym0: <895> rev 0x1 at pci 0000:00:0b.0 irq 11
    <4>sym0: Tekram NVRAM, ID 7, Fast-40, LVD, parity checking
    <5>sym0: SCSI BUS has been reset.
    <6>scsi0 : sym-2.1.18j
    <4>sym0:0:0: ABORT operation started.
    <4>sym0:0:0: ABORT operation timed-out.
    <4>sym0:0:0: DEVICE RESET operation started.
    <4>sym0:0:0: DEVICE RESET operation timed-out.
    <4>sym0:0:0: BUS RESET operation started.
    <4>sym0:0:0: BUS RESET operation timed-out.
    <4>sym0:0:0: HOST RESET operation started.
    <5>sym0: SCSI BUS has been reset.
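For anyone comparing captures across machines, the abort/reset escalation in a log like the one above can be picked out mechanically. A small Python sketch, assuming the message wording shown in this report; the helper names timed_out_steps and full_escalation are made up for illustration:

```python
import re

# Matches lines such as "<4>sym0:0:0: ABORT operation timed-out."
STEP = re.compile(
    r"(sym\d+):\d+:\d+: (ABORT|DEVICE RESET|BUS RESET|HOST RESET) "
    r"operation (started|timed-out)"
)
EXPECTED = ["ABORT", "DEVICE RESET", "BUS RESET"]


def timed_out_steps(log_text):
    """Map each host (e.g. 'sym0') to the recovery steps that timed out, in order."""
    result = {}
    for host, step, state in STEP.findall(log_text):
        if state == "timed-out":
            result.setdefault(host, []).append(step)
    return result


def full_escalation(log_text):
    """Hosts where ABORT, DEVICE RESET and BUS RESET all timed out in turn."""
    return [host for host, steps in timed_out_steps(log_text).items()
            if steps[:3] == EXPECTED]


# Lines copied from the capture in this report.
SAMPLE_LOG = """\
<4>sym0:0:0: ABORT operation started.
<4>sym0:0:0: ABORT operation timed-out.
<4>sym0:0:0: DEVICE RESET operation started.
<4>sym0:0:0: DEVICE RESET operation timed-out.
<4>sym0:0:0: BUS RESET operation started.
<4>sym0:0:0: BUS RESET operation timed-out.
<4>sym0:0:0: HOST RESET operation started.
<5>sym0: SCSI BUS has been reset.
"""

if __name__ == "__main__":
    print(full_escalation(SAMPLE_LOG))
```

Every recovery step timing out in sequence is what one would expect if no interrupts from the controller are being delivered, which fits the routing explanation given earlier in this report.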
Reclosing RHEL3 bug. Please open a different bug report for RHEL4.