Description of problem:
With a fresh installation of RHEL 3.0 (final release) on an IBM xSeries 445 with four 2.5GHz Xeon CPUs, booting the SMP kernel almost always (but not quite 100% of the time) fails. The problem occurs when the mptscsih module is loaded. A sequence of SCSI ABORT IO messages begins and appears to continue indefinitely (or at least to 77438 iterations, which was as far as I ever let it go). The problem does not occur with the uniprocessor kernel in the same release. Nor does it occur with the SMP kernel in the beta 2 release or with a SuSE 8.0 release, all tested on the exact same hardware. I also notice a few messages about an unexpected IO-APIC in the output (included). I do not know if they are significant.
Version-Release number of selected component (if applicable):
kernel-2.4.21-4.EL
How reproducible:
Not always, but nearly so.
Steps to Reproduce:
1. Boot the SMP kernel on an IBM xSeries 445 with 4 CPUs.
Actual results:
Boot fails.
Expected results:
Boot succeeds.
Additional info:
Linux version 2.4.21-4.ELsmp (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-20)) #1 SMP Fri Oct 3 17:52:56 EDT 2003
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009c400 (usable)
BIOS-e820: 000000000009c400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000effa1a00 (usable)
BIOS-e820: 00000000effa1a00 - 00000000effac340 (ACPI data)
BIOS-e820: 00000000effac340 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000210000000 (usable)
7552MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 0009c540
hm, page 0009c000 reserved twice.
hm, page 0009d000 reserved twice.
hm, page 0009d000 reserved twice.
hm, page 0009e000 reserved twice.
On node 0 totalpages: 2162688
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 1933312 pages.
ACPI: Searched entire block, no RSDP was found. ACPI: RSDP located at physical address c00fdfc0 RSD PTR v0 [IBM ] __va_range(0xeffac2c0, 0x68): idx=33 mapped at fffdd000 ACPI table found: RSDT v1 [IBM SERVIGIL 0.4096] __va_range(0xeffac240, 0x24): idx=33 mapped at fffdd000 __va_range(0xeffac240, 0x74): idx=33 mapped at fffdd000 ACPI table found: FACP v1 [IBM SERVIGIL 0.4096] __va_range(0xeffac180, 0x24): idx=33 mapped at fffdd000 __va_range(0xeffac180, 0x9a): idx=33 mapped at fffdd000 ACPI table found: APIC v1 [IBM SERVIGIL 0.4096] __va_range(0xeffac180, 0x9a): idx=33 mapped at fffdd000 LAPIC (acpi_id[0x0000] id[0x0] enabled[1]) CPU 0 (0x0000) enabledProcessor #0 Pentium 4(tm) XEON(tm) APIC version 20 LAPIC (acpi_id[0x0001] id[0x2] enabled[1]) CPU 1 (0x0200) enabledProcessor #2 Pentium 4(tm) XEON(tm) APIC version 20 LAPIC (acpi_id[0x0004] id[0x10] enabled[1]) CPU 2 (0x1000) enabledProcessor #16 Pentium 4(tm) XEON(tm) APIC version 20 LAPIC (acpi_id[0x0005] id[0x12] enabled[1]) CPU 3 (0x1200) enabledProcessor #18 Pentium 4(tm) XEON(tm) APIC version 20 IOAPIC (id[0xe] address[0xfec00000] glob IOAPIC (id[0xd] address[0xfec01000] glob INT_SRC_OVR (bus[0] irq[0x8] global_irq[0x8] polarity[0x3] trigger[0x1]) INT_SRC_OVR (bus[0] irq[0xe] global_irq[ INT_SRC_OVR (bus[0] irq[0xb] global_irq[ LAPIC_NMI (acpi_id[0x0000] polarity[0x0] LAPIC_NMI (acpi_id[0x0001] polarity[0x0] LAPIC_NMI (acpi_id[0x0004] polarity[0x0] trigger[0x0] lint[0x1]) LAPIC_NMI (acpi_id[0x0005] polarity[0x0] trigger[0x0] lint[0x1]) 4 CPUs total Local APIC address fee00000 __va_range(0xeffac0c0, 0x24): idx=33 mapped at fffdd000 __va_range(0xeffac0c0, 0xc0): idx=33 mapped at fffdd000 ACPI table found: SRAT v1 [IBM SERVIGIL 0.4096] __va_range(0xeffa6500, 0x24): idx=33 mapped at fffdd000 __va_range(0xeffa6500, 0x5745): idx=33 mapped at fffdd000 ACPI table found: SSDT v1 [IBM VIGSSDT0 0.4096] Enabling the CPU's according to the ACPI table Intel MultiProcessor Specification v1.4 Virtual Wire compatibility mode. 
OEM ID: IBM ENSW Product ID: VIGIL SMP APIC at: 0xFEE00000 I/O APIC #14 Version 17 at 0xFEC00000. I/O APIC #13 Version 17 at 0xFEC01000. Processors: 4 xAPIC support is present Enabling APIC mode: Physical. Using 2 I/O APICs IBM machine detected. Enabling interrupts during APM calls. Kernel command line: ro root=/dev/sda5 console=tty0 console=ttyS0,9600n8 Initializing CPU#0 Summit chipset: Starting Cyclone Counter. Detected 2494.930 MHz processor. Console: colour VGA+ 80x25 Calibrating delay loop... 198.65 BogoMIPS Memory: 8251588k/8650752k available (1683k kernel code, 132016k reserved, 1318k data, 224k init, 7470724k highmem) Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes) Inode cache hash table entries: 524288 (order: 10, 4194304 bytes) Mount cache hash table entries: 512 (order: 0, 4096 bytes) Buffer cache hash table entries: 1048576 (order: 10, 4194304 bytes) Page-cache hash table entries: 1048576 (order: 10, 4194304 bytes) CPU: Trace cache: 12K uops, L1 D cache: 8K CPU: L2 cache: 512K CPU: L3 cache: 1024K CPU: Hyper-Threading is disabled Intel machine check architecture supported. Intel machine check reporting enabled on CPU#0. Enabling fast FPU save and restore... done. Enabling unmasked SIMD FPU exception support... done. Checking 'hlt' instruction... OK. POSIX conformance testing by UNIFIX mtrr: v1.40 (20010327) Richard Gooch (rgooch.au) mtrr: detected mtrr type: Intel CPU: Trace cache: 12K uops, L1 D cache: 8K CPU: L2 cache: 512K CPU: L3 cache: 1024K CPU: Hyper-Threading is disabled Intel machine check reporting enabled on CPU#0. CPU0: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05 per-CPU timeslice cutoff: 1463.12 usecs. task migration cache decay timeout: 10 msecs. enabled ExtINT on CPU#0 Leaving ESR disabled. Booting processor 1/2 eip 2000 Initializing CPU#1 masked ExtINT on CPU#1 Leaving ESR disabled. Calibrating delay loop... 
0.99 BogoMIPS CPU: Trace cache: 12K uops, L1 D cache: 8K CPU: L2 cache: 512K CPU: L3 cache: 1024K CPU: Hyper-Threading is disabled Intel machine check reporting enabled on CPU#1. CPU1: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05 Booting processor 2/16 eip 2000 Initializing CPU#2 masked ExtINT on CPU#2 Leaving ESR disabled. Calibrating delay loop... 1.01 BogoMIPS CPU: Trace cache: 12K uops, L1 D cache: 8K CPU: L2 cache: 512K CPU: L3 cache: 1024K CPU: Hyper-Threading is disabled Intel machine check reporting enabled on CPU#2. CPU2: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05 Booting processor 3/18 eip 2000 Initializing CPU#3 masked ExtINT on CPU#3 Leaving ESR disabled. Calibrating delay loop... 1.88 BogoMIPS CPU: Trace cache: 12K uops, L1 D cache: 8K CPU: L2 cache: 512K CPU: L3 cache: 1024K CPU: Hyper-Threading is disabled Intel machine check reporting enabled on CPU3: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05 Total of 4 processors activated (202.56 apic 0 pin 46 is an SMI pin! ENABLING IO-APIC IRQs Setting 14 in the phys_id_present_map ...changing IO-APIC physical APIC ID to 14 ... ok. Setting 13 in the phys_id_present_map ...changing IO-APIC physical APIC ID to 13 ... ok. ..TIMER: vector=0x31 pin1=0 pin2=-1 testing the IO APIC....................... An unexpected IO-APIC was found. If this kernel release is less than three months old please report this to linux-smp.org An unexpected IO-APIC was found. If this kernel release is less than three months old please report this to linux-smp.org .................................... done. Using local APIC timer interrupts. calibrating APIC timer ... ..... CPU clock speed is 2494.5906 MHz. ..... host bus clock speed is 99.7835 MHz. 
cpu: 0, clocks: 997835, slice: 199567 CPU0<T0:997824,T1:798256,D:1,S:199567,C:997835> cpu: 1, clocks: 997835, slice: 199567 cpu: 2, clocks: 997835, slice: 199567 cpu: 3, clocks: 997835, slice: 199567 CPU2<T0:997824,T1:399120,D:3,S:199567,C:997835> CPU3<T0:997824,T1:199552,D:4,S:199567,C:997835> CPU1<T0:997824,T1:598688,D:2,S:199567,C:997835> zapping low mappings. Process timing init...done. Starting migration thread for cpu 0 Starting migration thread for cpu 1 Starting migration thread for cpu 2 Starting migration thread for cpu 3 PCI: PCI BIOS revision 2.10 entry at 0xfd47d, last bus=11 PCI: Using configuration type 1 PCI: Probing PCI hardware PCI: Discovered peer bus 01 PCI: Discovered peer bus 02 PCI: Discovered peer bus 05 PCI: Discovered peer bus 07 PCI: Discovered peer bus 09 PCI->APIC IRQ transform: (B0,I3,P0) -> 39 PCI->APIC IRQ transform: (B0,I4,P0) -> 16 PCI->APIC IRQ transform: (B0,I5,P3) -> 18 PCI->APIC IRQ transform: (B0,I5,P3) -> 18 PCI->APIC IRQ transform: (B1,I3,P0) -> 40 PCI->APIC IRQ transform: (B1,I3,P1) -> 41 PCI->APIC IRQ transform: (B1,I4,P0) -> 42 PCI->APIC IRQ transform: (B1,I4,P1) -> 11 PCI->APIC IRQ transform: (B5,I4,P0) -> 71 PCI: Enabling Via external APIC routing PCI: Via IRQ fixup for 00:05.2, from 11 to 2 PCI: Via IRQ fixup for 00:05.3, from 11 to 2 isapnp: Scanning for PnP cards... isapnp: No Plug & Play device found Linux NET4.0 for Linux 2.4 Based upon Swansea University Computer Society NET3.039 Initializing RT netlink socket apm: BIOS not found. Total HugeTLB memory allocated, 0 Starting kswapd allocated 32 pages and 32 bhs reserved for the highmem bounces VFS: Disk quotas vdquot_6.5.1 aio_setup: num_physpages = 540672 aio_setup: sizeof(struct page) = 60 Hugetlbfs mounted. 
pty: 2048 Unix98 ptys configured Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ SER IAL_PCI ISAPNP enabled ttyS0 at 0x03f8 (irq = 4) is a 16550A Real Time Clock Driver v1.10e NET4: Frame Diverter 0.46 RAMDISK driver initialized: 256 RAM disks of 8192K size 1024 blocksize Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx VP_IDE: IDE controller at PCI slot 00:05.1 VP_IDE: chipset revision 6 VP_IDE: not 100% native mode: will probe irqs later ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:05.1 ide0: BM-DMA at 0x0700-0x0707, BIOS settings: hda:pio, hdb:pio ide1: BM-DMA at 0x0708-0x070f, BIOS settings: hdc:pio, hdd:pio hda: MATSHITADVD-ROM SR-8177, ATAPI CD/DVD-ROM drive ide0 at 0x1f0-0x1f7,0x3f6 on irq 14 ide-floppy driver 0.99.newide ide-floppy driver 0.99.newide md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27 md: Autodetecting RAID arrays. md: autorun ... md: ... autorun DONE. pci_hotplug: PCI Hot Plug PCI Core version: 0.5 Initializing Cryptographic API NET4: Linux TCP/IP 1.0 for NET4.0 IP: routing cache hash table of 131072 b TCP: Hash tables configured (established Linux IP multicast router 0.06 plus PIM-SM Initializing IPsec netlink socket NET4: Unix domain sockets 1.0/SMP for Linux NET4.0. RAMDISK: Compressed image found at block 0 Freeing initrd memory: 276k freed VFS: Mounted root (ext2 filesystem). Red Hat nash verSCSI subsystem driver Revision: 1.00 sion 3.5.13 starFusion MPT base driveting Loading scr 2.05.05+ Copyright (c) 1999-2002 LSI Logic Corporation o module Loading sd_mod. Loadinmptbase: Initiating ioc0 bringup g mptbase.o module ioc0: 53C1030: Capabilities={Initiator} mptbase: Initiating ioc1 bringup ioc1: 53C1030: Capabilities={Initiator} mptbase: 2 MPT adapters found, 2 installed. 
Loading mptscsihFusion MPT SCSI Host driver 2.05.05+ .o module scsi0 : ioc0: LSI53C1030, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=40 scsi1 : ioc1: LSI53C1030, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=41 Starting timer : 0 0 blk: queue f678ae18, I/O limit 4294967295Mb (mask 0xffffffffffffffff) Vendor: IBM-ESXS Model: MAP3367NC FN Rev: B109 Type: Direct-Access ANSI SCSI revision: 03 Starting timer : 0 0 blk: queue f678ac18, I/O limit 4294967295Mb (mask 0xffffffffffffffff) Vendor: IBM-ESXS Model: MAP3367NC FN Rev: B109 Type: Direct-Access ANSI SCSI revision: 03 Starting timer : 0 0 blk: queue f678aa18, I/O limit 4294967295Mb (mask 0xffffffffffffffff) scsi : aborting command due to timeout : pid 2, scsi0, channel 0, id 2, lun 0 In quiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f678a800) IOs outstanding = 1 scsi : aborting command due to timeout : pid 3, scsi0, channel 0, id 3, lun 0 In quiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 2 SCSI host 0 abort (pid 3) timed out - resetting SCSI bus is being reset for host 0 channel 0. mptscsih: OldReset scheduling BUS_RESET (sc=f66ffc00) IOs outstanding = 1 scsi : aborting command due to timeout : pid 4, scsi0, channel 0, id 4, lun 0 In quiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 1 scsi : aborting command due to timeout : pid 5, scsi0, channel 0, id 5, lun 0 In quiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 2 SCSI host 0 abort (pid 5) timed out - resetting SCSI bus is being reset for host 0 channel 0. mptscsih: OldReset scheduling BUS_RESET (sc=f66ffc00) IOs outstanding = 1 mptbase: Initiating ioc0 recovery mptbase: ioc0: WARNING - Unexpected doorbell active! mptbase: ioc0: ERROR - Wait IOC_READY state timeout(1500)! mptbase: ioc0: ERROR - Failed to come READY after reset! mptbase: ioc0 NOT READY WARNING! 
mptbase: WARNING - (-1) Cannot recover ioc0 Firmware Reload FAILED!! scsi : aborting command due to timeout : pid 6, scsi0, channel 0, id 6, lun 0 In quiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 7, scsi0, channel 0, id 6, lun 0 In quiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 8, scsi0, channel 0, id 6, lun 0 In quiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 9, scsi0, channel 0, id 6, lun 0 In quiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 10, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 11, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IOs outstanding = 0 scsi : aborting command due to timeout : pid 12, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IOs outstanding = 0 scsi : aborting command due to timeout : nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 14, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 15, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 16, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting 
command due to timeout : pid 17, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 18, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0 scsi : aborting command due to timeout : pid 19, scsi0, channel 0, id 6, lun 0 I nquiry 00 00 00 ff 00 mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00) IOs outstanding = 0
Commands are timing out during the initial probe of the bus. It seems as though scsi0, channel 0, ids 0 and 1 complete okay, but ids 2 and above fail. The only change in the mptfusion driver between beta 2 and final was that vary_io was enabled. This would not cause a problem during the initial probe of the SCSI bus. I will investigate further.
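To check which targets are actually timing out, the timeout messages in a captured console log can be tallied per host, channel, and id. The following is only a rough sketch: the function name `tally_timeouts` and the `SAMPLE` string are made up for illustration, and the regular expression assumes the message format seen in the log above (it will miss lines that the serial console wrapped or truncated).

```python
import re
from collections import Counter

# Matches the SCSI mid-layer timeout message as it appears in the log above.
TIMEOUT_RE = re.compile(
    r"aborting command due to timeout : "
    r"pid (\d+), scsi(\d+), channel (\d+), id (\d+), lun (\d+)"
)


def tally_timeouts(log_text):
    """Count timeout messages per (host, channel, id) tuple."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        _pid, host, channel, target, _lun = map(int, match.groups())
        counts[(host, channel, target)] += 1
    return counts


# Hypothetical sample, shortened from the messages in the original report.
SAMPLE = (
    "scsi : aborting command due to timeout : pid 2, scsi0, channel 0, id 2, lun 0 "
    "scsi : aborting command due to timeout : pid 3, scsi0, channel 0, id 3, lun 0 "
    "scsi : aborting command due to timeout : pid 6, scsi0, channel 0, id 6, lun 0 "
    "scsi : aborting command due to timeout : pid 7, scsi0, channel 0, id 6, lun 0 "
)

if __name__ == "__main__":
    for key, count in sorted(tally_timeouts(SAMPLE).items()):
        print(key, count)
```

Targets that never appear in the tally (ids 0 and 1 in the sample) are the ones whose probe commands did not time out.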
I am having the same problem installing RHAS3 on a 4-way x445 with 8G RAM. I receive the same errors as the first poster mentioned. I am using kernel 2.4.21-4.ELsmp. I downloaded the source for the kernel and played with some settings to see if I could get it to load. I was able to get it to load with no SCSI errors by setting CONFIG_X86_SUMMIT=n; this was the only change I made to the kernel. This diff is between the default .config and the changed .config:
87,88c87
< CONFIG_X86_SUMMIT=y
< CONFIG_X86_CLUSTERED_APIC=y
---
> # CONFIG_X86_SUMMIT is not set
I thought this might help in tracking down the bug. Even with CONFIG_X86_SUMMIT turned off I am able to see all 4 processors. Is this supposed to happen? If you want more info I can provide greater detail, just email me. Thanks for any help.
I have the same x445 with 8GB / 4 CPUs and the same problem. I tried adding noapic to the kernel append line, but this did not fix the problem.
I have an addition to this problem. I tried installing an IBM ServeRAID 6M card, which uses the ips kernel module instead of mptscsih, into the 4-way box. I was able to get it to boot to a prompt, but was not able to access the box in any way. Every time I tried to log in, a SCSI I/O timeout would flash and then the prompt would immediately say the login timed out. So a different module did not fix the problem. Please note that I did reinstall the O/S onto the ServeRAID 6M hard drives and disabled the LSI cards in the BIOS.
Created attachment 96406 [details] force clustered APIC
The processor IDs on this system are 0, 2, 16, and 18. This may be causing us to use physical APIC mode when we should be using clustered APIC mode. The above patch from Ingo forces clustered APIC mode. Can you apply it and see whether it fixes your boot problem?
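For reference, the APIC IDs can be pulled straight out of the LAPIC lines in the boot log. The sketch below does that and then splits each ID into its high and low nibble. Treating the high nibble as a cluster number is an assumption made here for illustration only; it is not taken from the kernel source or from the patch, and the helper names are hypothetical.

```python
import re

# Matches the ACPI MADT LAPIC lines printed in the boot log, e.g.
# "LAPIC (acpi_id[0x0004] id[0x10] enabled[1])". LAPIC_NMI lines do not match.
LAPIC_RE = re.compile(
    r"LAPIC \(acpi_id\[0x([0-9a-f]+)\] id\[0x([0-9a-f]+)\] enabled\[(\d)\]\)"
)


def lapic_ids(log_text):
    """Return the APIC IDs of all enabled local APICs found in the log."""
    return [
        int(m.group(2), 16)
        for m in LAPIC_RE.finditer(log_text)
        if m.group(3) == "1"
    ]


def split_id(apic_id):
    """Split an 8-bit APIC ID into (high nibble, low nibble).

    ASSUMPTION: the high nibble is read as a cluster number and the low
    nibble as the CPU within that cluster. This is illustrative only.
    """
    return apic_id >> 4, apic_id & 0xF


# The four LAPIC entries from the original report.
SAMPLE = (
    "LAPIC (acpi_id[0x0000] id[0x0] enabled[1]) "
    "LAPIC (acpi_id[0x0001] id[0x2] enabled[1]) "
    "LAPIC (acpi_id[0x0004] id[0x10] enabled[1]) "
    "LAPIC (acpi_id[0x0005] id[0x12] enabled[1]) "
)

if __name__ == "__main__":
    for apic_id in lapic_ids(SAMPLE):
        print(apic_id, split_id(apic_id))
```

Under that assumption, IDs 0 and 2 fall in one group and IDs 16 and 18 in another, which is consistent with the IDs not being a contiguous 0..3 range.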
I did some tests today.
2.4.21-4mx.EL: added this patch
2.4.21-4mx1.ELsmp: edited config to "# CONFIG_X86_SUMMIT is not set"
2.4.21-4mx2.ELsmp: added this patch and the patch from bug 110170
I was not able to boot 2.4.21-4mx.ELsmp; it caused the same problem as before. I attached the dmesgs of the different tries.
Created attachment 96425 [details] dmesg boot with 2.4.21-4mx.EL
Created attachment 96426 [details] dmesg boot with 2.4.21-4mx1.EL
Created attachment 96427 [details] dmesg boot with 2.4.21-4mx2.EL
One mistake on our side: the customer told me that they have 4 CPU modules installed. I had a closer look at the hardware now and only 2 CPU modules are installed. I'm going to examine the other x445 now. But this would mean that the fix from bug 110170 seems to work. But is the low bogomips rate ok?
I have tried both this patch and the patch from bug 110170 that Oliver has linked to, on my 4-way Xeon MP 2.8GHz IBM x445. This patch does not seem to do the trick, as it still fails with SCSI I/O timeouts. The patch from bug 110170 works as far as getting the box up. I am currently testing to see if it performs properly.
I'm seeing the same scsi time out behavior on a rhl 9 machine running kernel 2.4.20-24.9smp. It's got HyperThreading enabled so it shows 4 processors. This is a dual 2.8ghz xeon in a dell poweredge 2600. I can recreate the timeouts by running tiobench on the disks. I'm going to try disabling HT and see if it has any effect on the scsi timeouts. The scsi card giving the error is: LSI Logic / Symbios Logic 53c1030 (rev 07) using: mptscsih There is also an Adaptec AHA-3960D / AIC-7899A U160/m (rev 01) using: aic7xxx in the system too. Think this is related?
I have the same problem: xSeries 445, 2-way, 2.5GHz. 3 out of 4 boot attempts fail. Same APIC and SCSI errors as the original submitter. I've tried with Hyper-Threading on and off and see the same behavior. Without Hyper-Threading the CPU IDs are 0 and 1; with Hyper-Threading the CPU IDs are 0, 1, 16, and 17 (on CPUs 0, 1, 2, and 3). Not mentioned in earlier comments, but I am also seeing the following when running with Hyper-Threading turned on:
kernel: ACPI tables and CPU MSR values mismatch about cpu number
kernel: CPU: Physical Processor ID: 8
Same issue here. x445, 4 processors with Hyper-Threading enabled (therefore 8 procs), 16GB memory, and an IBM ServeRAID card. The AS3 installer works fine, but upon reboot with kernel, kernel-smp, or kernel-hugemem, there are I/O errors from an mpt module, the screen goes blank, and that's it. I was able to boot the machine into single user mode once, but I think that may just have been a fluke. With AS2.1 and an x440, I need the summit kernel, which doesn't seem to exist in the AS3 distro.
I am also having the same problem on an x445.
For felicity - RHEL 3 - we no longer have the stand alone summit kernel - it is integrated.
Jason, about comment 15. To clarify: the patch from bug 110170 allows your box to boot and run properly, and you do not see the SCSI timeout errors with this patch. This looks to be the same problem. If this patch fixed your problem, can you attach your boot log? I would like to make sure things look ok.
Created attachment 97137 [details] dmesg output 2.4.21-4.ELcnnhugemem (contains patch from bug 110170) The patch from bug 110170 does fix the SCSI timeout errors I was seeing.
Ingo's patch (id=96406) is a good idea regardless. Summit boxes, especially the x440 and x445, need to use clustered APIC mode when they have more than 2 CPUs. The "(num_processors > FLAT_APIC_CPU_MAX)" part of the test was a bad idea, given that FLAT_APIC_CPU_MAX is defined to be 8.
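To make the point about the CPU-count clause concrete, here is a tiny model of only the quoted part of the test. It is not the kernel code: the function name is hypothetical, the rest of the real condition is not shown in this bug, and only the value 8 for FLAT_APIC_CPU_MAX comes from the comment above.

```python
# Value stated in the comment above.
FLAT_APIC_CPU_MAX = 8


def count_clause_selects_clustered(num_processors):
    """Model of the quoted clause only: (num_processors > FLAT_APIC_CPU_MAX).

    ASSUMPTION: this ignores every other part of the real kernel test.
    """
    return num_processors > FLAT_APIC_CPU_MAX


if __name__ == "__main__":
    for cpus in (2, 4, 8, 16):
        print(cpus, count_clause_selects_clustered(cpus))
```

With 4 CPUs, as on the x445 in this report, the clause evaluates to false, so a decision based on CPU count alone would never pick clustered mode for this box.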
I left out the force clustered APIC patch and I have been doing some testing, so far it has been less than dazzling. What does the clustered APIC give over the local APIC? I will re-patch my kernel to include the forceapic patch (id=96406) and see if that changes anything.
> "What does the clustered APIC give over the local APIC?" Functionality. Summit boxes only work in clustered APIC mode (unless you're using PIC emulation, which is a whole different bug ;^). Well, you can get by with flat mode if you only have 1 or 2 CPUs, but that's hardly cost effective. Clustered APIC mode changes how the APICs (both I/O and local) address interrupts. It is used for larger systems that may grow above 8 CPUs.
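As a rough illustration of why flat mode tops out at 8 CPUs, the sketch below models the two xAPIC logical destination formats: in the flat model the 8-bit logical ID is a bitmap with one bit per CPU, while in the cluster model the high nibble is a cluster address and the low nibble is a 4-bit bitmap within that cluster. The function names are hypothetical and this does not reproduce how the RHEL kernel assigns IDs; it only shows the encoding.

```python
def flat_logical_id(cpu):
    """Flat model: one bit per CPU in an 8-bit bitmap, so at most 8 CPUs."""
    if not 0 <= cpu < 8:
        raise ValueError("flat model can only address CPUs 0..7")
    return 1 << cpu


def cluster_logical_id(cluster, cpu_in_cluster):
    """Cluster model: high nibble = cluster address, low nibble = bitmap."""
    if not 0 <= cluster < 15:  # cluster address 0xF is reserved for broadcast
        raise ValueError("cluster must be 0..14")
    if not 0 <= cpu_in_cluster < 4:
        raise ValueError("at most 4 CPUs per cluster")
    return (cluster << 4) | (1 << cpu_in_cluster)


def flat_match(destination, logical_id):
    """A flat-mode destination hits a CPU if any bitmap bit overlaps."""
    return (destination & logical_id) != 0


def cluster_match(destination, logical_id):
    """A cluster-mode destination hits a CPU if the cluster addresses are
    equal and the low-nibble bitmaps overlap (broadcast is not modeled)."""
    same_cluster = (destination >> 4) == (logical_id >> 4)
    return same_cluster and (destination & logical_id & 0xF) != 0


if __name__ == "__main__":
    print(hex(flat_logical_id(3)))
    print(hex(cluster_logical_id(1, 2)))
    print(cluster_match(0x16, cluster_logical_id(1, 2)))
```

The cluster format trades bitmap width for an address field, which is what lets larger systems go past 8 CPUs.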
James and Chris at IBM, are you in agreement with Ingo's patch, and have you tested it? Does it fix this issue?
Bob, The original SCSI timeout issue in this bug is due to bug #110170. The patch contained there has been tested and solves a number of problems that have shown up recently. Note comment #22 for external confirmation. Please pick up that fix. I don't believe Ingo's patch solves this specific issue, however it looks to be a good idea regardless. I have not tested it myself, but will defer to James for further comments.
Same issue: IBM x445, two-way 3GHz, 8GB RAM. Tried AS3.0 kernel 2.4.21-9smp from the beta channel; it does not boot because of the SCSI error. But AS2.1 U3 works fine!
Question for Red Hat. It appears that we have a number of x445 customers running into the SCSI timeout problem. Will there be a kernel update to the RHEL3 U1 kernel (2.4.21-9smp) that will include the patch for bug 110170 discussed above? Or will these customers have to wait for the next RHEL3 update?
FYI. I just received info from a customer (Baaderbank) about what IBM is saying: From: Melanie Kiehnle <KIEHNLE.com> Date: Thu, 29 Jan 2004 10:10:26 +0100 [...] "The required RHEL 3 drivers are now available and I am working with Susan Mcleod on our schedule. My feeling is that we will be prioritizing the x445 for test. I will have dates for you later today or tomorrow" So what is our status? Regards, Daniel
Based on the comments, it's hard to tell if this is a duplicate of bug #110170 or not. 110170 will be addressed in RHEL3 U2. Is there anything above and beyond what's in 110170 needed for this particular issue?
When is U2 available? If the SMP kernel boots with the U2 kernel we will do some further investigation. It took Red Hat 4 months to fix that problem, so we won't guess whether your fix for bug #110170 fixes our problem, too. Hopefully you know what patches you added to the kernel and whether that Cyclone chip fix influences the SCSI timing on x445 SMP boxes. But I think it is the first time a Red Hat person has officially associated both bugs, so we are optimistic about that U2 release.
Did any of the reporters try the solution from bug #110170?
I had some customers complaining about problems with their servers running RHEL3. Their problems included:
- Server hangs during OS boot with the SMP kernel but works with the uniprocessor kernel.
- OS hangs while trying to log out of the X Window GUI mode with the SMP kernel. Works with the uniprocessor kernel.
I have offered them the beta U2 release, and both of these sites have declared that U2 has resolved these problems.
- Samuel Benjamin - IBM
Has anyone seen this issue since Update2? I'm quite confident it was fixed by bug #110170 in Update2 and this bug should be closable.
I am also facing a similar problem. We have an HP 4mm DAT tape drive connected to the system. If the tape drive is not powered on, the system boots without any problem. If it is powered on, the system gets messages like:
scsi - aborting command due to timeout: pid 20 scsi1m channel 0, id 5
mptscsih - Old Abort Scheduling ABORT SCSI IO
SCSI host 1 abort timeout - resetting
SCSI bus is being reset for host 1 channel 0.
We are using RHEL 3.0 Update 3 (AMD64, dual processor). The tape drive is an HP SureStore, part no. C5653C-60023. The system boots cleanly if it is powered off and then on. A reboot always fails. The system hangs at "Checking for New Hardware". -anant athavale - Bangalore
Anant, since this bug is tracking SCSI timeout errors on i386-based IBM x440s, and your system is quite different (x86-64), you might get a better response if you file a new bug. While the symptoms might be similar, I don't believe the cause is directly related.
We have a similar problem which seems to be related to the controller driver.
IBM xSeries 345
Red Hat ES 3.0
LSI Logic / Symbios Logic|53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI
2.4.21-4.ELsmp kernel (stock)
dual Xeon processors
Any fix for this? Has it been resolved? I'm sure it's with all 53c1030 controllers. Lots of posts about problems but no resolutions. We receive the following errors all the time.
scsi : aborting command due to timeout : pid 3485633, scsi0, channel 0, id 0, lun 0 Write (10) 00 02 a5 59 7b 00 00 08 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=c4f9fe00)
IOs outstanding = 31
scsi : aborting command due to timeout : pid 3485637, scsi0, channel 0, id 0, lun 0 Write (10) 00 02 ad 59 93 00 00 08 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=c4f9ea00)
IOs outstanding = 31
scsi : aborting command due to timeout : pid 3485647, scsi0, channel 0, id 0, lun 0 Write (10) 00 01 89 59 ab 00 00 30 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=c4fa0200)
IOs outstanding = 31
SCSI host 0 abort (pid 3485633) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=c4f9fe00)
IOs outstanding = 31
SCSI host 0 abort (pid 3485637) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=c4f9ea00)
IOs outstanding = 31
SCSI host 0 abort (pid 3485647) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=c4fa0200)
IOs outstanding = 31
SCSI Error: (0:0:0) Status=02h (CHECK CONDITION)
Key=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
CDB: 2A 00 03 9F A9 47 00 00 20 00
SCSI Error: (0:2:0) Status=02h (CHECK CONDITION)
Key=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
CDB: 2A 00 00 00 87 57 00 00 08 00
Thanks
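For anyone reading the "CDB:" lines in these errors: opcode 2Ah is a SCSI WRITE(10), whose 10-byte command block carries a big-endian 32-bit logical block address in bytes 2-5 and a 16-bit transfer length in bytes 7-8. The helper below is a small sketch for decoding them; the name `decode_write10` is made up here and it handles only this one opcode.

```python
def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("WRITE(10) CDB must be exactly 10 bytes")
    if cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) opcode")
    return {
        "opcode": cdb[0],
        # Bytes 2..5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: transfer length in blocks, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


if __name__ == "__main__":
    # The two CDBs reported in the errors above.
    for line in ("2A 00 03 9F A9 47 00 00 20 00",
                 "2A 00 00 00 87 57 00 00 08 00"):
        fields = decode_write10(line)
        print(hex(fields["lba"]), fields["blocks"])
```

Note that the "Write (10)" lines in the timeout messages print only the nine bytes after the opcode, so they would need the leading 2A added before being passed to this helper.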
We have a similar problem, but only with kernel-2.4.21-20.EL and later versions. On kernel-2.4.21-15.EL it works fine. (LSI Logic / Symbios Logic|53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI)
I'm closing this as a dup of bug 110170 based on comment #37. If anyone continues to have a problem running RHEL3 U6 (which was released just last week), please file a new bug report. Thanks. *** This bug has been marked as a duplicate of 110170 ***