Bug 108595 - Boot fails with SCSI ABORT IO messages when loading mptscsih module with SMP kernel
Status: CLOSED DUPLICATE of bug 110170
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: high
Severity: high
Assignee: Ingo Molnar
Blocks: 107562
Reported: 2003-10-30 15:11 UTC by Bob Minowicz
Modified: 2007-11-30 22:06 UTC (History)
Doc Type: Bug Fix
Last Closed: 2005-10-04 00:06:37 UTC

Attachments
force clustered APIC (695 bytes, patch)
2003-12-08 21:09 UTC, Tom Coughlan
dmesg boot with 2.4.21-4mx.EL (16.58 KB, text/plain)
2003-12-09 16:27 UTC, Oliver Paukstadt
dmesg boot with 2.4.21-4mx1.EL (24.23 KB, text/plain)
2003-12-09 16:28 UTC, Oliver Paukstadt
dmesg boot with 2.4.21-4mx2.EL (24.15 KB, text/plain)
2003-12-09 16:28 UTC, Oliver Paukstadt
dmesg output 2.4.21-4.ELcnnhugemem (contains patch from bug 110170) (28.81 KB, text/plain)
2004-01-20 19:58 UTC, Jason Tolsma

Description Bob Minowicz 2003-10-30 15:11:04 UTC
Description of problem:

With a fresh installation of RHEL 3.0 (final release) on an IBM xSeries 445 
with four 2.5GHz Xeon CPUs, booting the SMP kernel almost always (but not 
quite 100% of the time) results in a failure to boot.  The problem occurs when 
the mptscsih module is loaded.  A sequence of SCSI ABORT IO messages begins 
and appears to continue indefinitely (or at least to 77438 iterations, which 
was as far as I'd ever let it go).

The problem does not occur with the uni-processor kernel in the same 
release.  Nor does it occur with the SMP kernel in the beta 2 release or 
with a SuSE 8.0 release, all tested on the exact same hardware.

I also notice a few messages about an unexpected IO-APIC in the output 
(included).  I do not know if they are significant.


Version-Release number of selected component (if applicable):
kernel-2.4.21-4.EL

How reproducible:
Not always but nearly so.

Steps to Reproduce:
1. Boot the SMP kernel on an IBM xSeries 445 with 4 CPUs.
    
Actual results:
Boot fails.

Expected results:
Boot succeeds.

Additional info:

Linux version 2.4.21-4.ELsmp (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-20)) #1 SMP Fri Oct 3 17:52:56 EDT 2003
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009c400 (usable)
 BIOS-e820: 000000000009c400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000effa1a00 (usable)
 BIOS-e820: 00000000effa1a00 - 00000000effac340 (ACPI data)
 BIOS-e820: 00000000effac340 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000210000000 (usable)
7552MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 0009c540
hm, page 0009c000 reserved twice.
hm, page 0009d000 reserved twice.
hm, page 0009d000 reserved twice.
hm, page 0009e000 reserved twice.
On node 0 totalpages: 2162688
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 1933312 pages.
ACPI: Searched entire block, no RSDP was found.
ACPI: RSDP located at physical address c00fdfc0
RSD PTR  v0 [IBM   ]
__va_range(0xeffac2c0, 0x68): idx=33 mapped at fffdd000
ACPI table found: RSDT v1 [IBM    SERVIGIL 0.4096]
__va_range(0xeffac240, 0x24): idx=33 mapped at fffdd000
__va_range(0xeffac240, 0x74): idx=33 mapped at fffdd000
ACPI table found: FACP v1 [IBM    SERVIGIL 0.4096]
__va_range(0xeffac180, 0x24): idx=33 mapped at fffdd000
__va_range(0xeffac180, 0x9a): idx=33 mapped at fffdd000
ACPI table found: APIC v1 [IBM    SERVIGIL 0.4096]
__va_range(0xeffac180, 0x9a): idx=33 mapped at fffdd000
LAPIC (acpi_id[0x0000] id[0x0] enabled[1])
CPU 0 (0x0000) enabledProcessor #0 Pentium 4(tm) XEON(tm) APIC version 20

LAPIC (acpi_id[0x0001] id[0x2] enabled[1])
CPU 1 (0x0200) enabledProcessor #2 Pentium 4(tm) XEON(tm) APIC version 20

LAPIC (acpi_id[0x0004] id[0x10] enabled[1])
CPU 2 (0x1000) enabledProcessor #16 Pentium 4(tm) XEON(tm) APIC version 20

LAPIC (acpi_id[0x0005] id[0x12] enabled[1])
CPU 3 (0x1200) enabledProcessor #18 Pentium 4(tm) XEON(tm) APIC version 20

IOAPIC (id[0xe] address[0xfec00000] glob
IOAPIC (id[0xd] address[0xfec01000] glob
INT_SRC_OVR (bus[0] irq[0x8] global_irq[0x8] polarity[0x3] trigger[0x1])
INT_SRC_OVR (bus[0] irq[0xe] global_irq[
INT_SRC_OVR (bus[0] irq[0xb] global_irq[
LAPIC_NMI (acpi_id[0x0000] polarity[0x0]
LAPIC_NMI (acpi_id[0x0001] polarity[0x0]
LAPIC_NMI (acpi_id[0x0004] polarity[0x0] trigger[0x0] lint[0x1])
LAPIC_NMI (acpi_id[0x0005] polarity[0x0] trigger[0x0] lint[0x1])
4 CPUs total
Local APIC address fee00000
__va_range(0xeffac0c0, 0x24): idx=33 mapped at fffdd000
__va_range(0xeffac0c0, 0xc0): idx=33 mapped at fffdd000
ACPI table found: SRAT v1 [IBM    SERVIGIL 0.4096]
__va_range(0xeffa6500, 0x24): idx=33 mapped at fffdd000
__va_range(0xeffa6500, 0x5745): idx=33 mapped at fffdd000
ACPI table found: SSDT v1 [IBM    VIGSSDT0 0.4096]
Enabling the CPU's according to the ACPI table
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: IBM ENSW Product ID: VIGIL SMP    APIC at: 0xFEE00000
I/O APIC #14 Version 17 at 0xFEC00000.
I/O APIC #13 Version 17 at 0xFEC01000.
Processors: 4
xAPIC support is present
Enabling APIC mode: Physical.   Using 2 I/O APICs
IBM machine detected. Enabling interrupts during APM calls.
Kernel command line: ro root=/dev/sda5 console=tty0 console=ttyS0,9600n8
Initializing CPU#0
Summit chipset: Starting Cyclone Counter.
Detected 2494.930 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 198.65 BogoMIPS
Memory: 8251588k/8650752k available (1683k kernel code, 132016k reserved, 1318k data, 224k init, 7470724k highmem)
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode cache hash table entries: 524288 (order: 10, 4194304 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer cache hash table entries: 1048576 (order: 10, 4194304 bytes)
Page-cache hash table entries: 1048576 (order: 10, 4194304 bytes)
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Hyper-Threading is disabled
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch.au)
mtrr: detected mtrr type: Intel
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Hyper-Threading is disabled
Intel machine check reporting enabled on CPU#0.
CPU0: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05
per-CPU timeslice cutoff: 1463.12 usecs.
task migration cache decay timeout: 10 msecs.
enabled ExtINT on CPU#0
Leaving ESR disabled.
Booting processor 1/2 eip 2000
Initializing CPU#1
masked ExtINT on CPU#1
Leaving ESR disabled.
Calibrating delay loop... 0.99 BogoMIPS
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Hyper-Threading is disabled
Intel machine check reporting enabled on CPU#1.
CPU1: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05
Booting processor 2/16 eip 2000
Initializing CPU#2
masked ExtINT on CPU#2
Leaving ESR disabled.
Calibrating delay loop... 1.01 BogoMIPS
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Hyper-Threading is disabled
Intel machine check reporting enabled on CPU#2.
CPU2: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05
Booting processor 3/18 eip 2000
Initializing CPU#3
masked ExtINT on CPU#3
Leaving ESR disabled.
Calibrating delay loop... 1.88 BogoMIPS
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Hyper-Threading is disabled
Intel machine check reporting enabled on
CPU3: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05
Total of 4 processors activated (202.56
apic 0 pin 46 is an SMI pin!
ENABLING IO-APIC IRQs
Setting 14 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 14 ... ok.
Setting 13 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 13 ... ok.
..TIMER: vector=0x31 pin1=0 pin2=-1
testing the IO APIC.......................

An unexpected IO-APIC was found. If this kernel release is less than
three months old please report this to linux-smp.org

An unexpected IO-APIC was found. If this kernel release is less than
three months old please report this to linux-smp.org
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 2494.5906 MHz.
..... host bus clock speed is 99.7835 MHz.
cpu: 0, clocks: 997835, slice: 199567
CPU0<T0:997824,T1:798256,D:1,S:199567,C:997835>
cpu: 1, clocks: 997835, slice: 199567
cpu: 2, clocks: 997835, slice: 199567
cpu: 3, clocks: 997835, slice: 199567
CPU2<T0:997824,T1:399120,D:3,S:199567,C:997835>
CPU3<T0:997824,T1:199552,D:4,S:199567,C:997835>
CPU1<T0:997824,T1:598688,D:2,S:199567,C:997835>
zapping low mappings.
Process timing init...done.
Starting migration thread for cpu 0
Starting migration thread for cpu 1
Starting migration thread for cpu 2
Starting migration thread for cpu 3
PCI: PCI BIOS revision 2.10 entry at 0xfd47d, last bus=11
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Discovered peer bus 01
PCI: Discovered peer bus 02
PCI: Discovered peer bus 05
PCI: Discovered peer bus 07
PCI: Discovered peer bus 09
PCI->APIC IRQ transform: (B0,I3,P0) -> 39
PCI->APIC IRQ transform: (B0,I4,P0) -> 16
PCI->APIC IRQ transform: (B0,I5,P3) -> 18
PCI->APIC IRQ transform: (B0,I5,P3) -> 18
PCI->APIC IRQ transform: (B1,I3,P0) -> 40
PCI->APIC IRQ transform: (B1,I3,P1) -> 41
PCI->APIC IRQ transform: (B1,I4,P0) -> 42
PCI->APIC IRQ transform: (B1,I4,P1) -> 11
PCI->APIC IRQ transform: (B5,I4,P0) -> 71
PCI: Enabling Via external APIC routing
PCI: Via IRQ fixup for 00:05.2, from 11 to 2
PCI: Via IRQ fixup for 00:05.3, from 11 to 2
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
apm: BIOS not found.
Total HugeTLB memory allocated, 0
Starting kswapd
allocated 32 pages and 32 bhs reserved for the highmem bounces
VFS: Disk quotas vdquot_6.5.1
aio_setup: num_physpages = 540672
aio_setup: sizeof(struct page) = 60
Hugetlbfs mounted.
pty: 2048 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ SERIAL_PCI ISAPNP enabled
ttyS0 at 0x03f8 (irq = 4) is a 16550A
Real Time Clock Driver v1.10e
NET4: Frame Diverter 0.46
RAMDISK driver initialized: 256 RAM disks of 8192K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 00:05.1
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:05.1
    ide0: BM-DMA at 0x0700-0x0707, BIOS settings: hda:pio, hdb:pio
    ide1: BM-DMA at 0x0708-0x070f, BIOS settings: hdc:pio, hdd:pio
hda: MATSHITADVD-ROM SR-8177, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide-floppy driver 0.99.newide
ide-floppy driver 0.99.newide
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Initializing Cryptographic API
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 131072 b
TCP: Hash tables configured (established
Linux IP multicast router 0.06 plus PIM-SM
Initializing IPsec netlink socket
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0
Freeing initrd memory: 276k freed
VFS: Mounted root (ext2 filesystem).
Red Hat nash verSCSI subsystem driver Revision: 1.00
sion 3.5.13 starFusion MPT base driveting
Loading scr 2.05.05+
Copyright (c) 1999-2002 LSI Logic Corporation

o module                                     Loading sd_mod.
Loadinmptbase: Initiating ioc0 bringup
g mptbase.o module
ioc0: 53C1030: Capabilities={Initiator}
mptbase: Initiating ioc1 bringup
ioc1: 53C1030: Capabilities={Initiator}
mptbase: 2 MPT adapters found, 2 installed.
Loading mptscsihFusion MPT SCSI Host driver 2.05.05+
.o module
scsi0 : ioc0: LSI53C1030, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=40
scsi1 : ioc1: LSI53C1030, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=41
Starting timer : 0 0
blk: queue f678ae18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
  Vendor: IBM-ESXS  Model: MAP3367NC     FN  Rev: B109
  Type:   Direct-Access                      ANSI SCSI revision: 03
Starting timer : 0 0
blk: queue f678ac18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
  Vendor: IBM-ESXS  Model: MAP3367NC     FN  Rev: B109
  Type:   Direct-Access                      ANSI SCSI revision: 03
Starting timer : 0 0
blk: queue f678aa18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
scsi : aborting command due to timeout : pid 2, scsi0, channel 0, id 2, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f678a800)
  IOs outstanding = 1
scsi : aborting command due to timeout : pid 3, scsi0, channel 0, id 3, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 2
SCSI host 0 abort (pid 3) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=f66ffc00)
  IOs outstanding = 1
scsi : aborting command due to timeout : pid 4, scsi0, channel 0, id 4, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 1
scsi : aborting command due to timeout : pid 5, scsi0, channel 0, id 5, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 2
SCSI host 0 abort (pid 5) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=f66ffc00)
  IOs outstanding = 1
mptbase: Initiating ioc0 recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: ERROR - Wait IOC_READY state timeout(1500)!
mptbase: ioc0: ERROR - Failed to come READY after reset!
mptbase: ioc0 NOT READY WARNING!
mptbase: WARNING - (-1) Cannot recover ioc0
 Firmware Reload FAILED!!
scsi : aborting command due to timeout : pid 6, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 7, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 8, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 9, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 10, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 11, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 12, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI
  IOs outstanding = 0
scsi : aborting command due to timeout :
nquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 14, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 15, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 16, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 17, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 18, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0
scsi : aborting command due to timeout : pid 19, scsi0, channel 0, id 6, lun 0 Inquiry 00 00 00 ff 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f66ffc00)
  IOs outstanding = 0

Comment 2 Tom Coughlan 2003-10-31 19:47:58 UTC
Commands are timing out during the initial probe of the bus. It seems as though
scsi0, channel 0, ids 0 and 1 complete okay, but ids 2 and above fail. 
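
One way to check which target ids are timing out is to tally the abort messages in a captured console log. The following is only a minimal Python sketch, assuming the messages have the form shown in the description; the name tally_timeouts and the sample text are illustrative and not part of any Red Hat tool. The pattern stops at "lun N", so it should also match log lines where the trailing command bytes were wrapped onto the next line.

```python
import re
from collections import Counter

# Matches messages such as:
#   scsi : aborting command due to timeout : pid 2, scsi0, channel 0, id 2, lun 0
ABORT_RE = re.compile(
    r"aborting command due to timeout : pid (\d+), scsi(\d+), "
    r"channel (\d+), id (\d+), lun (\d+)"
)

def tally_timeouts(log_text):
    """Count timed-out commands per (host, channel, target id)."""
    counts = Counter()
    for match in ABORT_RE.finditer(log_text):
        pid, host, channel, target, lun = map(int, match.groups())
        counts[(host, channel, target)] += 1
    return counts

# Illustrative sample in the same format as the log above.
sample = """\
scsi : aborting command due to timeout : pid 2, scsi0, channel 0, id 2, lun 0
scsi : aborting command due to timeout : pid 3, scsi0, channel 0, id 3, lun 0
scsi : aborting command due to timeout : pid 6, scsi0, channel 0, id 6, lun 0
scsi : aborting command due to timeout : pid 7, scsi0, channel 0, id 6, lun 0
"""

for key, n in sorted(tally_timeouts(sample).items()):
    print("host %d channel %d id %d: %d timeout(s)" % (key + (n,)))
```

Target ids that never appear in the tally are the ones whose probe commands did not hit the abort path.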

The only change in the mptfusion driver between beta 2 and final was that
vary_io was enabled. This would not cause a problem during the initial probe of
the SCSI bus.

I will investigate further.

Comment 3 Jason Tolsma 2003-11-04 20:29:18 UTC
I am having the same problem installing RHAS3 on a 4way x445 with 8G
RAM.  I receive the same errors as the first poster mentioned.

I am using kernel : 2.4.21-4.ELsmp

I downloaded the source for the kernel and played with some settings
to see if I could get it to load.  I was able to get it to load with
no SCSI errors by setting CONFIG_X86_SUMMIT=n; this was the only
change I made to the kernel.

This diff is between the default .config and the changed .config:

87,88c87
< CONFIG_X86_SUMMIT=y
< CONFIG_X86_CLUSTERED_APIC=y
---
> # CONFIG_X86_SUMMIT is not set

I thought this might help in tracking down the bug.

Even with CONFIG_X86_SUMMIT turned off I am able to see all 4
processors, is this supposed to happen?

If you want more info I can specify greater details, just email me.

Thanks for any help.
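
For anyone repeating this test, the state of the two options in the diff above can be read from a .config with a few lines of script. This is a minimal Python sketch; config_value is a hypothetical helper, and the two sample strings simply mirror the lines of the diff rather than a full kernel configuration.

```python
def config_value(config_text, option):
    """Return the option's value ('y', 'm', ...) or None if it is not set."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == "# %s is not set" % option:
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None  # option does not appear at all

# Sample fragments taken from the diff above (not complete .config files).
default_config = "CONFIG_X86_SUMMIT=y\nCONFIG_X86_CLUSTERED_APIC=y\n"
changed_config = "# CONFIG_X86_SUMMIT is not set\n"

for name, text in (("default", default_config), ("changed", changed_config)):
    print(name,
          "SUMMIT=%s" % config_value(text, "CONFIG_X86_SUMMIT"),
          "CLUSTERED_APIC=%s" % config_value(text, "CONFIG_X86_CLUSTERED_APIC"))
```

Note that in the diff, CONFIG_X86_CLUSTERED_APIC disappears together with CONFIG_X86_SUMMIT, so the sketch reports both as unset for the changed configuration.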

Comment 5 Oliver Paukstadt 2003-12-03 15:34:26 UTC
I have the same x445 with 8GB / 4CPUs and the same problem.

I tried it with adding noapic to the kernel append line, but this did
not fix the problem. 



Comment 6 Jason Tolsma 2003-12-03 15:45:01 UTC
I have an addition to this problem.
I tried installing an IBM ServeRaid 6M card, which uses the ips kernel
module instead of mptscsih, into the 4-way box.  I was able to get it to
boot to a prompt, but was not able to access the box in any way. 
Every time I tried to log in, a SCSI I/O Timeout would flash and then the
prompt would say login timed out immediately.  So a different module
did not fix the problem.
Please note that I did reinstall the O/S onto the ServeRaid 6M hard
drives and disabled the LSI cards in the BIOS.

Comment 8 Tom Coughlan 2003-12-08 21:09:19 UTC
Created attachment 96406
force clustered APIC

Comment 9 Tom Coughlan 2003-12-08 21:10:01 UTC
The processor IDs on this system are 0, 2, 16, and 18. This may be
causing us to use physical APIC mode when we should be using clustered
APIC mode.

The above patch from Ingo forces clustered APIC mode. Can you apply it
and see whether it fixes your boot problem? 
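
To illustrate why these IDs stand out: the boot log reports them as 0x0, 0x2, 0x10, and 0x12, so they are not contiguous. A minimal Python sketch that splits each ID into its two nibbles follows. Reading the high nibble as a cluster or node number is an assumption made here for illustration only; nothing in this report confirms how the Summit chipset assigns the bits.

```python
APIC_IDS = [0, 2, 16, 18]  # processor IDs reported in the boot log

def split_apic_id(apic_id):
    """Split an 8-bit APIC ID into (high nibble, low nibble)."""
    return (apic_id >> 4) & 0xF, apic_id & 0xF

for apic_id in APIC_IDS:
    high, low = split_apic_id(apic_id)
    print("APIC ID %2d = 0x%02x -> high nibble %d, low nibble %d"
          % (apic_id, apic_id, high, low))
```

Under that assumed reading, the four CPUs fall into two groups (high nibble 0 and high nibble 1) with two CPUs each.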

Comment 10 Oliver Paukstadt 2003-12-09 16:25:05 UTC
I did some tests today.

2.4.21-4mx.EL      added this patch
2.4.21-4mx1.ELsmp  edit config to # CONFIG_X86_SUMMIT is not set
2.4.21-4mx2.ELsmp  added this patch and patch 110170

I was not able to boot 2.4.21-4mx.ELsmp; it showed the same problem
as before.

I attached the dmesgs of the different tries.

Comment 11 Oliver Paukstadt 2003-12-09 16:27:37 UTC
Created attachment 96425
dmesg boot with 2.4.21-4mx.EL

Comment 12 Oliver Paukstadt 2003-12-09 16:28:17 UTC
Created attachment 96426
dmesg boot with 2.4.21-4mx1.EL

Comment 13 Oliver Paukstadt 2003-12-09 16:28:49 UTC
Created attachment 96427
dmesg boot with 2.4.21-4mx2.EL

Comment 14 Oliver Paukstadt 2003-12-11 10:06:18 UTC
One mistake on our side: the customer told me that they have 4 CPU modules
installed, but I had a closer look at the hardware now and only 2 CPU
modules are installed. I'm going to examine the other x445 now.

But this would mean that the fix from bug 110170 seems to work.
But is the low BogoMIPS rate OK?



Comment 15 Jason Tolsma 2003-12-12 17:10:04 UTC
I have tried both this patch and the patch from bug 110170 that Oliver
has linked to, on my 4-way Xeon MP 2.8GHz IBM x445.

This patch does not seem to do the trick as it still fails with SCSI
I/O timeouts.  The patch from bug 110170 works as far as getting the
box up.  I am currently testing to see if it performs properly.

Comment 16 Seth Vidal 2004-01-06 19:51:37 UTC
I'm seeing the same scsi time out behavior on a rhl 9 machine running
kernel 2.4.20-24.9smp. It's got HyperThreading enabled so it shows 4
processors. This is a dual 2.8ghz xeon in a dell poweredge 2600.

I can recreate the timeouts by running tiobench on the disks. I'm
going to try disabling HT and see if it has any effect on the scsi
timeouts.

The scsi card giving the error is:
LSI Logic / Symbios Logic 53c1030 (rev 07)
using: mptscsih

There is also an Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
using: aic7xxx in the system too.

Think this is related?

Comment 17 Jim Richard 2004-01-10 01:19:37 UTC
I have the same problem.... xSeries 445 2way 2.5GHz. 3 out of 4 boot 
attempts fail. Same APIC and SCSI errors as the original submitter.
I've tried with Hyper-Threading on and off and see the same behavior.
Without Hyper-Threading, CPU IDs are 0 & 1. With Hyper-Threading, CPU 
IDs are 0, 1, 16, & 17 (on CPU 0, 1, 2, & 3).

Not mentioned in earlier comments, but I am also seeing the 
following when running with Hyper-Threading turned on:

 kernel: ACPI tables and CPU MSR values mismatch about cpu number
 kernel: CPU: Physical Processor ID: 8

Comment 18 Theo Van Dinter 2004-01-15 22:29:32 UTC
Same issue here.

445, 4x processors w/ hyperthreading enabled (8 procs therefore), 16GB memory, 
and an IBM serveraid card.

The AS3 installer works fine, but upon reboot w/ kernel, kernel-smp, or
kernel-hugemem, there are I/O errors from an mpt module, and the screen goes
blank and that's it.  I was able to boot the machine into single user mode
once, but I think that may just have been a fluke.

With AS2.1 and a x440, I need the summit kernel, which doesn't seem to exist in
the AS3 distro.

Comment 19 Matt Barnes 2004-01-20 12:49:54 UTC
I am also having the same problem on an x445.

Comment 20 Bob Johnson 2004-01-20 16:06:11 UTC
For felicity: in RHEL 3 we no longer have the standalone
summit kernel; it is integrated. 

Comment 21 keith mannth 2004-01-20 18:11:31 UTC
  Jason,
  About comment 15.  To clarify: the patch from bug 110170 allows your
box to boot and run properly, and you do not see the SCSI timeout errors
with this patch?
   This looks to be the same problem.  If this patch fixed
your problem, can you attach your boot log?  I would like to make sure
things look OK.

Comment 22 Jason Tolsma 2004-01-20 19:58:59 UTC
Created attachment 97137
dmesg output 2.4.21-4.ELcnnhugemem (contains patch from bug 110170)

The patch from bug 110170 does fix the SCSI timeout errors I was seeing.

Comment 23 James Cleverdon 2004-01-21 01:57:29 UTC
Ingo's patch (id=96406) is a good idea regardless. Summit boxes,
especially the x440 and x445, need to use clustered APIC mode when
they have more than 2 CPUs.  The "(num_processors >
FLAT_APIC_CPU_MAX)" part of the test was a bad idea, given that
FLAT_APIC_CPU_MAX is defined to be 8.
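
The point about the processor-count test can be made concrete with a small model. This is a Python sketch, not the kernel's code: FLAT_APIC_CPU_MAX = 8 is the value quoted above, while both function names and the ID-based rule are hypothetical, chosen only to show why a count-only check never selects clustered mode on a 4-CPU box.

```python
FLAT_APIC_CPU_MAX = 8  # value quoted in the comment above

def clustered_by_count(num_processors):
    """Model of the count-only test: clustered mode only above 8 CPUs."""
    return num_processors > FLAT_APIC_CPU_MAX

def clustered_by_ids(apic_ids):
    """Hypothetical alternative: also choose clustered mode when any
    APIC ID falls outside the range 0 .. FLAT_APIC_CPU_MAX - 1."""
    return (len(apic_ids) > FLAT_APIC_CPU_MAX
            or any(i >= FLAT_APIC_CPU_MAX for i in apic_ids))

ids = [0, 2, 16, 18]  # APIC IDs from the original boot log
print("count-only test selects clustered:", clustered_by_count(len(ids)))
print("ID-aware test selects clustered:  ", clustered_by_ids(ids))
```

With four processors the count-only model returns False, which matches the "Enabling APIC mode: Physical" line in the original boot log; the attached patch sidesteps the question by forcing clustered mode.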


Comment 24 Jason Tolsma 2004-01-21 14:11:39 UTC
I left out the force clustered APIC patch and I have been doing some
testing, so far it has been less than dazzling.  What does the
clustered APIC give over the local APIC?
I will re-patch my kernel to include the forceapic patch (id=96406)
and see if that changes anything.

Comment 26 James Cleverdon 2004-01-22 04:20:23 UTC
> "What does the clustered APIC give over the local APIC?"
Functionality. Summit boxes only work in clustered APIC mode (unless
you're using PIC emulation, which is a whole different bug ;^). Well,
you can get by with flat mode if you only have 1 or 2 CPUs, but that's
hardly cost effective.

Clustered APIC mode changes how the APICs (both I/O and local) address
interrupts. It is used for larger systems that may grow above 8 CPUs.

Comment 27 Bob Johnson 2004-01-22 14:54:01 UTC
James and Chris at IBM,
Are you in agreement with Ingo's patch and have you tested it ?
Does it fix this issue ?

Comment 28 john stultz 2004-01-22 18:29:25 UTC
Bob,  
	The original SCSI timeout issue in this bug is due to bug 
#110170. The patch contained there has been tested and solves a 
number of problems that have shown up recently. Note comment #22 for 
external confirmation. Please pick up that fix.  
 
I don't believe Ingo's patch solves this specific issue, however it 
looks to be a good idea regardless. I have not tested it myself, but 
will defer to James for further comments.  

Comment 29 John Birck 2004-01-28 14:55:55 UTC
Same issue, IBMx445 two way 3GHz, 8GB RAM, tried AS3.0 Kernel
2.4.21-9smp from beta-channel, does not boot because of scsi error.
But AS2.1 u3 works fine!

Comment 30 Chris McDermott 2004-01-28 16:55:15 UTC
Question for Red Hat. It appears that we have a number of x445
customers running into the SCSI timeout problem. Will there be a
kernel update to the RHEL3 U1 kernel (2.4.21-9smp) that will include
the patch for bug 110170 discussed above? Or will these customers have
to wait for the next RHEL3 update?

Comment 32 Daniel Riek 2004-02-02 10:39:49 UTC
FYI. I just received info from a customer (Baaderbank) what IBM is saying:

From: Melanie Kiehnle <KIEHNLE.com>
Date: Thu, 29 Jan 2004 10:10:26 +0100

[...]"The required RHEL 3 drivers are now available and I am working
with Susan Mcleod on our schedule.  My feeling is that we will be
prioritizing the x445 for test.  I will have dates for you later today or
tomorrow"

So what is our status?

Regards, Daniel

Comment 33 Tim Burke 2004-02-26 17:02:59 UTC
Based on the comments, it's hard to tell if this is a duplicate of bug
#110170 or not.  110170 will be addressed in RHEL3 U2.  Is there
anything above-and-beyond what's in 110170 needed for this particular
issue?

Comment 34 Oliver Paukstadt 2004-02-26 22:11:28 UTC
When is U2 available?

If the SMP kernel boots with the U2 kernel we will do some further
investigations.

It took RedHat 4 months to fix that problem, so we won't make any
guesses as to whether your fix for Bug #110170 fixes our problem, too.

Hopefully you know what patches you added to the kernel and whether that
Cyclone chip fix has any influence on the SCSI timing on x445 SMP boxes.

But I think it is the first time a RedHat person officially associates
both bugs, so we are optimistic about that U2 release.


Comment 35 Daniel Riek 2004-04-13 15:11:20 UTC
Did any of the reporters try the solution from Bug #110170? 

Comment 36 Samuel Benjamin 2004-05-21 17:50:07 UTC
I had some customers complaining about problems with their servers 
running RHEL3. Their problems included :
- Server hangs during OS boot with smp kernel but works with uni proc.
- OS hangs while trying to log out of the XWindow GUI mode with smp 
kernel. Works with uni proc kernel.

I have offered them the beta U2 release and both of these sites 
have declared that U2 has resolved these problems. 

- Samuel Benjamin - IBM

Comment 37 john stultz 2004-11-24 21:30:00 UTC
Has anyone seen this issue since Update2? I'm quite confident it was 
fixed by bug #110170 in Update2 and this bug should be closable. 

Comment 38 Anant Athavale 2004-12-13 07:08:58 UTC
I am also facing a similar problem.  We have a HP 4mm DAT tape drive
connected to the system.  If the tape drive is not powered ON, the
system boots without any problem.  If it is in Power ON condition, the
system gets messages like

scsi - aborting command due to timeout: pid 20 scsi1m channel 0, id 5
mptscsih - Old Abort Scheduling ABORT SCSI IO
SCSI host 1 abort timeout - resetting
SCSI bus is being reset for host 1 channel 0.

We are using RHEL 3.0 Update 3 (AMD 64-bit dual processor).

The tape drive is HP SureStore partNo. C5653C-60023.

The system boots cleanly, if the system is Powered OFF and then ON. 
The reboot always fails.

The system hangs at "Checking for New Hardware"

-anant athavale - Bangalore



Comment 39 john stultz 2005-01-10 18:50:53 UTC
Anant, 
    Since this bug is tracking SCSI timeout errors on i386 based IBM
x440s, and your system is quite different (x86-64), you might get a
better response if you file a new bug. While the symptoms might be
similar, I don't believe the cause is directly related.

Comment 40 Dan Slowik 2005-01-27 20:08:04 UTC
We have a similar problem which seems to be related to the controller
driver.

IBM xseries 345 
redhat es 3.0
LSI Logic / Symbios Logic|53c1030  PCI-X Fusion-MPT Dual Ultra320 SCSI
2.4.21-4.ELsmp kernel (stock)
dual Xeon processors

Any fix for this?  Has it been resolved?  I'm sure it's with all 53c1030
controllers.  Lots of posts about problems but no resolutions.

We receive the following errors all the time.

scsi : aborting command due to timeout : pid 3485633, scsi0, channel
0, id 0, lun 0 Write (10) 00 02 a5 59 7b 00 00 08 00 
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=c4f9fe00)
  IOs outstanding = 31
scsi : aborting command due to timeout : pid 3485637, scsi0, channel
0, id 0, lun 0 Write (10) 00 02 ad 59 93 00 00 08 00 
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=c4f9ea00)
  IOs outstanding = 31
scsi : aborting command due to timeout : pid 3485647, scsi0, channel
0, id 0, lun 0 Write (10) 00 01 89 59 ab 00 00 30 00 
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=c4fa0200)
  IOs outstanding = 31
SCSI host 0 abort (pid 3485633) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=c4f9fe00)
  IOs outstanding = 31
SCSI host 0 abort (pid 3485637) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=c4f9ea00)
  IOs outstanding = 31
SCSI host 0 abort (pid 3485647) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=c4fa0200)
  IOs outstanding = 31
SCSI Error: (0:0:0) Status=02h (CHECK CONDITION)
 Key=6h (UNIT ATTENTION); FRU=00h
 ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
 CDB: 2A 00 03 9F A9 47 00 00 20 00

SCSI Error: (0:2:0) Status=02h (CHECK CONDITION)
 Key=6h (UNIT ATTENTION); FRU=00h
 ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
 CDB: 2A 00 00 00 87 57 00 00 08 00


Thanks

Comment 41 ilja lunev 2005-05-06 09:33:24 UTC
We have a similar problem, but only with kernel-2.4.21-20.EL and later versions.
On kernel-2.4.21-15.EL it works fine (LSI Logic / Symbios Logic|53c1030 
PCI-X Fusion-MPT Dual Ultra320 SCSI).

Comment 42 Ernie Petrides 2005-10-04 00:06:37 UTC
I'm closing this as a dup of bug 110170 based on comment #37.

If anyone continues to have a problem running RHEL3 U6 (which was
released just last week), please file a new bug report.  Thanks.


*** This bug has been marked as a duplicate of 110170 ***

