Description of problem:
We have an Altix 4700 with 48 CPUs and 193 GB of RAM. This machine is supported only in Red Hat 5, but we are obliged to use RHEL 4.

Version-Release number of selected component (if applicable):
Up to Red Hat AS 4 Update 7

How reproducible:
Always, when using the machine

Steps to Reproduce:
1. Always
2.
3.

Actual results:
PCI BRIDGE ERROR: on 0002:00 at <rtc = 0x12dd2fea6d500> [001.21:slot0:slab0:corelet1:bus0]
int_status is 0x4000000, err_status is 0x4000000
Dumping relevant registers for each bit set
26: Incoming response xtalk command word error bit set or invalid sideband
        Bridge Aux Error Command Word Register <RD_PARM=0,DW_DATA_EN=0,TNUM=1,DATASIZE=2,PACTYP=RdResp>
        PCI-X DMA Request Error Addr Reg: 0x100006117c607400
        PCI-X DMA Request Error Attribute Reg: 0x8018900000008
PCI Bridge Error interrupt killed the system
Kernel panic - not syncing: pcibr_error_intr_handler(): Fatal Bridge Error
Badness in panic at kernel/panic.c:118

Call Trace:
 [<a000000100017200>] show_stack+0x80/0xa0
                                sp=e0000060231c7b60 bsp=e0000060231c1328
 [<a000000100017250>] dump_stack+0x30/0x60
                                sp=e0000060231c7d30 bsp=e0000060231c1310
 [<a000000100079660>] panic+0x660/0x6a0
                                sp=e0000060231c7d30 bsp=e0000060231c1290
 [<a000000100490d00>] pcibr_error_intr_handler+0x1e0/0x200
                                sp=e0000060231c7d90 bsp=e0000060231c1250
 [<a0000001000131b0>] handle_IRQ_event+0x90/0x120
                                sp=e0000060231c7e30 bsp=e0000060231c1210
 [<a000000100013cf0>] do_IRQ+0x2d0/0x560
                                sp=e0000060231c7e30 bsp=e0000060231c11a0
 [<a000000100016170>] ia64_handle_irq+0xf0/0x1e0
                                sp=e0000060231c7e30 bsp=e0000060231c1158
 [<a00000010000f620>] ia64_leave_kernel+0x0/0x260
                                sp=e0000060231c7e30 bsp=e0000060231c1158

Expected results:

Additional info:

# uname -a
Linux trueno_ita01.csic.es 2.6.9-70.ELlargesmp #1 SMP Fri May 2 13:15:44 EDT 2008 ia64 ia64 ia64 GNU/Linux

# lspci -vvv
0001:00:01.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)
        Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min, 2500ns max), Cache Line Size 20
        Interrupt: pin A routed to IRQ 60
        Region 0: I/O ports at c000008000701000 [disabled] [size=256]
        Region 1: Memory at c0000081c0700000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at c0000081c0710000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at 0000000000800000 [disabled] [size=4M]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [98] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [68] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=2 OST=4
                Status: Bus=0 Dev=1 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=2, DMOST=6, DMCRS=4, RSCEM-
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
                Vector table: BAR=1 offset=00002000
                PBA: BAR=1 offset=00003000

0001:00:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
        Subsystem: Silicon Graphics, Inc. Dual Port Gigabit Ethernet (IA-blade)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size 20
        Interrupt: pin A routed to IRQ 61
        Region 0: Memory at c0000081c0720000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=2 OST=0
                Status: Bus=0 Dev=2 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=2, DMOST=0, DMCRS=1, RSCEM-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 02efb37134200098  Data: d881

0001:00:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
        Subsystem: Silicon Graphics, Inc. Dual Port Gigabit Ethernet (IA-blade)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size 20
        Interrupt: pin B routed to IRQ 62
        Region 0: Memory at c0000081c0730000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=2 OST=0
                Status: Bus=0 Dev=2 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=2, DMOST=0, DMCRS=1, RSCEM-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 6518285c60b90598  Data: 009f

0001:00:03.0 PCI bridge: IBM PCI-X to PCI-X Bridge (rev 03) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 248, Cache Line Size 20
        Region 0: Memory at c0000081c0c00000 (64-bit, prefetchable) [size=1M]
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=248
        I/O behind bridge: 00002000-00002fff
        Memory behind bridge: 00d00000-00dfffff
        Prefetchable memory behind bridge: 0000000000100000-0000000000000000
        Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
        Capabilities: [80] PCI-X bridge device.
                Secondary Status: 64bit+, 133MHz+, SCD-, USC-, SCO-, SRD- Freq=conv
                Status: Bus=0 Dev=3 Func=0 64bit+ 133MHz+ SCD- USC-, SCO-, SRD-
                : Upstream: Capacity=32, Commitment Limit=32
                : Downstream: Capacity=32, Commitment Limit=32
        Capabilities: [90] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0001:01:01.0 USB Controller: NEC Corporation USB (rev 43) (prog-if 10 [OHCI])
        Subsystem: NEC Corporation Hama USB 2.0 CardBus
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 8 (250ns min, 10500ns max)
        Interrupt: pin A routed to IRQ 63
        Region 0: Memory at c0000081c0d00000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0001:01:01.1 USB Controller: NEC Corporation USB (rev 43) (prog-if 10 [OHCI])
        Subsystem: NEC Corporation Hama USB 2.0 CardBus
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 8 (250ns min, 10500ns max)
        Interrupt: pin B routed to IRQ 64
        Region 0: Memory at c0000081c0d01000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0001:01:01.2 USB Controller: NEC Corporation USB 2.0 (rev 04) (prog-if 20 [EHCI])
        Subsystem: NEC Corporation USB 2.0
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 68 (4000ns min, 8500ns max), Cache Line Size 20
        Interrupt: pin C routed to IRQ 63
        Region 0: Memory at c0000081c0d02000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0001:01:02.0 IDE interface: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host Controller (rev 02) (prog-if 85 [Master SecO PriO])
        Subsystem: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host Controller
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32, Cache Line Size 01
        Interrupt: pin A routed to IRQ 64
        Region 0: I/O ports at c000008000702000 [size=8]
        Region 1: I/O ports at c000008000702008 [size=4]
        Region 2: I/O ports at c000008000702010 [size=8]
        Region 3: I/O ports at c000008000702018 [size=4]
        Region 4: I/O ports at c000008000702020 [size=16]
        Region 5: Memory at c0000081c0d02100 (32-bit, non-prefetchable) [size=256]
        Expansion ROM at 0000000000d80000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 2
                Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-

0002:00:01.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 02)
        Subsystem: LSI Logic / Symbios Logic: Unknown device 30e0
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min, 2500ns max), Cache Line Size 20
        Interrupt: pin A routed to IRQ 65
        Region 0: I/O ports at c000008010701000 [disabled] [size=256]
        Region 1: Memory at c000008180700000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at c000008180710000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at c0000b69ee834800 [disabled] [size=2M]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [98] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [68] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=2 OST=6
                Status: Bus=0 Dev=1 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=2, DMOST=6, DMCRS=4, RSCEM-
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
                Vector table: BAR=1 offset=00002000
                PBA: BAR=1 offset=00003000
Opened http://bugworks.engr.sgi.com/query.cgi/984894
We need an errdmp, and the PROM info might also be useful. Deciphering the PCI BRIDGE ERROR registers shows that the LSI SAS1068 card at 0002:00:01.0 issued a DMA to address 0x000006117c607400 and got a coretalk read request sideband invalid error back from that read. This type of error is usually the result of a memory read that hits an unrecoverable error. Did you get an errdmp? If so, what did the FRU Summary show? If not, can you please get one?
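For anyone re-checking the register values quoted in the report, here is a minimal Python sketch of the arithmetic only. It is not SGI tooling, and the helper name `set_bits` is made up for illustration. It lists which bit is set in int_status and which bits differ between the raw PCI-X DMA Request Error Addr Reg value and the DMA address cited above. What the differing bit means in hardware is not stated in this report, so the sketch makes no claim about it.

```python
def set_bits(value: int) -> list:
    """Return the positions of all 1 bits in value, lowest bit first."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]


# Values copied from the panic output in the report.
INT_STATUS = 0x4000000
ERR_ADDR_REG = 0x100006117C607400

# DMA address as quoted in the analysis comment.
DMA_ADDR = 0x000006117C607400

# Which error bits does the bridge report?
print("int_status bits set:", set_bits(INT_STATUS))

# Which bits separate the raw register value from the quoted address?
print("bits differing from quoted address:", set_bits(ERR_ADDR_REG ^ DMA_ADDR))
```

The first line of output corresponds to the "26:" entry in the register dump. The second shows that the quoted address is the register value with a single high bit cleared.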
Hi, this kernel panic was the result of a hardware memory error (a double-bit ECC error). The machine now works fine. You can close this issue. Thank you very much.
Per comment #4