Bug 456407

Summary: kernel panic on Altix 4700 System
Product: Red Hat Enterprise Linux 4 Reporter: Daniel Basabe <dbasabe>
Component: kernel Assignee: George Beshers <gbeshers>
Status: CLOSED NOTABUG QA Contact: Martin Jenner <mjenner>
Severity: urgent Docs Contact:
Priority: low    
Version: 4.7 CC: jh, jolim, luyu, prarit, tee, vgoyal
Target Milestone: rc Keywords: Tracking
Target Release: ---   
Hardware: ia64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-08-19 17:28:30 UTC Type: ---

Description Daniel Basabe 2008-07-23 12:23:13 UTC
Description of problem:
We have an Altix 4700 with 48 CPUs and 193 GB of RAM. This machine is supported
only on RHEL 5, but we are obliged to use RHEL 4.



Version-Release number of selected component (if applicable):
Red Hat AS 4, up to Update 7

How reproducible:
Whenever the machine is in use

Steps to Reproduce:
1. Always
  
Actual results:

PCI BRIDGE ERROR: on 0002:00 at <rtc = 0x12dd2fea6d500> [001.21:slot0:slab0:corelet1:bus0]
    int_status is 0x4000000, err_status is 0x4000000
    Dumping relevant registers for each bit set
	26: Incoming response xtalk command word error bit set or invalid sideband
	    Bridge Aux Error Command Word Register <RD_PARM=0,DW_DATA_EN=0,TNUM=1,DATASIZE=2,PACTYP=RdResp>
	    PCI-X DMA Request Error Addr Reg: 0x100006117c607400
	    PCI-X DMA Request Error Attribute Reg: 0x8018900000008
PCI Bridge Error interrupt killed the system
Kernel panic - not syncing: pcibr_error_intr_handler(): Fatal Bridge Error
 Badness in panic at kernel/panic.c:118

Call Trace:
 [<a000000100017200>] show_stack+0x80/0xa0
                                sp=e0000060231c7b60 bsp=e0000060231c1328
 [<a000000100017250>] dump_stack+0x30/0x60
                                sp=e0000060231c7d30 bsp=e0000060231c1310
 [<a000000100079660>] panic+0x660/0x6a0
                                sp=e0000060231c7d30 bsp=e0000060231c1290
 [<a000000100490d00>] pcibr_error_intr_handler+0x1e0/0x200
                                sp=e0000060231c7d90 bsp=e0000060231c1250
 [<a0000001000131b0>] handle_IRQ_event+0x90/0x120
                                sp=e0000060231c7e30 bsp=e0000060231c1210
 [<a000000100013cf0>] do_IRQ+0x2d0/0x560
                                sp=e0000060231c7e30 bsp=e0000060231c11a0
 [<a000000100016170>] ia64_handle_irq+0xf0/0x1e0
                                sp=e0000060231c7e30 bsp=e0000060231c1158
 [<a00000010000f620>] ia64_leave_kernel+0x0/0x260
                                sp=e0000060231c7e30 bsp=e0000060231c1158



Expected results:


Additional info:

# uname -a
Linux trueno_ita01.csic.es 2.6.9-70.ELlargesmp #1 SMP Fri May 2 13:15:44 EDT 2008 ia64 ia64 ia64 GNU/Linux


# lspci -vvv
0001:00:01.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X
Fusion-MPT SAS (rev 01)
        Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min, 2500ns max), Cache Line Size 20
        Interrupt: pin A routed to IRQ 60
        Region 0: I/O ports at c000008000701000 [disabled] [size=256]
        Region 1: Memory at c0000081c0700000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at c0000081c0710000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at 0000000000800000 [disabled] [size=4M]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [98] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [68] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=2 OST=4
                Status: Bus=0 Dev=1 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
DMMRBC=2, DMOST=6, DMCRS=4, RSCEM-
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
                Vector table: BAR=1 offset=00002000
                PBA: BAR=1 offset=00003000

0001:00:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
        Subsystem: Silicon Graphics, Inc. Dual Port Gigabit Ethernet (IA-blade)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size 20
        Interrupt: pin A routed to IRQ 61
        Region 0: Memory at c0000081c0720000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=2 OST=0
                Status: Bus=0 Dev=2 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
DMMRBC=2, DMOST=0, DMCRS=1, RSCEM-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 02efb37134200098  Data: d881

0001:00:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
        Subsystem: Silicon Graphics, Inc. Dual Port Gigabit Ethernet (IA-blade)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 64 (16000ns min), Cache Line Size 20
	Interrupt: pin B routed to IRQ 62
	Region 0: Memory at c0000081c0730000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [40] PCI-X non-bridge device.
		Command: DPERE- ERO- RBC=2 OST=0
		Status: Bus=0 Dev=2 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=2,
DMOST=0, DMCRS=1, RSCEM-
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
		Address: 6518285c60b90598  Data: 009f

0001:00:03.0 PCI bridge: IBM PCI-X to PCI-X Bridge (rev 03) (prog-if 00 [Normal
decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 248, Cache Line Size 20
	Region 0: Memory at c0000081c0c00000 (64-bit, prefetchable) [size=1M]
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=248
	I/O behind bridge: 00002000-00002fff
	Memory behind bridge: 00d00000-00dfffff
	Prefetchable memory behind bridge: 0000000000100000-0000000000000000
	Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
	Capabilities: [80] PCI-X bridge device.
		Secondary Status: 64bit+, 133MHz+, SCD-, USC-, SCO-, SRD- Freq=conv
		Status: Bus=0 Dev=3 Func=0 64bit+ 133MHz+ SCD- USC-, SCO-, SRD-
		: Upstream: Capacity=32, Commitment Limit=32
		: Downstream: Capacity=32, Commitment Limit=32
	Capabilities: [90] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0001:01:01.0 USB Controller: NEC Corporation USB (rev 43) (prog-if 10 [OHCI])
	Subsystem: NEC Corporation Hama USB 2.0 CardBus
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 8 (250ns min, 10500ns max)
	Interrupt: pin A routed to IRQ 63
	Region 0: Memory at c0000081c0d00000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [40] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0001:01:01.1 USB Controller: NEC Corporation USB (rev 43) (prog-if 10 [OHCI])
	Subsystem: NEC Corporation Hama USB 2.0 CardBus
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 8 (250ns min, 10500ns max)
	Interrupt: pin B routed to IRQ 64
	Region 0: Memory at c0000081c0d01000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [40] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0001:01:01.2 USB Controller: NEC Corporation USB 2.0 (rev 04) (prog-if 20 [EHCI])
	Subsystem: NEC Corporation USB 2.0
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 68 (4000ns min, 8500ns max), Cache Line Size 20
	Interrupt: pin C routed to IRQ 63
	Region 0: Memory at c0000081c0d02000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [40] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0001:01:02.0 IDE interface: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host
Controller (rev 02) (prog-if 85 [Master SecO PriO])
	Subsystem: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host Controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 32, Cache Line Size 01
	Interrupt: pin A routed to IRQ 64
	Region 0: I/O ports at c000008000702000 [size=8]
	Region 1: I/O ports at c000008000702008 [size=4]
	Region 2: I/O ports at c000008000702010 [size=8]
	Region 3: I/O ports at c000008000702018 [size=4]
	Region 4: I/O ports at c000008000702020 [size=16]
	Region 5: Memory at c0000081c0d02100 (32-bit, non-prefetchable) [size=256]
	Expansion ROM at 0000000000d80000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 2
		Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=2 PME-

0002:00:01.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X
Fusion-MPT SAS (rev 02)
	Subsystem: LSI Logic / Symbios Logic: Unknown device 30e0
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 64 (16000ns min, 2500ns max), Cache Line Size 20
	Interrupt: pin A routed to IRQ 65
	Region 0: I/O ports at c000008010701000 [disabled] [size=256]
	Region 1: Memory at c000008180700000 (64-bit, non-prefetchable) [size=16K]
	Region 3: Memory at c000008180710000 (64-bit, non-prefetchable) [size=64K]
	Expansion ROM at c0000b69ee834800 [disabled] [size=2M]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [98] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
		Address: 0000000000000000  Data: 0000
	Capabilities: [68] PCI-X non-bridge device.
		Command: DPERE- ERO- RBC=2 OST=6
		Status: Bus=0 Dev=1 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=2,
DMOST=6, DMCRS=4, RSCEM-
	Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
		Vector table: BAR=1 offset=00002000
		PBA: BAR=1 offset=00003000

Comment 2 George Beshers 2008-07-29 16:16:11 UTC
Opened http://bugworks.engr.sgi.com/query.cgi/984894

Comment 3 George Beshers 2008-07-30 16:21:01 UTC
We need to get an errdmp; the PROM info might also be interesting.

  In deciphering the PCI BRIDGE ERROR registers, it shows that the LSI
  SAS1068 card at 0002:00:01.0 issued a DMA to addr 0x000006117c607400
  and got a coretalk read request sideband invalid error back due to
  that read. This type of error is usually the result of a memory read
  that results in an unrecoverable error.  Did you get an errdmp?  If
  so what did the FRU Summary show?  If not, can you please get one...
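
For anyone re-checking the register values by hand, here is a minimal Python sketch of the arithmetic only. It uses just the numbers printed in the panic output and the DMA address quoted above. The helper name `set_bits` is made up for illustration, and the sketch makes no claim about what the differing high bit of the address register means on this hardware.

```python
def set_bits(value):
    """Return the positions of all bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

# Values copied from the panic output in the description.
INT_STATUS = 0x4000000
ERR_ADDR_REG = 0x100006117c607400   # "PCI-X DMA Request Error Addr Reg"

# DMA address as quoted in this comment.
QUOTED_DMA_ADDR = 0x000006117c607400

# Which interrupt status bits are set? The handler's dump lists "26:".
print(set_bits(INT_STATUS))

# Which bits differ between the raw register and the quoted address?
print(set_bits(ERR_ADDR_REG ^ QUOTED_DMA_ADDR))
```

The first line of output shows that only bit 26 is set in int_status, matching the "26:" entry in the dump. The second shows that the raw register value and the quoted DMA address differ in a single high bit (bit 60), with the low-order address bits identical.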



Comment 4 Daniel Basabe 2008-08-14 11:09:01 UTC
Hi

This kernel panic was the result of a hardware memory error (a double-bit ECC error).
Now the machine works fine.


You can close this issue.

Thank you very much.

Comment 5 George Beshers 2008-08-19 17:28:30 UTC
Closing per comment #4.