Bug 602197

Summary: LSI MegaRAID controller continuously resets on RHEL5.5 and newer
Product: Red Hat Enterprise Linux 5 Reporter: Konstantin Khorenko <khorenko>
Component: kernelAssignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 5.5CC: coughlan
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-09-27 14:10:48 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Konstantin Khorenko 2010-06-09 11:42:52 UTC
Description of problem:
One our customer complains on LSI MegaRAID controller non-working on recent RHEL5 kernels.

Version-Release number of selected component (if applicable):
(5.4 release) 2.6.18-164.15.1.el5 x86_64 - works fine (and earlier versions)
(5.5 release) 2.6.18-194.3.1.el5  x86_64 - timeout/disconnect
(latest testkernel from jwilson@) 2.6.18-201.el5      x86_64 - timeout/disconnect

How reproducible:
100%

Steps to Reproduce:
Just boot the node into affected kernel and you'll see messages like following (only related messages filtered):
...
SCSI subsystem initialized
Loading sd_mod.ko module
Loading megaraid_sas.ko module
megasas: 00.00.04.17-RH1 Wed, Nov. 25, 11:41:51 PST 2009
megasas: 0x1000:0x0060:0x1000:0x100a: bus 9:slot 0:func 0
GSI 19 sharing vector 0x3A and IRQ 19
ACPI: PCI Interrupt 0000:09:00.0[A] -> GSI 42 (level, low) -> IRQ 58
megasas: FW now in Ready state
>>> (approx 5 minute delay) <<<
scsi0: LSI SAS based MegaRAID driver
>>> (approx 5 minute delay) <<<
scsi 0:0:0:0: megasas: RESET -1 cmd=12 retries=0
megasas: [ 0]waiting for 1 commands to complete
megasas: reset successful
scsi 0:0:0:0: megasas: RESET -1 cmd=0 retries=0
megasas: [ 0]waiting for 1 commands to complete
megasas: reset successful
scsi 0:0:0:0: megasas: RESET -1 cmd=12 retries=0
megasas: [ 0]waiting for 1 commands to complete
megasas: reset successful
scsi 0:0:0:0: megasas: RESET -1 cmd=0 retries=0
megasas: [ 0]waiting for 1 commands to complete
megasas: reset successful
scsi 0:0:0:0: megasas: RESET -1 cmd=12 retries=0
megasas: [ 0]waiting for 1 commands to complete
megasas: reset successful
scsi 0:0:0:0: megasas: RESET -1 cmd=0 retries=0
megasas: [ 0]waiting for 1 commands to complete
megasas: reset successful
scsi 0:0:0:0: scsi: Device offlined - not ready after error recovery
scsi 0:0:0:0: timing out command, waited 22s
  
Actual results:
Controller continuously resets and finally goes offline.

Expected results:
Controller works fine.

Additional info:
Controller: LSI MegaRAID 8708ELP (0x1000:0x0060:0x1000:0x100a), latest firmware

09:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 03)
09:00.0 0104: 1000:0060 (rev 03)
09:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 03)
	Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 8708ELP
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 177
	Region 0: Memory at d8040000 (64-bit, non-prefetchable) [size=256K]
	Region 2: I/O ports at 3000 [size=256]
	Region 3: Memory at d8000000 (64-bit, non-prefetchable) [size=256K]
	[virtual] Expansion ROM at d8500000 [disabled] [size=128K]
	Capabilities: [b0] Express Endpoint IRQ 0
		Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
		Device: Latency L0s unlimited, L1 unlimited
		Device: AtnBtn- AtnInd- PwrInd-
		Device: Errors: Correctable- Non-Fatal- Fatal+ Unsupported-
		Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
		Device: MaxPayload 128 bytes, MaxReadReq 4096 bytes
		Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 0
		Link: Latency L0s <2us, L1 unlimited
		Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
		Link: Speed 2.5Gb/s, Width x4
	Capabilities: [c4] Message Signalled Interrupts: 64bit+ Queue=0/2 Enable-
		Address: 0000000000000000  Data: 0000
	Capabilities: [d4] MSI-X: Enable- Mask- TabSize=4
		Vector table: BAR=0 offset=0003e000
		PBA: BAR=0 offset=0003f000
	Capabilities: [e0] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [ec] Vital Product Data
	Capabilities: [100] Power Budgeting
==============================================

2.6.18-164.15.1.el5 has megasas driver version 00.00.04.08-RH2
while 2.6.18-194.3.1.el5 has megasas driver version 00.00.04.17-RH1,

but the updated driver is NOT the problem:
* 00.00.04.17-RH1 driver compiled for 2.6.18-164.15.1.el5 => system works fine
* 00.00.04.08-RH2 driver compiled for 2.6.18-194.3.1.el5 => controller resets

Please, let me know if you need more info on this.
Thank you.

--
Best regards,

Konstantin Khorenko,
PVCfL/OpenVZ developer,
Parallels

Comment 1 Konstantin Khorenko 2010-06-11 15:06:38 UTC
This issue seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=578531 :

0) the system also uses bonding (mode=6 == ALB) + VLAN configuration
1) without bonding the system is stable, no controller resets.
2) changing bonding to mode=5 also makes system stable, no controller resets.

Comment 2 Konstantin Khorenko 2010-06-15 11:23:26 UTC
This is definitely bug# 578531 : on the 2.6.18-201.el5 debug kernel we've got the following messages (very similar to those in the mentioned bug):

...
Call Trace:
 <IRQ>  [<ffffffff80021d27>] netif_receive_skb+0x441/0x4b0
 [<ffffffff881ad3e1>] :e1000e:e1000_receive_skb+0x1b5/0x1d6
 [<ffffffff881b1b38>] :e1000e:e1000_clean_rx_irq+0x27a/0x321
 [<ffffffff881afbe3>] :e1000e:e1000_clean+0x7c/0x2b2
 [<ffffffff8000d13d>] net_rx_action+0xb6/0x1fc
 [<ffffffff80012ec6>] __do_softirq+0x94/0x152
 [<ffffffff800613d0>] call_softirq+0x1c/0x28
 [<ffffffff80070c30>] do_softirq+0x35/0xa0
 [<ffffffff80070bf2>] do_IRQ+0xfb/0x104
 [<ffffffff80059cea>] mwait_idle+0x0/0x54
 [<ffffffff80060652>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff80065ff3>] __sched_text_start+0xc03/0xc3e
 [<ffffffff80059d29>] mwait_idle+0x3f/0x54
 [<ffffffff80059cf3>] mwait_idle+0x9/0x54
 [<ffffffff8004bb73>] cpu_idle+0x9a/0xbd
 [<ffffffff8047a82a>] start_kernel+0x243/0x248
 [<ffffffff8047a22f>] _sinittext+0x22f/0x236


Code: 80 79 18 00 74 28 8b 01 3b 43 18 75 21 8b 41 04 3b 43 0e 75
RIP  [<ffffffff8841e96c>] :bonding:rlb_arp_recv+0xd0/0x146
  RSP <ffffffff8051bd70>

Sorry, there is no serial console on the node => the log is incomplete.

Comment 3 Tom Coughlan 2010-09-27 14:10:48 UTC
(In reply to comment #2)
> This is definitely bug# 578531 

Closing this as a duplicate. 

The fix for 578531 is planned for 5.6 beta. Please test it there.

*** This bug has been marked as a duplicate of bug 578531 ***