Description of problem: One our customer complains on LSI MegaRAID controller non-working on recent RHEL5 kernels. Version-Release number of selected component (if applicable): (5.4 release) 2.6.18-164.15.1.el5 x86_64 - works fine (and earlier versions) (5.5 release) 2.6.18-194.3.1.el5 x86_64 - timeout/disconnect (latest testkernel from jwilson@) 2.6.18-201.el5 x86_64 - timeout/disconnect How reproducible: 100% Steps to Reproduce: Just boot the node into affected kernel and you'll see messages like following (only related messages filtered): ... SCSI subsystem initialized Loading sd_mod.ko module Loading megaraid_sas.ko module megasas: 00.00.04.17-RH1 Wed, Nov. 25, 11:41:51 PST 2009 megasas: 0x1000:0x0060:0x1000:0x100a: bus 9:slot 0:func 0 GSI 19 sharing vector 0x3A and IRQ 19 ACPI: PCI Interrupt 0000:09:00.0[A] -> GSI 42 (level, low) -> IRQ 58 megasas: FW now in Ready state >>> (approx 5 minute delay) <<< scsi0: LSI SAS based MegaRAID driver >>> (approx 5 minute delay) <<< scsi 0:0:0:0: megasas: RESET -1 cmd=12 retries=0 megasas: [ 0]waiting for 1 commands to complete megasas: reset successful scsi 0:0:0:0: megasas: RESET -1 cmd=0 retries=0 megasas: [ 0]waiting for 1 commands to complete megasas: reset successful scsi 0:0:0:0: megasas: RESET -1 cmd=12 retries=0 megasas: [ 0]waiting for 1 commands to complete megasas: reset successful scsi 0:0:0:0: megasas: RESET -1 cmd=0 retries=0 megasas: [ 0]waiting for 1 commands to complete megasas: reset successful scsi 0:0:0:0: megasas: RESET -1 cmd=12 retries=0 megasas: [ 0]waiting for 1 commands to complete megasas: reset successful scsi 0:0:0:0: megasas: RESET -1 cmd=0 retries=0 megasas: [ 0]waiting for 1 commands to complete megasas: reset successful scsi 0:0:0:0: scsi: Device offlined - not ready after error recovery scsi 0:0:0:0: timing out command, waited 22s Actual results: Controller continuously resets and finally goes offline. Expected results: Controller works fine. Additional info: Controller: LSI MegaRAID 8708ELP (0x1000:0x0060:0x1000:0x100a), latest firmware 09:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 03) 09:00.0 0104: 1000:0060 (rev 03) 09:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 03) Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 8708ELP Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 177 Region 0: Memory at d8040000 (64-bit, non-prefetchable) [size=256K] Region 2: I/O ports at 3000 [size=256] Region 3: Memory at d8000000 (64-bit, non-prefetchable) [size=256K] [virtual] Expansion ROM at d8500000 [disabled] [size=128K] Capabilities: [b0] Express Endpoint IRQ 0 Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag- Device: Latency L0s unlimited, L1 unlimited Device: AtnBtn- AtnInd- PwrInd- Device: Errors: Correctable- Non-Fatal- Fatal+ Unsupported- Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ Device: MaxPayload 128 bytes, MaxReadReq 4096 bytes Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 0 Link: Latency L0s <2us, L1 unlimited Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch- Link: Speed 2.5Gb/s, Width x4 Capabilities: [c4] Message Signalled Interrupts: 64bit+ Queue=0/2 Enable- Address: 0000000000000000 Data: 0000 Capabilities: [d4] MSI-X: Enable- Mask- TabSize=4 Vector table: BAR=0 offset=0003e000 PBA: BAR=0 offset=0003f000 Capabilities: [e0] Power Management version 2 Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [ec] Vital Product Data Capabilities: [100] Power Budgeting ============================================== 2.6.18-164.15.1.el5 has megasas driver version 00.00.04.08-RH2 while 2.6.18-194.3.1.el5 has megasas driver version 00.00.04.17-RH1, but the updated driver is NOT the problem: * 00.00.04.17-RH1 driver compiled for 2.6.18-164.15.1.el5 => system works fine * 00.00.04.08-RH2 driver compiled for 2.6.18-194.3.1.el5 => controller resets Please, let me know if you need more info on this. Thank you. -- Best regards, Konstantin Khorenko, PVCfL/OpenVZ developer, Parallels
This issue seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=578531 : 0) the system also uses bonding (mode=6 == ALB) + VLAN configuration 1) without bonding the system is stable, no controller resets. 2) changing bonding to mode=5 also makes system stable, no controller resets.
This is definitely bug# 578531 : on the 2.6.18-201.el5 debug kernel we've got the following messages (very similar to those in the mentioned bug): ... Call Trace: <IRQ> [<ffffffff80021d27>] netif_receive_skb+0x441/0x4b0 [<ffffffff881ad3e1>] :e1000e:e1000_receive_skb+0x1b5/0x1d6 [<ffffffff881b1b38>] :e1000e:e1000_clean_rx_irq+0x27a/0x321 [<ffffffff881afbe3>] :e1000e:e1000_clean+0x7c/0x2b2 [<ffffffff8000d13d>] net_rx_action+0xb6/0x1fc [<ffffffff80012ec6>] __do_softirq+0x94/0x152 [<ffffffff800613d0>] call_softirq+0x1c/0x28 [<ffffffff80070c30>] do_softirq+0x35/0xa0 [<ffffffff80070bf2>] do_IRQ+0xfb/0x104 [<ffffffff80059cea>] mwait_idle+0x0/0x54 [<ffffffff80060652>] ret_from_intr+0x0/0xf <EOI> [<ffffffff80065ff3>] __sched_text_start+0xc03/0xc3e [<ffffffff80059d29>] mwait_idle+0x3f/0x54 [<ffffffff80059cf3>] mwait_idle+0x9/0x54 [<ffffffff8004bb73>] cpu_idle+0x9a/0xbd [<ffffffff8047a82a>] start_kernel+0x243/0x248 [<ffffffff8047a22f>] _sinittext+0x22f/0x236 Code: 80 79 18 00 74 28 8b 01 3b 43 18 75 21 8b 41 04 3b 43 0e 75 RIP [<ffffffff8841e96c>] :bonding:rlb_arp_recv+0xd0/0x146 RSP <ffffffff8051bd70> Sorry, there is no serial console on the node => the log is incomplete.
(In reply to comment #2) > This is definitely bug# 578531 Closing this as a duplicate. The fix for 578531 is planned for 5.6 beta. Please test it there. *** This bug has been marked as a duplicate of bug 578531 ***