Description of problem:
When running I/Os with controller failover and drive failover from an Intel x86 server running RHEL 5, attached to an LSI array through an Emulex Fibre Channel HBA, the host hangs after 10 hours and reports soft lockups on CPU2 through the console. The issue has also been seen on the x64 and ia64 architectures.

Version-Release number of selected component (if applicable):
RHEL 5

How reproducible:

Steps to Reproduce:
1. Connect a server running RHEL 5 (x86, x64, or ia64) with an Emulex (2 or 4 Gb) HBA to an LSI 6091 array.
2. Map 32 LUNs from the LSI array to the host and start I/Os.
3. Start the controller failover / drive failover (CFDF) test.

Actual results:
After about 10 hours, the host freezes and the console displays the following message:

BUG: soft lockup detected on CPU#2!
 [<c044a0b7>] softlockup_tick+0x98/0xa6
 [<c042cc98>] update_process_times+0x39/0x5c
 [<c04176ec>] smp_apic_timer_interrupt+0x5c/0x64
 [<c04049bf>] apic_timer_interrupt+0x1f/0x24
 [<c05fc388>] _spin_lock+0x7/0xf
 [<f8967706>] lpfc_intr_handler+0x3f/0x228 [lpfc]
 [<c044a2e7>] handle_IRQ_event+0x23/0x49
 [<c044a3a0>] __do_IRQ+0x93/0xe8
 [<c04063f4>] do_IRQ+0x93/0xae
 [<c040492e>] common_interrupt+0x1a/0x20
 [<c0402ccf>] mwait_idle+0x25/0x38
 [<c0402c90>] cpu_idle+0x9f/0xb9
 =======================

Expected results:
The test should pass the 24-hour run.

Additional info:
HBA Vendor: Emulex
HBA Model Name: LPe11000-M4 and LP10000
HBA driver: Emulex LightPulse Fibre Channel SCSI driver 8.1.10.3
OS: RHEL 5
Kernel: Linux timon 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21 EST 2007 i686 i686 i386 GNU/Linux
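Note that everything above _spin_lock in the trace is just the soft lockup watchdog firing from the timer interrupt; the interesting frame is lpfc_intr_handler spinning on a lock. A minimal sketch of that pattern follows, assuming a hypothetical HBA structure and lock name (this is not the lpfc source):

==========
/* Illustration only -- the struct and lock name are assumptions, not
 * the actual lpfc code. If another CPU holds the lock for ~10 seconds,
 * the CPU spinning in spin_lock() here is exactly what the watchdog
 * reports as a soft lockup. */
#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct example_hba {
	spinlock_t host_lock;
};

/* 2.6.18-era interrupt handler signature */
static irqreturn_t example_intr_handler(int irq, void *dev_id,
					struct pt_regs *regs)
{
	struct example_hba *phba = dev_id;

	spin_lock(&phba->host_lock);	/* CPU#2 is stuck here in the trace */
	/* ... read interrupt status, service the response rings ... */
	spin_unlock(&phba->host_lock);

	return IRQ_HANDLED;
}
==========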
Adding Larry Troan - Larry, this was experienced with an FSC box. Can you add the appropriate people to this BZ?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Adding Emulex to this bug.
After updating the Emulex HBA driver to 8.1.10.9 and restarting the CFDF test, the kernel panicked on the x86 machine. The console output shows the messages below:

CPU 3: Machine Check Exception: 0000000000000004
CPU 2: Machine Check Exception: 0000000000000004
Kernel panic - not syncing: Unable to continue
<7>CPU 3: EIP: c0415921 EFLAGS: 00000293
Kernel panic - not syncing: Unable to continue
Created attachment 155751 [details] /var/log/messages file from the x64 server showing soft lockups
Created attachment 155913 [details] /var/log/messages from a RHEL 5 x86 server with QLogic HBAs
(In reply to comment #4)
> CPU 3: Machine Check Exception: 0000000000000004
> CPU 2: Machine Check Exception: 0000000000000004
> Kernel panic - not syncing: Unable to continue
> <7>CPU 3: EIP: c0415921 EFLAGS: 00000293
> Kernel panic - not syncing: Unable to continue

MCEs are usually indicative of hardware failures of some sort. Has this been seen on multiple systems? Is the full console log from this incident available?
(In reply to comment #7)
> (In reply to comment #4)
> > CPU 3: Machine Check Exception: 0000000000000004
> > CPU 2: Machine Check Exception: 0000000000000004
> > Kernel panic - not syncing: Unable to continue
> > <7>CPU 3: EIP: c0415921 EFLAGS: 00000293
> > Kernel panic - not syncing: Unable to continue
> MCEs are usually indicative of hardware failures of some sort. Has this been
> seen on multiple systems? Is the full console log from this incident available?

No. This happened only once, on one particular host. We determined it may have been caused by hardware overheating. The main concern now is the CPU soft lockup that is being seen on all the machines. It looks like the lockup is coming from the SCSI layer: it happens when SCSI layer functions are called by other components (like the HBA driver or the multipath driver).
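For what it's worth, the SCSI midlayer walks its device lists under the host lock, which fits that theory. Below is a simplified paraphrase of the device iteration, modeled on 2.6.18's __scsi_iterate_devices() in drivers/scsi/scsi.c (details trimmed; not the verbatim source). Any caller that holds the host lock for a long time on another CPU stalls this walk:

==========
/* Simplified paraphrase of the 2.6.18 SCSI midlayer device iteration;
 * not the verbatim kernel source. Every step re-takes shost->host_lock,
 * so paths like fc_remote_port_delete() -> target_block() serialize on
 * whoever else is holding the host lock. */
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

struct scsi_device *example_iterate_devices(struct Scsi_Host *shost,
					    struct scsi_device *prev)
{
	struct list_head *list = (prev ? &prev->siblings : &shost->__devices);
	struct scsi_device *next = NULL;
	unsigned long flags;

	spin_lock_irqsave(shost->host_lock, flags);
	while (list->next != &shost->__devices) {
		next = list_entry(list->next, struct scsi_device, siblings);
		/* skip devices we cannot take a reference to */
		if (!scsi_device_get(next))
			break;
		next = NULL;
		list = list->next;
	}
	spin_unlock_irqrestore(shost->host_lock, flags);

	if (prev)
		scsi_device_put(prev);
	return next;
}
==========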
Are there any updates from Red Hat on this issue? Is there any more information LSI can provide to help get the soft lockup issue resolved? It is blocking our testing.
Guessing at who from Fujitsu-Siemens should be added here, since Anders Karlsson handles bugs for them. This is required because this is a cross-partner bug exhibited on FSC hardware...

* Martin.Wilck

Anders, can you add anyone else from FSC you believe should be on the access list?
Note to Emulex: this may need some attention from you all, since the LSI array is attached through an Emulex HBA. Would you be able to have a first stab at this?
I am seeing the soft lockups on servers with QLogic HBAs as well.
I analyzed the log file. There are a lot of soft lockups reported in it. Some of the soft lockup stacks contain lpfc driver functions, but the real lockup is in the SCSI layer; it happens when the lpfc driver calls SCSI layer functions. Some stacks also contain mpp driver functions, and the system reports soft lockups in other places as well. From the analysis of the lpfc stack traces, it does not look like the lockup is in the lpfc driver.

========
May 30 11:48:07 chicken kernel: <EOI> [<ffffffff80062660>] _spin_unlock_irqrestore+0x8/0x9
May 30 11:48:07 chicken kernel: [<ffffffff88074260>] :scsi_mod:__scsi_iterate_devices+0x56/0x6f
May 30 11:48:07 chicken kernel: [<ffffffff880787c8>] :scsi_mod:device_block+0x0/0x5
May 30 11:48:07 chicken kernel: [<ffffffff88074e82>] :scsi_mod:starget_for_each_device+0x6c/0x7e
May 30 11:48:07 chicken kernel: [<ffffffff880783e2>] :scsi_mod:target_block+0x0/0x23
May 30 11:48:07 chicken kernel: [<ffffffff88078401>] :scsi_mod:target_block+0x1f/0x23
May 30 11:48:07 chicken kernel: [<ffffffff801a1a08>] device_for_each_child+0x1f/0x4e
May 30 11:48:07 chicken kernel: [<ffffffff88131ede>] :scsi_transport_fc:fc_remote_port_delete+0xa7/0xfd
May 30 11:48:07 chicken kernel: [<ffffffff8814f15e>] :lpfc:lpfc_nlp_list+0x386/0x555
May 30 11:48:07 chicken kernel: [<ffffffff8815721d>] :lpfc:lpfc_device_recov_mapped_node+0x23/0x62
May 30 11:48:07 chicken kernel: [<ffffffff881558b3>] :lpfc:lpfc_disc_state_machine+0x9b/0x14e
May 30 11:48:07 chicken kernel: [<ffffffff8814b114>] :lpfc:lpfc_rscn_recovery_check+0x9b/0xd0
May 30 11:48:07 chicken kernel: [<ffffffff8814bfed>] :lpfc:lpfc_els_unsol_event+0x7b9/0x10eb
May 30 11:48:07 chicken kernel: [<ffffffff88143a50>] :lpfc:lpfc_sli_process_unsol_iocb+0x9c/0x102
May 30 11:48:07 chicken kernel: [<ffffffff88147837>] :lpfc:lpfc_sli_handle_slow_ring_event+0x7ab/0x82c
May 30 11:48:07 chicken kernel: [<ffffffff88150a00>] :lpfc:lpfc_do_work+0x72c/0x963
May 30 11:48:07 chicken kernel: [<ffffffff8009b666>] autoremove_wake_function+0x0/0x2e
May 30 11:48:07 chicken kernel: [<ffffffff881502d4>] :lpfc:lpfc_do_work+0x0/0x963
May 30 11:48:07 chicken kernel: [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
May 30 11:48:07 chicken kernel: [<ffffffff8003216e>] kthread+0xfe/0x132
May 30 11:48:07 chicken kernel: [<ffffffff8005bfe5>] child_rip+0xa/0x11
May 30 11:48:07 chicken kernel: [<ffffffff8009b4a3>] keventd_create_kthread+0x0/0x61
May 30 11:48:07 chicken kernel: [<ffffffff80032070>] kthread+0x0/0x132
May 30 11:48:08 chicken kernel: [<ffffffff8005bfdb>] child_rip+0x0/0x11
=======

Can LSI investigate whether the following stack trace is a real lockup?
==========
May 28 15:24:20 chicken kernel: <EOI> [<ffffffff8008bc62>] vprintk+0x29e/0x2ea
May 28 15:24:20 chicken kernel: [<ffffffff80019cc8>] vsnprintf+0x559/0x59e
May 28 15:24:20 chicken kernel: [<ffffffff80019aaa>] vsnprintf+0x33b/0x59e
May 28 15:24:20 chicken kernel: [<ffffffff8008bd00>] printk+0x52/0xbd
May 28 15:24:20 chicken kernel: [<ffffffff80019cc8>] vsnprintf+0x559/0x59e
May 28 15:24:20 chicken kernel: [<ffffffff880ccbd0>] :mppUpper:mpp_ErrorLog+0xec/0xfc
May 28 15:24:20 chicken kernel: [<ffffffff880ca8e7>] :mppUpper:mppLnx_isRemovedDevice+0x8c/0xfc
May 28 15:24:20 chicken kernel: [<ffffffff880cbde9>] :mppUpper:mpp_SysdepSynchronousIo+0xec/0x1d9
May 28 15:24:20 chicken kernel: [<ffffffff880c5fbf>] :mppUpper:mpp_AnalyseSyncError+0x2db/0x316
May 28 15:24:20 chicken kernel: [<ffffffff880c60bb>] :mppUpper:mpp_SynchronousIo+0xc1/0x146
May 28 15:24:20 chicken kernel: [<ffffffff880c6a9e>] :mppUpper:mpp_SynchInquiry+0x9a/0x11d
May 28 15:24:20 chicken kernel: [<ffffffff880c6be3>] :mppUpper:mpp_CheckSwVersion+0x59/0xd6
May 28 15:24:20 chicken kernel: [<ffffffff880c7d77>] :mppUpper:mpp_FailBackScan+0x160/0x346
May 28 15:24:20 chicken kernel: [<ffffffff88173ca6>] :mppVhba:mppLnx_failback_handler+0xa1/0x1cc
May 28 15:24:20 chicken kernel: [<ffffffff8005bfe5>] child_rip+0xa/0x11
May 28 15:24:20 chicken kernel: [<ffffffff88173c05>] :mppVhba:mppLnx_failback_handler+0x0/0x1cc
May 28 15:24:20 chicken kernel: [<ffffffff8005bfdb>] child_rip+0x0/0x11
==========

Could this lockup be due to the large amount of logging from the mpp driver?
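If the logging volume does turn out to be the problem, one standard mitigation would be to gate the error path behind printk_ratelimit(), which is available in 2.6.18. A sketch, assuming the log call can be wrapped (the wrapper below is hypothetical, not the LSI mpp source):

==========
/* Hypothetical rate-limited wrapper for the mpp error logging seen in
 * the trace -- an assumption for illustration, not the LSI source.
 * printk_ratelimit() drops excess messages so an error burst during
 * failover cannot monopolize the console/serial path for seconds. */
#include <linux/kernel.h>

void example_mpp_error_log(const char *fmt, ...)
{
	va_list args;

	if (!printk_ratelimit())	/* too many messages recently: drop */
		return;

	va_start(args, fmt);
	vprintk(fmt, args);
	va_end(args);
}
==========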
We are investigating our failover driver (mpp) at LSI, but there are a lot of lockups that don't involve it.

May 30 11:48:08 chicken kernel: Call Trace:
May 30 11:48:08 chicken kernel: <IRQ> [<ffffffff800b2c85>] softlockup_tick+0xdb/0xed
May 30 11:48:08 chicken kernel: [<ffffffff800933d1>] update_process_times+0x42/0x68
May 30 11:48:08 chicken kernel: [<ffffffff8020e0de>] neigh_timer_handler+0x0/0x2fd
May 30 11:48:08 chicken kernel: [<ffffffff80073d97>] smp_local_timer_interrupt+0x23/0x47
May 30 11:48:08 chicken kernel: [<ffffffff80074459>] smp_apic_timer_interrupt+0x41/0x47
May 30 11:48:08 chicken kernel: [<ffffffff8005bcc2>] apic_timer_interrupt+0x66/0x6c
May 30 11:48:08 chicken kernel: [<ffffffff80140f0e>] swiotlb_unmap_sg+0x0/0x126
May 30 11:48:08 chicken kernel: [<ffffffff80092bee>] run_timer_softirq+0x12a/0x1b0
May 30 11:48:08 chicken kernel: [<ffffffff80011c19>] __do_softirq+0x5e/0xd5
May 30 11:48:08 chicken kernel: [<ffffffff80075656>] end_level_ioapic_vector+0x9/0x16
May 30 11:48:08 chicken kernel: [<ffffffff8005c330>] call_softirq+0x1c/0x28
May 30 11:48:08 chicken kernel: [<ffffffff8006a310>] do_softirq+0x2c/0x85
May 30 11:48:08 chicken kernel: [<ffffffff8006a198>] do_IRQ+0xec/0xf5
May 30 11:48:08 chicken kernel: [<ffffffff8005b649>] ret_from_intr+0x0/0xa
May 30 11:48:08 chicken kernel: <EOI>
May 30 11:48:08 chicken kernel: BUG: soft lockup detected on CPU#1!
May 30 11:48:08 chicken kernel:
May 30 11:48:08 chicken kernel: Call Trace:
May 30 11:48:08 chicken kernel: <IRQ> [<ffffffff800b2c85>] softlockup_tick+0xdb/0xed
May 30 11:48:08 chicken kernel: [<ffffffff800933d1>] update_process_times+0x42/0x68
May 30 11:48:08 chicken kernel: [<ffffffff80073d97>] smp_local_timer_interrupt+0x23/0x47
May 30 11:48:08 chicken kernel: [<ffffffff80074459>] smp_apic_timer_interrupt+0x41/0x47
May 30 11:48:08 chicken kernel: [<ffffffff80054f13>] mwait_idle+0x0/0x4a
May 30 11:48:08 chicken kernel: [<ffffffff8005bcc2>] apic_timer_interrupt+0x66/0x6c
May 30 11:48:09 chicken kernel: <EOI> [<ffffffff80054f49>] mwait_idle+0x36/0x4a
May 30 11:48:09 chicken kernel: [<ffffffff80046f9c>] cpu_idle+0x95/0xb8
May 30 11:48:09 chicken kernel: [<ffffffff80073bb5>] start_secondary+0x45a/0x469

I see a lot of the smp_local_timer entries. Can Red Hat show us how to trace those down in the kernel?
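For context on the smp_local_timer/softlockup_tick frames: those are the detector itself, not the culprit. Below is a rough paraphrase of what the 2.6.18 soft lockup detector (kernel/softlockup.c) does, simplified and not the verbatim source. The frames below the <EOI> marker in each trace show what the CPU was actually stuck doing:

==========
/* Rough paraphrase of the 2.6.18 soft lockup detector; simplified, not
 * the verbatim kernel/softlockup.c. A per-CPU watchdog thread refreshes
 * touch_timestamp whenever it gets to run; this check fires from the
 * local timer interrupt and complains if that thread has been starved
 * for about 10 seconds, then dumps the current backtrace. */
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(unsigned long, touch_timestamp);

/* called on every CPU from the per-CPU timer interrupt */
void example_softlockup_tick(void)
{
	unsigned long touched = __get_cpu_var(touch_timestamp);

	if (time_after(jiffies, touched + 10 * HZ)) {
		printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n",
		       raw_smp_processor_id());
		dump_stack();	/* the backtrace that appears in the log */
	}
}
==========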
We are out of runway for 5.1, so deferring to 5.2. Can LSI/FSC/Emulex verify whether this is still an issue with the 5.1 Beta? We've heard reports that it has been resolved in the 5.1 Beta.
Can LSI/FSC/Emulex verify whether this is still an issue with 5.1? We've heard reports that it has been resolved in 5.1.
We haven't seen this problem so far. If there are still PRIMERGY hardware-related issues, I'd like to know the model, BIOS version, etc.