Bug 240473 - [LSI-E/FSC/Emulex 5.2 bug] System Hangs due to Kernel Soft Lockup on CPU2
Summary: [LSI-E/FSC/Emulex 5.2 bug] System Hangs due to Kernel Soft Lockup on CPU2
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: i386
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Tom Coughlan
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 217105 217217 246139 296411 372911 420521 422431 422441
 
Reported: 2007-05-17 19:01 UTC by Abdel Sadek
Modified: 2018-10-19 23:19 UTC
11 users

Fixed In Version: 5.1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-12-10 16:23:52 UTC
Target Upstream Version:
Embargoed:


Attachments
/var/log/messages file from the x64 server showing soft lockups (3.34 MB, text/plain)
2007-05-30 21:38 UTC, Abdel Sadek
/var/log/messages from a RHEL 5 x86 server with Qlogic hbas (70.74 KB, application/zip)
2007-06-01 17:50 UTC, Abdel Sadek

Description Abdel Sadek 2007-05-17 19:01:18 UTC
Description of problem:
When running I/O with controller failover and drive failover from an Intel x86
server running RHEL 5, attached to an LSI array through an Emulex Fibre Channel
HBA, the host hangs after 10 hours and reports soft lockups on CPU2 on the
console.
The issue has also been seen on x64 and ia64 architectures.
Version-Release number of selected component (if applicable):
RHEL 5

How reproducible:


Steps to Reproduce:
1. Connect a server running RHEL 5 (x86, x64 or ia64) with an Emulex (2 or 4
Gb) HBA to an LSI 6091 array.
2. Map 32 LUNs from the LSI array to the host and start I/O.
3. Start the controller failover / drive failover (CFDF) test.
  
Actual results:
After about 10 hours, the host will freeze and the console will display the 
message:
BUG: soft lockup detected on CPU#2!
 [<c044a0b7>] softlockup_tick+0x98/0xa6
 [<c042cc98>] update_process_times+0x39/0x5c
 [<c04176ec>] smp_apic_timer_interrupt+0x5c/0x64
 [<c04049bf>] apic_timer_interrupt+0x1f/0x24
 [<c05fc388>] _spin_lock+0x7/0xf
 [<f8967706>] lpfc_intr_handler+0x3f/0x228 [lpfc]
 [<c044a2e7>] handle_IRQ_event+0x23/0x49
 [<c044a3a0>] __do_IRQ+0x93/0xe8
 [<c04063f4>] do_IRQ+0x93/0xae
 [<c040492e>] common_interrupt+0x1a/0x20
 [<c0402ccf>] mwait_idle+0x25/0x38
 [<c0402c90>] cpu_idle+0x9f/0xb9
 =======================


Expected results:
The test should pass the 24-hour run.

Additional info:
HBA Vendor: Emulex
HBA Model Name: LPe11000-M4 and LP10000
HBA driver: Emulex LightPulse Fibre Channel SCSI driver 8.1.10.3
OS: RHEL 5
Kernel: Linux timon 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21 EST 2007 i686 i686 
i386 GNU/Linux

Comment 1 Andrius Benokraitis 2007-05-18 14:22:15 UTC
Adding Larry Troan - Larry, this was experienced with an FSC box. Can you add
the appropriate people to this BZ?

Comment 2 RHEL Program Management 2007-05-18 14:24:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Andrius Benokraitis 2007-05-21 04:09:26 UTC
Adding Emulex to this bug.

Comment 4 Abdel Sadek 2007-05-24 14:13:23 UTC
After updating the Emulex HBA driver to 8.1.10.9 and restarting the CFDF test,
the kernel panicked on the x86 machine.
The console output shows the messages below:

CPU 3: Machine Check Exception: 0000000000000004
CPU 2: Machine Check Exception: 0000000000000004
Kernel panic - not syncing: Unable to continue
 <7>CPU 3: EIP: c0415921 EFLAGS: 00000293
Kernel panic - not syncing: Unable to continue



Comment 5 Abdel Sadek 2007-05-30 21:38:45 UTC
Created attachment 155751 [details]
/var/log/messages file from the x64 server showing soft lockups

Comment 6 Abdel Sadek 2007-06-01 17:50:27 UTC
Created attachment 155913 [details]
/var/log/messages from a RHEL 5 x86 server with Qlogic hbas

Comment 7 David Aquilina 2007-06-01 21:56:34 UTC
(In reply to comment #4)
> CPU 3: Machine Check Exception: 0000000000000004
> CPU 2: Machine Check Exception: 0000000000000004
> Kernel panic - not syncing: Unable to continue
>  <7>CPU 3: EIP: c0415921 EFLAGS: 00000293
> Kernel panic - not syncing: Unable to continue

MCEs are usually indicative of hardware failures of some sort. Has this been
seen on multiple systems? Is the full console log from this incident available? 


Comment 8 Abdel Sadek 2007-06-01 22:04:19 UTC
(In reply to comment #7)
> (In reply to comment #4)
> > CPU 3: Machine Check Exception: 0000000000000004
> > CPU 2: Machine Check Exception: 0000000000000004
> > Kernel panic - not syncing: Unable to continue
> >  <7>CPU 3: EIP: c0415921 EFLAGS: 00000293
> > Kernel panic - not syncing: Unable to continue
> MCEs are usually indicative of hardware failures of some sort. Has this been
> seen on multiple systems? Is the full console log from this incident 
available? 

No. This happened only once, on one particular host. We figured out that it may
have been caused by some hardware overheating.
Now the main concern is the CPU soft lockup that is being seen on all the
machines. It looks like the lockup is coming from the SCSI layer stack. It
happened when SCSI layer functions were being called by other components (like
the HBA driver or the multipath driver).


Comment 10 Abdel Sadek 2007-06-04 19:06:46 UTC
Are there any updates from Red Hat on this issue? Is there any more info that
LSI can provide to get the soft lockup issue resolved? It's blocking our testing.

Comment 11 Larry Troan 2007-06-04 19:33:19 UTC
Guessing who at Fujitsu-Siemens should be added here, since Anders Karlsson
handles bugs for them. This is required because this is a cross-partner bug
exhibited on FSC hardware...
* Martin.Wilck

Anders, can you add anyone else from FSC you believe should be on the access list?

Comment 12 Andrius Benokraitis 2007-06-04 20:45:18 UTC
Note to Emulex: This may need some attention from you all; the LSI array is
attached to an Emulex controller. Would you be able to have a first stab at this?

Comment 13 Abdel Sadek 2007-06-04 21:06:04 UTC
I am seeing the soft lockups on Servers with Qlogic HBAs as well.

Comment 14 Bino J Sebastian 2007-06-04 21:38:32 UTC
I analyzed the log file. There are a lot of soft lockups reported in it.

Some soft lockup stacks contain lpfc driver functions, but the real lockup
is in the SCSI layer. The lockup happened when the lpfc driver called SCSI
layer functions.

Some soft lockups contain mpp driver functions as well.

The system is also reporting soft lockups in other places.

From the analysis of the lpfc stack traces, it does not look like the
lockup is in the lpfc driver.
========
May 30 11:48:07 chicken kernel:  <EOI>  [<ffffffff80062660>] 
_spin_unlock_irqrestore+0x8/0x9
May 30 11:48:07 chicken kernel:  
[<ffffffff88074260>] :scsi_mod:__scsi_iterate_devices+0x56/0x6f
May 30 11:48:07 chicken kernel:  
[<ffffffff880787c8>] :scsi_mod:device_block+0x0/0x5
May 30 11:48:07 chicken kernel:  
[<ffffffff88074e82>] :scsi_mod:starget_for_each_device+0x6c/0x7e
May 30 11:48:07 chicken kernel:  
[<ffffffff880783e2>] :scsi_mod:target_block+0x0/0x23
May 30 11:48:07 chicken kernel:  
[<ffffffff88078401>] :scsi_mod:target_block+0x1f/0x23
May 30 11:48:07 chicken kernel:  [<ffffffff801a1a08>] 
device_for_each_child+0x1f/0x4e
May 30 11:48:07 chicken kernel:  
[<ffffffff88131ede>] :scsi_transport_fc:fc_remote_port_delete+0xa7/0xfd
May 30 11:48:07 chicken kernel:  
[<ffffffff8814f15e>] :lpfc:lpfc_nlp_list+0x386/0x555
May 30 11:48:07 chicken kernel:  
[<ffffffff8815721d>] :lpfc:lpfc_device_recov_mapped_node+0x23/0x62
May 30 11:48:07 chicken kernel:  
[<ffffffff881558b3>] :lpfc:lpfc_disc_state_machine+0x9b/0x14e
May 30 11:48:07 chicken kernel:  
[<ffffffff8814b114>] :lpfc:lpfc_rscn_recovery_check+0x9b/0xd0
May 30 11:48:07 chicken kernel:  
[<ffffffff8814bfed>] :lpfc:lpfc_els_unsol_event+0x7b9/0x10eb
May 30 11:48:07 chicken kernel:  
[<ffffffff88143a50>] :lpfc:lpfc_sli_process_unsol_iocb+0x9c/0x102
May 30 11:48:07 chicken kernel:  
[<ffffffff88147837>] :lpfc:lpfc_sli_handle_slow_ring_event+0x7ab/0x82c
May 30 11:48:07 chicken kernel:  
[<ffffffff88150a00>] :lpfc:lpfc_do_work+0x72c/0x963
May 30 11:48:07 chicken kernel:  [<ffffffff8009b666>] 
autoremove_wake_function+0x0/0x2e
May 30 11:48:07 chicken kernel:  
[<ffffffff881502d4>] :lpfc:lpfc_do_work+0x0/0x963
May 30 11:48:07 chicken kernel:  [<ffffffff8009b4a3>] 
keventd_create_kthread+0x0/0x61
May 30 11:48:07 chicken kernel:  [<ffffffff8003216e>] kthread+0xfe/0x132
May 30 11:48:07 chicken kernel:  [<ffffffff8005bfe5>] child_rip+0xa/0x11
May 30 11:48:07 chicken kernel:  [<ffffffff8009b4a3>] 
keventd_create_kthread+0x0/0x61
May 30 11:48:07 chicken kernel:  [<ffffffff80032070>] kthread+0x0/0x132
May 30 11:48:08 chicken kernel:  [<ffffffff8005bfdb>] child_rip+0x0/0x11
=======
Can LSI investigate whether the following stack trace is a real lockup?
==========
May 28 15:24:20 chicken kernel:  <EOI>  [<ffffffff8008bc62>] vprintk+0x29e/0x2ea
May 28 15:24:20 chicken kernel:  [<ffffffff80019cc8>] vsnprintf+0x559/0x59e
May 28 15:24:20 chicken kernel:  [<ffffffff80019aaa>] vsnprintf+0x33b/0x59e
May 28 15:24:20 chicken kernel:  [<ffffffff8008bd00>] printk+0x52/0xbd
May 28 15:24:20 chicken kernel:  [<ffffffff80019cc8>] vsnprintf+0x559/0x59e
May 28 15:24:20 chicken kernel:  
[<ffffffff880ccbd0>] :mppUpper:mpp_ErrorLog+0xec/0xfc
May 28 15:24:20 chicken kernel:  
[<ffffffff880ca8e7>] :mppUpper:mppLnx_isRemovedDevice+0x8c/0xfc
May 28 15:24:20 chicken kernel:  
[<ffffffff880cbde9>] :mppUpper:mpp_SysdepSynchronousIo+0xec/0x1d9
May 28 15:24:20 chicken kernel:  
[<ffffffff880c5fbf>] :mppUpper:mpp_AnalyseSyncError+0x2db/0x316
May 28 15:24:20 chicken kernel:  
[<ffffffff880c60bb>] :mppUpper:mpp_SynchronousIo+0xc1/0x146
May 28 15:24:20 chicken kernel:  
[<ffffffff880c6a9e>] :mppUpper:mpp_SynchInquiry+0x9a/0x11d
May 28 15:24:20 chicken kernel:  
[<ffffffff880c6be3>] :mppUpper:mpp_CheckSwVersion+0x59/0xd6
May 28 15:24:20 chicken kernel:  
[<ffffffff880c7d77>] :mppUpper:mpp_FailBackScan+0x160/0x346
May 28 15:24:20 chicken kernel:  
[<ffffffff88173ca6>] :mppVhba:mppLnx_failback_handler+0xa1/0x1cc
May 28 15:24:20 chicken kernel:  [<ffffffff8005bfe5>] child_rip+0xa/0x11
May 28 15:24:20 chicken kernel:  
[<ffffffff88173c05>] :mppVhba:mppLnx_failback_handler+0x0/0x1cc
May 28 15:24:20 chicken kernel:  [<ffffffff8005bfdb>] child_rip+0x0/0x11
==========

Could this lockup be due to the large amount of logging from the mpp driver?


Comment 16 Abdel Sadek 2007-06-21 21:45:49 UTC
we are investigating our failover driver mpp at LSI. but there is a lot of
lockups that don't involve it. 
May 30 11:48:08 chicken kernel: Call Trace:
May 30 11:48:08 chicken kernel:  <IRQ>  [<ffffffff800b2c85>]
softlockup_tick+0xdb/0xed
May 30 11:48:08 chicken kernel:  [<ffffffff800933d1>] update_process_times+0x42/0x68
May 30 11:48:08 chicken kernel:  [<ffffffff8020e0de>] neigh_timer_handler+0x0/0x2fd
May 30 11:48:08 chicken kernel:  [<ffffffff80073d97>]
smp_local_timer_interrupt+0x23/0x47
May 30 11:48:08 chicken kernel:  [<ffffffff80074459>]
smp_apic_timer_interrupt+0x41/0x47
May 30 11:48:08 chicken kernel:  [<ffffffff8005bcc2>] apic_timer_interrupt+0x66/0x6c
May 30 11:48:08 chicken kernel:  [<ffffffff80140f0e>] swiotlb_unmap_sg+0x0/0x126
May 30 11:48:08 chicken kernel:  [<ffffffff80092bee>] run_timer_softirq+0x12a/0x1b0
May 30 11:48:08 chicken kernel:  [<ffffffff80011c19>] __do_softirq+0x5e/0xd5
May 30 11:48:08 chicken kernel:  [<ffffffff80075656>]
end_level_ioapic_vector+0x9/0x16
May 30 11:48:08 chicken kernel:  [<ffffffff8005c330>] call_softirq+0x1c/0x28
May 30 11:48:08 chicken kernel:  [<ffffffff8006a310>] do_softirq+0x2c/0x85
May 30 11:48:08 chicken kernel:  [<ffffffff8006a198>] do_IRQ+0xec/0xf5
May 30 11:48:08 chicken kernel:  [<ffffffff8005b649>] ret_from_intr+0x0/0xa
May 30 11:48:08 chicken kernel:  <EOI> 
May 30 11:48:08 chicken kernel: BUG: soft lockup detected on CPU#1!
May 30 11:48:08 chicken kernel: 
May 30 11:48:08 chicken kernel: Call Trace:
May 30 11:48:08 chicken kernel:  <IRQ>  [<ffffffff800b2c85>]
softlockup_tick+0xdb/0xed
May 30 11:48:08 chicken kernel:  [<ffffffff800933d1>] update_process_times+0x42/0x68
May 30 11:48:08 chicken kernel:  [<ffffffff80073d97>]
smp_local_timer_interrupt+0x23/0x47
May 30 11:48:08 chicken kernel:  [<ffffffff80074459>]
smp_apic_timer_interrupt+0x41/0x47
May 30 11:48:08 chicken kernel:  [<ffffffff80054f13>] mwait_idle+0x0/0x4a
May 30 11:48:08 chicken kernel:  [<ffffffff8005bcc2>] apic_timer_interrupt+0x66/0x6c
May 30 11:48:09 chicken kernel:  <EOI>  [<ffffffff80054f49>] mwait_idle+0x36/0x4a
May 30 11:48:09 chicken kernel:  [<ffffffff80046f9c>] cpu_idle+0x95/0xb8
May 30 11:48:09 chicken kernel:  [<ffffffff80073bb5>] start_secondary+0x45a/0x469

I see a lot of the smp_local_timer entries.
Can Red Hat show us how to trace those down into the kernel?
  

Comment 22 Andrius Benokraitis 2007-07-30 17:53:41 UTC
We are out of runway for 5.1, so deferring to 5.2.

Can LSI/FSC/Emulex verify whether this is still an issue with 5.1 Beta? We've
heard reports this issue has been resolved with 5.1 Beta.

Comment 25 Andrius Benokraitis 2007-12-07 18:25:19 UTC
Can LSI/FSC/Emulex verify whether this is still an issue with 5.1? We've heard
reports this issue has been resolved with 5.1.

Comment 26 Martin Wilck 2007-12-10 14:12:07 UTC
We haven't seen this problem so far. If there are still PRIMERGY
hardware-related issues, I'd like to know the model, BIOS version, etc.

