234600 – [Emulex 5.2 bug] Soft lockup detected when FC storage array is disconnected while IO running

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Mike Christie
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	217217 425461

Reported:	2007-03-30 14:31 UTC by Bino J Sebastian
Modified:	2009-06-19 16:27 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-03-15 03:53:51 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Bino J Sebastian 2007-03-30 14:31:20 UTC

Description of problem:
A soft lockup is detected when the HBA cable is unplugged while IO is
running. The stack trace follows:
Mar 9 16:44:39 linux3 kernel: BUG: soft lockup detected on CPU#0! 
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b 
Mar 9 16:44:39 linux3 kernel: [<c044a0b7>] softlockup_tick+0x98/0xa6 
Mar 9 16:44:39 linux3 kernel: [<c042cc98>] update_process_times+0x39/0x5c 
Mar 9 16:44:39 linux3 kernel: [<c04176ec>] smp_apic_timer_interrupt+0x5c/0x64 
Mar 9 16:44:39 linux3 kernel: [<c04049bf>] apic_timer_interrupt+0x1f/0x24 
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b 
Mar 9 16:44:39 linux3 kernel: [<c0415921>] smp_call_function+0x99/0xc3 
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b 
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b 
Mar 9 16:44:39 linux3 kernel: [<c0427b46>] on_each_cpu+0x10/0x1f 
Mar 9 16:44:39 linux3 kernel: [<c046c9c3>] invalidate_bdev+0x23/0x2e 
Mar 9 16:44:39 linux3 kernel: [<c0470bf8>] kill_bdev+0xd/0x20 
Mar 9 16:44:39 linux3 kernel: [<c0471097>] __blkdev_put+0x3b/0x123 
Mar 9 16:44:39 linux3 kernel: [<c046b623>] __fput+0x9c/0x167 
Mar 9 16:44:39 linux3 kernel: [<c046910f>] filp_close+0x4e/0x54 
Mar 9 16:44:39 linux3 kernel: [<c0403eff>] syscall_call+0x7/0xb 


Version-Release number of selected component (if applicable):
2.6.18-8

How reproducible:
We have seen this only once, in the Emulex lab. We are trying to reproduce
the issue.

Steps to Reproduce:
1. Storage configuration: LP11002 -- Brocade 4 Gb -- Hitachi array.
2. Start IO using dt on the sd devices of the Hitachi array.
3. While IO is running, unplug the FC cable connected to the HBA.
4. Leave the cable unplugged for 30 seconds (dev_loss_tmo).
5. Reattach the cable, then check the log file for the soft lockup report.
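
A minimal sketch of the log check in the last step, in Python. It assumes
the i386 trace format quoted in this report ("BUG: soft lockup detected on
CPU#N!" followed by "[<addr>] symbol+0xoff/0xlen" frames). The function
name find_soft_lockups is illustrative, not part of any tool mentioned
here, and the regular expressions would need adjusting for other
architectures or kernel versions.

```python
import re

# Patterns for the i386 trace format quoted in this bug (an assumption;
# other architectures format their frames differently).
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def find_soft_lockups(lines):
    """Return a list of (cpu, [symbol, ...]), one entry per lockup report."""
    reports = []
    current = None  # report being collected, or None between reports
    for line in lines:
        header = LOCKUP_RE.search(line)
        if header:
            current = (int(header.group(1)), [])
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current[1].append(frame.group(1))
        elif current is not None:
            current = None  # first non-frame line ends the trace
    return reports


# Two lines taken from the trace in this report.
sample = [
    "Mar 9 16:44:39 linux3 kernel: BUG: soft lockup detected on CPU#0! ",
    "Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b ",
]
for cpu, symbols in find_soft_lockups(sample):
    print("CPU#%d: %s" % (cpu, " <- ".join(symbols)))
```

To scan a real log, pass an open file object (for example the messages
file) instead of the sample list; the function only iterates over lines.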


Actual results:
A soft lockup is detected when the HBA cable is unplugged while IO is
running. The stack trace follows:
Mar 9 16:44:39 linux3 kernel: BUG: soft lockup detected on CPU#0! 
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b 
Mar 9 16:44:39 linux3 kernel: [<c044a0b7>] softlockup_tick+0x98/0xa6 
Mar 9 16:44:39 linux3 kernel: [<c042cc98>] update_process_times+0x39/0x5c 
Mar 9 16:44:39 linux3 kernel: [<c04176ec>] smp_apic_timer_interrupt+0x5c/0x64 
Mar 9 16:44:39 linux3 kernel: [<c04049bf>] apic_timer_interrupt+0x1f/0x24 
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b 
Mar 9 16:44:39 linux3 kernel: [<c0415921>] smp_call_function+0x99/0xc3 
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b 
Mar 9 16:44:39 linux3 kernel: [<c046bece>] invalidate_bh_lru+0x0/0x3b 
Mar 9 16:44:39 linux3 kernel: [<c0427b46>] on_each_cpu+0x10/0x1f 
Mar 9 16:44:39 linux3 kernel: [<c046c9c3>] invalidate_bdev+0x23/0x2e 
Mar 9 16:44:39 linux3 kernel: [<c0470bf8>] kill_bdev+0xd/0x20 
Mar 9 16:44:39 linux3 kernel: [<c0471097>] __blkdev_put+0x3b/0x123 
Mar 9 16:44:39 linux3 kernel: [<c046b623>] __fput+0x9c/0x167 
Mar 9 16:44:39 linux3 kernel: [<c046910f>] filp_close+0x4e/0x54 
Mar 9 16:44:39 linux3 kernel: [<c0403eff>] syscall_call+0x7/0xb 


Expected results:
Rediscovery of all FC LUNs with no soft lockup.

Additional info:

Comment 1 Andrius Benokraitis 2007-04-04 20:59:45 UTC

Added to the RHEL 5.1 prioritization list.

Comment 2 RHEL Program Management 2007-04-25 20:15:53 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Andrius Benokraitis 2007-06-05 19:26:57 UTC

From Bino via email:

Andrius,
	The stack trace of bug 234600 is also present in the log file of bug
240473, but that log file contains other stacks as well. This could be due to
the multipathing driver running in the bug 240473 test environment.

	There is a good probability that the root cause of these two issues is the same.

-bino

Comment 6 Andrius Benokraitis 2007-07-30 17:51:13 UTC

Out of runway for 5.1 - deferring to 5.2.

Comment 7 Andrius Benokraitis 2007-07-30 17:52:14 UTC

Bino - are you still experiencing this with 5.1 Beta? We've heard from other
partners and customers that this may have gone away between 5.0 and 5.1.

Comment 8 Bino J Sebastian 2007-12-03 11:11:47 UTC

We are seeing a similar issue on a PPC system with RHEL 5.1. The following is
the stack trace from the PowerPC system. See also Bugzilla 234600. This bugzilla
might be related to this issue.

smp_call_function on cpu 1: other cpus not responding (2) 
1:mon> t 
[c0000000c8ad3a10] c000000000070900 .on_each_cpu+0x24/0x88 
[c0000000c8ad3ab0] c0000000000ee128 .invalidate_bh_lrus+0x28/0x40
[c0000000c8ad3b30] c0000000000f64b4 .kill_bdev+0x34/0x60 
[c0000000c8ad3bb0] c0000000000f6e8c .__blkdev_put+0x88/0x220 
[c0000000c8ad3c50] c0000000000eca1c .__fput+0x108/0x25c 
[c0000000c8ad3d00] c0000000000e8fa4 .filp_close+0xac/0xd4 
[c0000000c8ad3d90] c0000000000eacf4 .sys_close+0xc4/0x110 
[c0000000c8ad3e30] c0000000000086a4 syscall_exit+0x0/0x40

Comment 9 Bino J Sebastian 2007-12-03 20:05:40 UTC

The Bugzilla number in the above comment is wrong.
Please see Bugzilla 408541 instead.

Comment 12 Tim Mooney 2008-01-30 21:59:43 UTC

There are enough differences that these may not be related, but note that
we're seeing a soft lockup in 5.1 when trying to unmount a volume that's gone
read-only after an error.  I reported it in bug #429054; see that for more info.

In our situation, the device (/dev/md7) is a RAID1 made up of two volumes, each
coming off a different SAN array.  We're using QLogic cards, though, not Emulex.

Comment 14 Andrius Benokraitis 2008-03-07 18:33:08 UTC

Emulex - have you tested this on the latest 5.2 Beta RC bits to see if this
still persists?

Comment 17 Andrius Benokraitis 2008-03-15 03:53:51 UTC

Apparently Emulex has discovered a workaround:

"The default syslog.conf file for RHEL makes all the printk statements sync with
/var/log/messages file. After I changed the syslog.conf file to not to sync the
printk with /var/log/messages file, the issue was not re-produceable."

If this continues to be a problem, please re-open this bugzilla.
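
The comment above does not say exactly how syslog.conf was changed. One
common way with sysklogd is to prefix the log file name with "-", which
tells syslogd not to sync the file after every message. The Python sketch
below applies that prefix to lines whose action is /var/log/messages. The
function name disable_sync and the example selector line are assumptions
for illustration, not taken from Emulex's change.

```python
def disable_sync(conf_text, target="/var/log/messages"):
    """Prefix `target` with '-' on selector lines that log to it.

    Comment lines, blank lines, and lines that already carry the '-'
    prefix are returned unchanged.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        fields = stripped.split()
        is_rule = bool(stripped) and not stripped.startswith("#")
        if is_rule and len(fields) >= 2 and fields[-1] == target:
            # Rewrite only the trailing action field; the selector and
            # its whitespace stay as written.
            line = line[: line.rindex(target)] + "-" + target
        out.append(line)
    return "\n".join(out) + "\n"


# Example selector line (assumed; check the actual file on the system).
example = "*.info;mail.none;authpriv.none;cron.none\t\t/var/log/messages\n"
print(disable_sync(example), end="")
```

syslogd has to reread its configuration (for example on SIGHUP) before
the change takes effect. The trade-off is that the last messages written
before a crash may not reach the disk.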
