Bug 408541 - [Emulex 5.6 bug] lpfc: System hangs when there are lot of messages printed to the console
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All Linux
Priority: high Severity: high
Target Milestone: ---
Target Release: 5.6
Assigned To: Steve Best
QA Contact: Red Hat Kernel QE team
Keywords: OtherQA
Depends On:
Blocks: 557597 525215
Reported: 2007-12-03 06:04 EST by Bino J Sebastian
Modified: 2010-11-09 07:39 EST

Doc Type: Bug Fix
Last Closed: 2010-07-27 11:23:28 EDT
Description Bino J Sebastian 2007-12-03 06:04:15 EST
Description:
Description of problem:
When an FC HBA loses connectivity to storage for a long time, the HBA driver
starts returning error 0x10000 (DID_NO_CONNECT) for every SCSI command so that
upper-layer drivers can fail the command over to other paths. For each failed
SCSI command, the SCSI layer and the block layer each log an error message.
When a large number of error messages are logged, a PowerPC system running
RHEL5 or RHEL5.1 crashes with the following stack trace.

smp_call_function on cpu 1: other cpus not responding (2) 
1:mon> t 
[c0000000c8ad3a10] c000000000070900 .on_each_cpu+0x24/0x88 
[c0000000c8ad3ab0] c0000000000ee128 .invalidate_bh_lrus+0x28/0x40
[c0000000c8ad3b30] c0000000000f64b4 .kill_bdev+0x34/0x60 
[c0000000c8ad3bb0] c0000000000f6e8c .__blkdev_put+0x88/0x220 
[c0000000c8ad3c50] c0000000000eca1c .__fput+0x108/0x25c 
[c0000000c8ad3d00] c0000000000e8fa4 .filp_close+0xac/0xd4 
[c0000000c8ad3d90] c0000000000eacf4 .sys_close+0xc4/0x110 
[c0000000c8ad3e30] c0000000000086a4 syscall_exit+0x0/0x40 

Whenever this happens, at least one CPU is in the printk function.
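
The error value quoted in the description, 0x10000, is the DID_NO_CONNECT host code (0x01) placed in the host byte, bits 16-23, of the 32-bit SCSI result word. The following is a minimal sketch of that encoding, not code from the lpfc driver. The constant values match the Linux SCSI headers; `make_host_result` is a hypothetical helper introduced here only for illustration.

```c
#include <stdio.h>

/* Host-byte codes as defined in the Linux SCSI headers. */
#define DID_OK         0x00
#define DID_NO_CONNECT 0x01

/* Extract the host byte (bits 16-23) from a SCSI result word. */
static unsigned int host_byte(unsigned int result)
{
    return (result >> 16) & 0xff;
}

/*
 * Hypothetical helper (not a kernel function): build a result word that
 * carries only a host-byte code, with all other bytes zero.
 */
static unsigned int make_host_result(unsigned int host_code)
{
    return (host_code & 0xff) << 16;
}

/* Print a result word together with its decoded host byte. */
static void print_result(unsigned int result)
{
    printf("result = 0x%x, host byte = 0x%02x\n", result, host_byte(result));
}
```

Calling `print_result(make_host_result(DID_NO_CONNECT))` prints `result = 0x10000, host byte = 0x01`, which is the value the driver returns for each command while the link is down.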

Version-Release number of selected component (if applicable):
2.6.18-53-el5

How reproducible:
100%

Steps to Reproduce:
1. Connect an lpfc HBA to a SAN with at least 32 LUNs
2. Start heavy I/O to all the LUNs
3. Unplug the cable from the HBA and leave it unplugged for 60 seconds.
  
Actual results:
System crashes with the following stack trace:
1:mon> t
[c0000000c8ad3a10] c000000000070900 .on_each_cpu+0x24/0x88 
[c0000000c8ad3ab0] c0000000000ee128 .invalidate_bh_lrus+0x28/0x40
[c0000000c8ad3b30] c0000000000f64b4 .kill_bdev+0x34/0x60 
[c0000000c8ad3bb0] c0000000000f6e8c .__blkdev_put+0x88/0x220 
[c0000000c8ad3c50] c0000000000eca1c .__fput+0x108/0x25c 
[c0000000c8ad3d00] c0000000000e8fa4 .filp_close+0xac/0xd4 
[c0000000c8ad3d90] c0000000000eacf4 .sys_close+0xc4/0x110 
[c0000000c8ad3e30] c0000000000086a4 syscall_exit+0x0/0x40 

Expected results:
SCSI errors are reported. In single-path environments, applications issuing I/O
will fail. In multipath environments, the multipath software fails over without
any application errors.

Additional info:
======
Please also see bugzilla 234600. This may be related.
After removing the /var/log/messages line from the syslog.conf file and setting
/proc/sys/kernel/printk to 1, this issue is not reproducible.
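
The console loglevel part of this workaround can be sketched in C as below. The printk sysctl file holds four integers, and writing a single number changes only the first one, the console loglevel. The function name `set_console_loglevel` and the 0-8 range check are choices made for this sketch, not part of any kernel or library API. Writing to the real sysctl file requires root.

```c
#include <stdio.h>

/* Location of the printk sysctl on Linux. */
#define PRINTK_SYSCTL "/proc/sys/kernel/printk"

/*
 * Hypothetical helper: write a console loglevel to the given sysctl file.
 * The path is a parameter so the function can be tried on a scratch file.
 * The 0-8 range check is an assumption of this sketch.
 * Returns 0 on success, -1 on failure.
 */
static int set_console_loglevel(const char *path, int level)
{
    FILE *f;
    int ok;

    if (level < 0 || level > 8)
        return -1;

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    /* A single value updates only the first field (console loglevel). */
    ok = fprintf(f, "%d\n", level) > 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling `set_console_loglevel(PRINTK_SYSCTL, 1)` as root corresponds to the setting described above: only the most severe messages still reach the console, so the flood of per-command SCSI and block layer errors is no longer printed there.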
Comment 1 Tom Coughlan 2008-02-27 13:30:21 EST
(In reply to comment #0)

> When
> there are lot of error messages logged the PowerPC system running RHEL5 or
> RHEL5.1 crash with following stack trace.

Bino, does this appear to be specific to the PPC? Have you tried other archs? 
Comment 2 Bino J Sebastian 2008-02-27 13:55:02 EST
We have seen this issue on other architectures as well. I do not think this
is a PPC-specific issue.
Comment 3 Rob Evers 2009-07-01 13:10:59 EDT
Hi Bino,

Is this problem still occurring with RHEL5.3?

Rob
Comment 4 Bino J Sebastian 2009-07-02 09:23:13 EDT
The Emulex testing team will retest this and update the bugzilla. Which kernel version has the fix for this issue?

-bino
Comment 5 Rob Evers 2009-07-02 10:04:41 EDT
Since this bugzilla was reported over a year ago, I am trying to confirm whether the problem still exists, or has been fixed.  I am currently unaware of a fix for this problem.

I would simply like to confirm whether the problem still exists in the latest released version of rhel, which would be 5.3.

If possible, it might be worthwhile trying the beta version of rhel5.4 as well, which is available at:

http://people.redhat.com/dzickus/el5/156.el5/

Thanks for your help, Rob
Comment 6 Bino J Sebastian 2009-07-02 12:38:23 EDT
This issue is reproducible with RHEL5.3.
Comment 7 Andrius Benokraitis 2009-07-02 13:47:00 EDT
At this point in the RHEL 5.4 dev cycle, I'd say unless we have a patch in-hand ASAP, this will have to be deferred to 5.5 with a possibility of 5.4.z if warranted.
Comment 8 Rob Evers 2009-10-01 15:13:13 EDT
Have not started examining this problem.
Comment 10 Rob Evers 2010-01-04 14:45:38 EST
No luck reproducing this using recent rhel5.  Used 41 LUNs and ran dt against the LUNs with the following command:

echo "start dt"; date; ./dt log=mpath0p1.log of=./mnts/mpath0p1/test_file limit=$2 bs=256k procs=2 flags=direct disable=pstats runtime=$1; echo "end dt"; date

Varied limit between 1M and several megabytes.

The system I'm using is a somewhat older hp-dl585 with 8 cpus running at 1.8 GHz.  The system has 32 GB of memory.  The lpfc adapter as shown by lspci:

07:0a.0 Fibre Channel: Emulex Corporation Thor LightPulse Fibre Channel Host Adapter (rev 01)

Multipath.conf is all commented out save user-friendly names.  Multipathd was running.

A single link is attached to the storage arrays though redundant paths exist.

Please provide other information related to the setup that reproduces this problem.

Thanks, Rob
Comment 11 Bino J Sebastian 2010-01-18 13:23:06 EST
Have you tried this on a PowerPC server?
Also, the test requires error injection scripts to force failover between paths.

-bino
Comment 13 Andrius Benokraitis 2010-02-01 15:48:35 EST
Bino - I don't think we have ready access to a ppc box. Is this only happening on ppc? Has IBM been involved with this, by any chance?
Comment 14 Rob Evers 2010-02-03 17:55:08 EST
Linda,

Can someone on your team follow up on this and try to get a dump?  Initial bug entry indicates a problem with printing kernel messages.  I was not able to reproduce this.

Rob
Comment 15 Steve Best 2010-02-12 13:44:24 EST
Bino,

Does this happen on 5.5? If possible, could you get a dump, as Rob has suggested?

Thanks,
Steve
Comment 17 Andrius Benokraitis 2010-02-17 15:15:06 EST
Unless a patch is in-hand this week, I don't see this making RHEL 5.5. I'd like to propose this get deferred to RHEL 5.6 anyway; it sounds like we don't want to mess around with this kernel area post-Beta.
Comment 19 Andrius Benokraitis 2010-02-22 10:53:25 EST
Deferring to RHEL 5.6 due to lack of feedback and out of time for 5.5.
Comment 20 Andrius Benokraitis 2010-07-27 11:23:28 EDT
Closing due to lack of reporter feedback.
