Bug 599487 - [NetApp 5.6 bug] Emulex FC ports offlined on RHEL 5.5 during target controller faults
Keywords:
Status: CLOSED DUPLICATE of bug 655119
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5.z
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 5.6
Assignee: Rob Evers
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 557597
Reported: 2010-06-03 10:45 UTC by Rajashekhar M A
Modified: 2018-12-06 14:35 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-11-24 21:42:11 UTC
Target Upstream Version:
Embargoed:


Attachments
syslog when a port went offline (4.00 MB, application/x-gzip)
2010-06-03 10:55 UTC, Rajashekhar M A
/var/log/messages with enhanced Emulex logging (591.45 KB, application/octet-stream)
2010-06-10 14:17 UTC, Martin George
lpfc patch for RHEL5.5 driver for heartbeat fix (3.93 KB, patch)
2010-07-28 21:59 UTC, James Smart
Kernel panic hit with RHEL 5.5 Emulex driver v8.2.0.63.3p (313.79 KB, application/x-zip-compressed)
2010-08-11 09:39 UTC, Martin George
.config file for the kernel-2.6.18-194.11.1.el5 (65.35 KB, application/octet-stream)
2010-08-13 15:08 UTC, Martin George
/var/log/messages displaying the adapter heartbeat failure (940.34 KB, application/x-zip-compressed)
2010-08-13 21:10 UTC, Martin George
messages files (messages_GA and messages_11.1) with SCSI verbosity enabled (4.47 MB, application/x-gzip)
2010-08-20 12:09 UTC, Rajashekhar M A
heartbeat timeout patch (4.85 KB, patch)
2010-08-25 14:44 UTC, Richard Kennedy
Console logs & /var/log/messages for the above panic scenario (46.12 KB, application/x-zip-compressed)
2010-09-06 20:28 UTC, Martin George

Description Rajashekhar M A 2010-06-03 10:45:54 UTC
Description of problem:

Emulex HBA FC port goes offline when there are controller faults with the following messages in syslog -

Jun  1 17:08:20 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:0459 Adapter heartbeat failure, taking this port offline.
Jun  1 17:08:20 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2
Jun  1 17:08:20 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x23 cannot issue Data: xd00 x2
Jun  1 17:08:20 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2
Jun  1 17:08:20 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2


After a while, we see the following call trace in syslog:


Jun  1 17:16:55 IBMx346-200-114 kernel: end_request: I/O error, dev sdao, sector 0
Jun  1 17:16:55 IBMx346-200-114 kernel: sd 1:0:0:25: SCSI error: return code = 0x00010000
Jun  1 17:16:55 IBMx346-200-114 kernel: end_request: I/O error, dev sdap, sector 0
Jun  1 17:16:55 IBMx346-200-114 kernel: irq 193: nobody cared (try booting with the "irqpoll" option)
Jun  1 17:16:55 IBMx346-200-114 kernel:
Jun  1 17:16:55 IBMx346-200-114 kernel: Call Trace:
Jun  1 17:16:55 IBMx346-200-114 kernel:  <IRQ>  [<ffffffff800bb7b3>] __report_bad_irq+0x30/0x7d
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff800bb9e6>] note_interrupt+0x1e6/0x227
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff800baee2>] __do_IRQ+0xbd/0x103
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff800123da>] __do_softirq+0x89/0x133
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff8006ca11>] do_IRQ+0xe7/0xf5
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff80057289>] mwait_idle+0x0/0x4a
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Jun  1 17:16:55 IBMx346-200-114 kernel:  <EOI>  [<ffffffff800572bf>] mwait_idle+0x36/0x4a
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff80049477>] cpu_idle+0x95/0xb8
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff80405807>] start_kernel+0x220/0x225
Jun  1 17:16:55 IBMx346-200-114 kernel:  [<ffffffff8040522f>] _sinittext+0x22f/0x236
Jun  1 17:16:55 IBMx346-200-114 kernel:
Jun  1 17:16:55 IBMx346-200-114 kernel: handlers:
Jun  1 17:16:55 IBMx346-200-114 kernel: [<ffffffff880d8a50>] (lpfc_sli_intr_handler+0x0/0x15e [lpfc])
Jun  1 17:16:55 IBMx346-200-114 kernel: Disabling IRQ #193
Jun  1 17:17:01 IBMx346-200-114 kernel: sd 1:0:0:2: SCSI error: return code = 0x00010000
Jun  1 17:17:01 IBMx346-200-114 kernel: end_request: I/O error, dev sdm, sector 0
Jun  1 17:17:01 IBMx346-200-114 multipathd: sdm: directio checker reports path is down
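For readers decoding the "SCSI error: return code = 0x00010000" lines above: the SCSI result word packs four one-byte fields (status, message, host, driver). The following is a minimal sketch of that decoding, assuming the usual mid-layer layout and DID_* numbering. The helper and struct names are hypothetical and are not kernel API.

```c
#include <stdint.h>

/* Hypothetical container for the four fields of a SCSI result word. */
struct scsi_result_fields {
    unsigned int status;  /* bits 0-7:   raw SCSI status byte     */
    unsigned int msg;     /* bits 8-15:  message byte             */
    unsigned int host;    /* bits 16-23: host byte (DID_* codes)  */
    unsigned int driver;  /* bits 24-31: driver byte              */
};

/* Split a 32-bit result word into its four one-byte fields. */
static inline struct scsi_result_fields decode_scsi_result(uint32_t result)
{
    struct scsi_result_fields f;

    f.status = result & 0xff;
    f.msg    = (result >> 8) & 0xff;
    f.host   = (result >> 16) & 0xff;
    f.driver = (result >> 24) & 0xff;
    return f;
}

/* Map a host byte to its DID_* name (assumed numbering, first ten codes only). */
static inline const char *host_byte_name(unsigned int host)
{
    static const char *const names[] = {
        "DID_OK", "DID_NO_CONNECT", "DID_BUS_BUSY", "DID_TIME_OUT",
        "DID_BAD_TARGET", "DID_ABORT", "DID_PARITY", "DID_ERROR",
        "DID_RESET", "DID_BAD_INTR",
    };

    if (host < sizeof(names) / sizeof(names[0]))
        return names[host];
    return "unknown";
}
```

Under these assumptions, 0x00010000 decodes to host byte 1 (DID_NO_CONNECT) with all other fields zero, which is consistent with I/O being failed because the HBA port is offline.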


Version-Release number of selected component (if applicable):
kernel-2.6.18-194.3.1.el5
device-mapper-multipath-0.4.7-34.el5_5.1
Emulex FC HBA Driver version: 0:8.2.0.63.3p (Inbox drivers)
HBA: Emulex LPe12002-M8 8Gb 2-port PCIe Fibre Channel Adapter
Firmware revision: 1.11A5 (U3D1.11A5), sli-3


How reproducible:
Intermittent. Seen twice while testing, where a port goes "offline" and corresponding paths to LUNs on the controllers are lost.


Steps to Reproduce:
1. Configure LUNs on a controller and map them to the Host with Emulex HBAs.
2. Configure multipath and start IO.
3. Run controller faults which make some of the target ports go down.

  
Actual results:
On the host, one of the ports goes offline with a call trace in syslog and never recovers from that state. Since the host HBA port goes offline, all the paths through this port to the targets are lost.


Expected results:
Ports should stay online and serve data when faults are not present.


Additional info:
- Attached is the /var/log/messages file, where we can see the call trace at Jun  1 17:16:55
- For HBA and the driver parameters, default settings were used.

Comment 1 Rajashekhar M A 2010-06-03 10:55:47 UTC
Created attachment 419331 [details]
syslog when a port went offline

Attaching the syslog messages file.

Comment 2 Martin George 2010-06-03 11:46:38 UTC
Looks similar to RHEL 5.4 bug 549906

Comment 3 Martin George 2010-06-09 08:04:18 UTC
Any updates on this?

Comment 4 Martin George 2010-06-09 08:07:13 UTC
Any updates on this?

Comment 5 Andrius Benokraitis 2010-06-09 15:09:43 UTC
So did this work in 5.5 GA and then broke in a 5.5.z? When was the last time this worked?

Comment 6 Rajashekhar M A 2010-06-10 07:18:05 UTC
No, this was seen with 5.5 GA also.

Comment 7 Andrius Benokraitis 2010-06-10 13:50:26 UTC
When was the last time this worked?

Comment 8 Martin George 2010-06-10 14:13:49 UTC
(In reply to comment #7)
> When was the last time this worked?    

Seems this worked fine in the RHEL 5.4 errata kernel v2.6.18-179.el5 as described in bug 516541.

Comment 9 Martin George 2010-06-10 14:17:07 UTC
Created attachment 422926 [details]
/var/log/messages with enhanced Emulex logging

Attaching the /var/log/messages for the port offline issue with lpfc_log_verbose set to 0xfefb.

Comment 11 Andrius Benokraitis 2010-06-10 14:23:25 UTC
This sounds like a possible issue with lpfc

Comment 12 Rob Evers 2010-06-10 17:52:29 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > When was the last time this worked?    
> 
> Seems this worked fine in the RHEL 5.4 errata kernel v2.6.18-179.el5 as
> described in bug 516541.    

In attachment 1 [details], many soft lockups are present, some directly before the lpfc adapter stops:

 Jun  1 12:29:50 IBMx346-200-114 kernel: RIP: 0010:[<ffffffff80076659>]  [<ffffffff80076659>] __smp_call_function_many+0x96/0xbc

Jun  1 12:29:50 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:0459 Adapter heartbeat failure, taking this port offline.
Jun  1 12:29:50 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2
Jun  1 12:29:50 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x23 cannot issue Data: xd00 x2
Jun  1 12:29:50 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2
Jun  1 12:29:50 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2

Are the soft lockups present in rhel 5.4 as well?

Comment 13 Rob Evers 2010-06-10 18:00:47 UTC
This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=592018:

https://bugzilla.redhat.com/show_bug.cgi?id=592018#c1

Suggest changing lpfc_use_msi to 2 and retrying test.

Comment 14 Rob Evers 2010-06-14 18:49:17 UTC
Vaios,

Do you have any intention of reverting the default lpfc_use_msi value from 0 to 2 for rhel6 and/or rhel5.6?

It appears to be causing regressions.

Rob

Comment 15 Vaios Papadimitriou 2010-06-14 19:21:49 UTC
In the RHEL 5.4 release, the in-box LPFC driver 8.2.0.48.2p used MSI-X as the default interrupt mechanism (lpfc_use_msi=2).

Starting with the RHEL 5.5 in-box driver, 8.2.0.63.3p, we moved to INT-X as the default interrupt mechanism (lpfc_use_msi=0).

The reason was that we found a lot of hardware, especially older established hardware, that fails if MSI-X starts up, and INT-X was the existing mechanism that had worked until then.

However it appears we are going to lose in either case.
 
a)	There have been many cases with older hardware that simply fails if MSI-X starts up.
b)	There are newer systems, or virtualization cases, as shown in this BZ as well as 592018, which only allow MSI-X use, so they fail if INT-X is the default.
c)	In the previous implementation (MSI-X by default), we tried MSI-X first, then MSI if that failed, then INT-X if that failed too. However, there were some cases where this doesn't work: either the adapter appears to fail, or the MSI-X attempt gives every indication of working until the system locks up.

Given the above we decided to back off and had everything come up INT-X.

In hardware that mandates MSI-X, as it appears this is the case here, a user should use the “lpfc_use_msi=2” LPFC driver module parameter in /etc/modprobe.conf, and rebuild the initrd, so the LPFC driver will always load w/ MSI-X.

We don't anticipate at this time moving the default interrupt mechanism to MSI-X.

Comment 16 Martin George 2010-06-15 11:13:16 UTC
(In reply to comment #13)
> This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=592018:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=592018#c1
> 
> Suggest changing lpfc_use_msi to 2 and retrying test.    

Yes, setting lpfc_use_msi to 2 seems to have helped. We no longer see the port offline issue now on the RHEL 5.5 host.

Comment 17 Martin George 2010-06-15 11:24:17 UTC
(In reply to comment #15)
> 
> In hardware that mandates MSI-X, as it appears this is the case here, a user
> should use the “lpfc_use_msi=2” LPFC driver module parameter in
> /etc/modprobe.conf, and rebuild the initrd, so the LPFC driver will always load
> w/ MSI-X.
> 
> We don't anticipate at this time moving the default interrupt mechanism to
> MSI-X.   

We hit this issue on a RHEL 5.5 IBM x346 host with motherboard BIOS v1.17.

So could you specify clearly which host architectures with the Emulex adapters require the INT-X or the MSI-X based interrupt mechanisms?

Ideally the lpfc driver itself should be intelligent enough to automatically use either the INT-X or MSI-X based interrupt mechanisms depending on the host architectures, instead of relying on the user to set the lpfc_use_msi option manually.

Comment 18 Andrius Benokraitis 2010-06-28 13:09:43 UTC
Emulex: I believe this is in your court. Sounds like NetApp is asking for additional intelligence, but not sure you are OK with that.

Comment 19 Rob Evers 2010-06-28 13:50:25 UTC
Is it possible for the lpfc driver to determine which combinations of hba and host require msi-x or non-msi-x mode and set the mode appropriately?

Otherwise, can Emulex provide the needed info Martin asked for regarding which hba/host combinations need to have msi-x mode set?

Comment 20 James Smart 2010-06-28 15:04:54 UTC
It is not possible for us to specify combinations as: a) we don't have access to all platforms, nor are all platforms going to be tested; and b) even for the platforms tested, the OEMs do not like us sharing "bad behavior", especially with specifics such as model numbers, etc. It's treated as confidential information (which really hurts with a bug like this).

The lpfc driver does do every intelligent thing it can do to test the environment. If configured for MSIX, it creates/enables the interrupt state and then generates a test interrupt to check to see the system is delivering it. If it does not function, it backs down to MSI mode, does the same type of test, and if that fails then it enables INTx mode, and tests it too. If the last test fails, we're SOL.  The problem that we've been encountering is that the system thinks it supports MSIX, and the test seems to report success, but later, once the system starts to stress the interrupt controller, or have multiple events to service simultaneously, it locks up.  We've seen many of these bugs fixed by pci "fixups" for the pci chip set or bridge.  We hit this problem enough, especially on older hardware that "can't regress", that we had to back off from msix as the default.  We're in a catch-22 as newer hardware, and things like function pass-thru with VMs, is requiring MSIX.

The lpfc driver doesn't support a reverse flow (INTx first, then msi, then msix), as most things are supposed to always support INTx, so it wouldn't achieve much. Maybe on the newer platforms, but that's still very TBD.
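The fallback order described in this comment (MSI-X, then MSI, then INTx, each confirmed by a test interrupt) can be sketched in user-space C as follows. This is only a model under stated assumptions, not the lpfc source: `select_intr_mode`, `intr_probe_fn` and `probe_from_mask` are hypothetical names, and the numeric values mirror lpfc_use_msi as used in this thread (0 for INTx, 2 for MSI-X), with 1 for MSI assumed.

```c
#include <stdbool.h>

/* Values chosen to mirror lpfc_use_msi (1 = MSI is an assumption). */
enum intr_mode {
    INTR_NONE = -1,  /* nothing worked */
    INTR_INTX = 0,
    INTR_MSI  = 1,
    INTR_MSIX = 2
};

/*
 * Hypothetical probe callback: enable the given mode, raise a test
 * interrupt, and report whether the system delivered it.
 */
typedef bool (*intr_probe_fn)(enum intr_mode mode, void *ctx);

/* Start at the requested mode and back down one step per failed probe. */
static inline enum intr_mode select_intr_mode(enum intr_mode requested,
                                              intr_probe_fn probe, void *ctx)
{
    for (int m = requested; m >= INTR_INTX; m--) {
        if (probe((enum intr_mode)m, ctx))
            return (enum intr_mode)m;
    }
    return INTR_NONE;
}

/* Stand-in probe for experiments: ctx points to a bitmask of working modes. */
static inline bool probe_from_mask(enum intr_mode mode, void *ctx)
{
    const unsigned int *working = ctx;

    return ((*working >> (unsigned int)mode) & 1u) != 0;
}
```

Note what such a loop cannot catch: the case described above, where the MSI-X test interrupt is delivered successfully and the lockup only appears later under load. There is also no upward path from INTx, matching the statement that a reverse flow is not supported.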

Comment 21 James Smart 2010-06-28 15:20:27 UTC
Re: "nor are all platforms going to be tested"...  nor can we afford (or know to test) all the different f/w releases on the different platforms.

Also, for a list of adapters vs interrupt mode:
Our CNAs only support INTx and MSIX.
lp12xxx and higher adapters (8Gig+) support MSIX, MSI, INTx.
lp98xx, lp10xxx and lp11xxx adapters support MSI and INTx.
Adapters earlier than this (which have been EOL'd) are INTx only.
If lpfc selects MSI, it will use single-vector only.
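The adapter list above can be written down as a small lookup, which may be handy when scripting a sanity check of an lpfc_use_msi setting against the installed HBA. This is a sketch only: the family grouping follows the list in this comment, while the enum and function names are hypothetical.

```c
/* Capability bits for the interrupt modes named in the list above. */
enum {
    CAP_INTX = 1 << 0,
    CAP_MSI  = 1 << 1,
    CAP_MSIX = 1 << 2
};

/* Hypothetical grouping of adapters, following the comment's wording. */
enum adapter_family {
    FAMILY_EOL_LEGACY,        /* adapters earlier than lp98xx (EOL'd) */
    FAMILY_LP98XX_TO_LP11XXX, /* lp98xx, lp10xxx, lp11xxx             */
    FAMILY_LP12XXX_PLUS,      /* lp12xxx and higher (8Gig+)           */
    FAMILY_CNA                /* CNAs                                 */
};

/* Return the set of interrupt modes the family supports, as CAP_* bits. */
static inline int supported_intr_modes(enum adapter_family family)
{
    switch (family) {
    case FAMILY_CNA:
        return CAP_INTX | CAP_MSIX;
    case FAMILY_LP12XXX_PLUS:
        return CAP_INTX | CAP_MSI | CAP_MSIX;
    case FAMILY_LP98XX_TO_LP11XXX:
        return CAP_INTX | CAP_MSI;
    case FAMILY_EOL_LEGACY:
        return CAP_INTX;
    }
    return 0;
}
```

For the LPe12002 in this report, the lookup would place the adapter in the lp12xxx-and-higher group, so all three modes are available on the adapter side; whether the host platform handles them correctly is the open question in this bug.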

Comment 22 Martin George 2010-06-29 17:26:35 UTC
(In reply to comment #20)
> It is not possible for us to specify combinations as: a) we don't have access
> to all platforms, nor are all platforms going to be tested; and b) even for the
> platforms tested, the OEMs do not like us sharing "bad behavior", especially
> with specifics such as model numbers, etc. It's treated as confidential
> information (which really hurts with a bug like this).
> 

Well, then what do we recommend customers to do for the above issue? i.e. when do we tell a customer to enable/disable the lpfc_use_msi setting? It obviously can't be a trial & error method.

Comment 23 James Smart 2010-06-29 18:53:13 UTC
Before we get carried too deep into this: I've suspected that this bug is not MSIX-based, but rather caused by the lpfc heartbeat implementation. I've been having the Emulex team look further into this. It corresponds to the heartbeat command getting stuck behind discovery commands, which causes the heartbeat timer to fire erroneously. I'm guessing that moving the driver to MSIX meant moving us to an interrupt handler with less contention, thus better timing, thus the timer not firing.

I'd like to propose a quick test, if possible. The lpfc heartbeat can be turned off via the module parameter "lpfc_enable_hba_heartbeat=0" (it is 1 by default). This can be specified via /etc/modprobe.conf or via the command line. If the driver can be loaded with this option and the test rerun, I'd like to know the results. In the meantime, the driver team is looking closer at the heartbeat logic.

As for MSIX - I agree, not the position I want to be in.
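The suspected mechanism in this comment, a heartbeat command waiting behind discovery commands until its timer expires, is ordinary head-of-line blocking in a FIFO queue. The toy model below illustrates only that arithmetic. It assumes commands are serviced strictly one at a time, and every name and number in it is hypothetical; it does not reflect the actual lpfc mailbox code or its real timeout values.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Time at which a newly queued command completes, given the service
 * times (in ms) of the commands already ahead of it in a FIFO queue.
 */
static inline unsigned long completion_time_ms(const unsigned int *ahead_ms,
                                               size_t ahead,
                                               unsigned int own_ms)
{
    unsigned long total = 0;

    for (size_t i = 0; i < ahead; i++)
        total += ahead_ms[i];
    return total + own_ms;
}

/* True if the queued command would finish only after its timer fired. */
static inline bool heartbeat_would_expire(const unsigned int *ahead_ms,
                                          size_t ahead,
                                          unsigned int own_ms,
                                          unsigned long timeout_ms)
{
    return completion_time_ms(ahead_ms, ahead, own_ms) > timeout_ms;
}
```

With made-up figures, a 10 ms heartbeat behind three 2000 ms commands completes at 6010 ms and would trip a 5000 ms timer even though the adapter is healthy, while the same heartbeat on an empty queue would not. Faster interrupt servicing shortens the service times, which fits the guess that MSIX merely hides the problem.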

Comment 24 Martin George 2010-07-01 14:14:45 UTC
Disabling lpfc heartbeat seems to have helped - not hit the port offline issue so far. 

So what's the impact on disabling this parameter? Is it something that can be recommended to customers as a workaround for this issue?

Comment 25 James Smart 2010-07-03 10:52:39 UTC
Same status still (w/ heartbeat off, it's still running) ?

The impact is that if there ever were a catastrophic hardware error on the adapter, the driver may take longer to figure it out. If the hardware is truly in such a catastrophic state, there's no guarantee the pci bus was left happy, which will show the problem in other ways. So it's not like you won't detect this. Given that Emulex only added this when the ASIC started adding temperature reporting (which will report before it fails), and that we have roughly zero (as far as I know) reported failures due to heartbeat-detected errors, the impact is purely one of risk, where I estimate the risk to be extremely low.

In the meantime, we will be looking to correct the heartbeat logic.

Comment 26 Andrius Benokraitis 2010-07-14 14:24:57 UTC
So all this being said, is this bugzilla primarily waiting on Emulex to provide a fix at some point? If so, will it be in the 5.6 timeframe?

Comment 27 kugesh 2010-07-20 11:01:17 UTC
Is there a release schedule from Emulex for this issue?

Comment 28 James Smart 2010-07-28 21:59:41 UTC
Created attachment 435137 [details]
lpfc patch for RHEL5.5 driver for heartbeat fix

Here's a test patch that corrects the heartbeat.

Comment 30 Martin George 2010-08-11 09:31:34 UTC
I'm not sure if this is related to the heartbeat issue, but I hit a kernel panic on a RHEL 5.5 Emulex host (lpfc driver v8.2.0.63.3p) with Mike Snitzer's ALUA debug kernel as described in 
https://bugzilla.redhat.com/show_bug.cgi?id=606259#c64 . This kernel only included a patched ALUA SCSI handler (scsi_dh_alua.c) along with some debug print messages as described in bug 619361. 

And I followed the same Emulex recommendation of turning off the adapter heartbeat parameter i.e. lpfc_enable_hba_heartbeat to 0. The panic was hit while running IO with faults on the Emulex host and it seems to indicate the Emulex driver at fault:

Kernel BUG at drivers/scsi/lpfc/lpfc_scsi.c:2206
invalid opcode: 0000 [1] SMP 
last sysfs file: /block/dm-14/dev
CPU 3 
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth
lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr
iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core
cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi
video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery
asus_acpi acpi_memhotplug ac parport_pc lp parport sg floppy tg3 pcspkr
i2c_i801 i2c_core e752x_edac edac_mc ide_cd serio_raw cdrom dm_raid45
dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac
scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod
ata_piix libata shpchp lpfc scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd
ohci_hcd ehci_hcd
Pid: 438, comm: scsi_eh_0 Not tainted 2.6.18-194.11.1.el5.alua_dbg #1
RIP: 0010:[<ffffffff880ff793>]  [<ffffffff880ff793>]
:lpfc:lpfc_abort_handler+0x58/0x33d
RSP: 0018:ffff81007e705dd0  EFLAGS: 00010246
RAX: ffff81003531a680 RBX: ffff81003531a680 RCX: ffff81007e705e90
RDX: ffff81007e705e90 RSI: ffff81003531a698 RDI: ffff81007e6eb050
RBP: ffff81007e624000 R08: ffff81007e704000 R09: 000000000000003c
R10: ffff810002390a90 R11: ffffffff880ff73b R12: 0000000000000000
R13: 0000000000000282 R14: ffff81007e401b58 R15: ffffffff800a07c0
FS:  0000000000000000(0000) GS:ffff8100026ca6c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b081fcca1d0 CR3: 0000000013601000 CR4: 00000000000006e0
Process scsi_eh_0 (pid: 438, threadinfo ffff81007e704000, task
ffff81007f949100)
Stack:  ffff81003531a680 ffff81007e6eb000 ffff81007e6eb4f8 000020023b9aca00
 ffff810000000000 0000000300000001 ffff81007e705e00 ffff81007e705e00
 0000958c9102002a 0000000000000018 ffff810000000001 ffff81007e705e28
Call Trace:
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff880791a4>] :scsi_mod:scsi_error_handler+0x290/0x4ac
 [<ffffffff88078f14>] :scsi_mod:scsi_error_handler+0x0/0x4ac
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 0f 0b 68 dd e0 11 88 c2 9e 08 4d 8b 7c 24 10 4c 3b 3c 24 0f 
RIP  [<ffffffff880ff793>] :lpfc:lpfc_abort_handler+0x58/0x33d
 RSP <ffff81007e705dd0>
 <0>Kernel panic - not syncing: Fatal exception

Comment 31 Martin George 2010-08-11 09:39:04 UTC
Created attachment 438144 [details]
Kernel panic hit with RHEL 5.5 Emulex driver v8.2.0.63.3p

Comment 32 Richard Kennedy 2010-08-13 14:31:57 UTC
The driver is crashing in the DECLARE_WAIT_QUEUE macros. Can you provide the configuration for that debug kernel? The declarations at the top of the routine are:
static int
lpfc_abort_handler(struct scsi_cmnd *cmnd)
{
        struct Scsi_Host  *shost = cmnd->device->host;
        struct lpfc_vport *vport = (struct lpfc_vport *) shost->hostdata;
        struct lpfc_hba   *phba = vport->phba;
        struct lpfc_iocbq *iocb;
        struct lpfc_iocbq *abtsiocb;
        struct lpfc_scsi_buf *lpfc_cmd;
        IOCB_t *cmd, *icmd;
        int ret = SUCCESS;
#ifdef DECLARE_WAIT_QUEUE_HEAD_ONSTACK
        DECLARE_WAIT_QUEUE_HEAD_ONSTACK(waitq);
#else
        DECLARE_WAIT_QUEUE_HEAD(waitq);
#endif


The kernel has them defined as:
#ifdef CONFIG_LOCKDEP
# define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
        ({ init_waitqueue_head(&name); name; })
# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
        wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
#else
# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
#endif

Can you tell us if CONFIG_LOCKDEP=y?

Can you make sure that there is nothing defined in this routine after the declare defines?

Comment 33 Martin George 2010-08-13 15:07:31 UTC
CONFIG_LOCKDEP is indeed set to 'y'.

I am using the latest RHEL 5.5 Errata kernel kernel-2.6.18-194.11.1.el5 (source package available at https://rhn.redhat.com/rhn/software/packages/details/Overview.do?pid=567665). The config settings are the default ones.

And the only patch that I have applied on top of it is the one Mike Snitzer provided in https://bugzilla.redhat.com/show_bug.cgi?id=619361#c1 .

Comment 34 Martin George 2010-08-13 15:08:30 UTC
Created attachment 438696 [details]
.config file for the kernel-2.6.18-194.11.1.el5

Comment 35 Martin George 2010-08-13 21:08:19 UTC
(In reply to comment #28)
> Created an attachment (id=435137) [details]
> lpfc patch for RHEL5.5 driver for heartbeat fix
> 
> Here's a test patch that corrects the heartbeat.    

The test patch does not help. I still hit the port offline issue with the latest 2.6.18-194.11.1.el5 RHEL 5.5 Errata kernel (added 2 additional patches to it - the Emulex heartbeat patch described in the above comment #28 & Mike Snitzer's ALUA handler patch described at https://bugzilla.redhat.com/show_bug.cgi?id=619361#c1) during IO with faults:

Aug 14 00:42:38 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:0459 Adapter heartbeat failure, taking this port offline.
Aug 14 00:42:38 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2
Aug 14 00:42:38 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x23 cannot issue Data: xd00 x2
Aug 14 00:42:38 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2
Aug 14 00:42:38 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2

And this was run with the original default setting of the heartbeat parameter i.e. 'lpfc_enable_hba_heartbeat' set to 1.

Comment 36 Martin George 2010-08-13 21:10:25 UTC
Created attachment 438759 [details]
/var/log/messages displaying the adapter heartbeat failure

Comment 37 Martin George 2010-08-16 09:26:42 UTC
Since the kernel panic seems to be a different issue, I think it makes sense to track it separately. Will file a new bug for it.

Comment 38 Martin George 2010-08-16 10:12:49 UTC
Filed bug 624394 for the panic issue. 

And the current bug 599487 will be used for tracking the port offline issue alone.

Comment 39 Rajashekhar M A 2010-08-20 12:07:44 UTC
Updating the bugzilla with more SCSI mid-layer verbosity logs.

We tried the GA kernel + the patch in comment 28 with verbosity turned on. After running some faults, the host became unresponsive, and it also seems to have mounted root as read-only. Attached are the logs (messages_GA).

The host was rebooted at "Aug 20 16:46:59".

We tried the same tests with the kernel-2.6.18-194.11.1.el5 + Patch in comment 28, we believe that we could reproduce the issue though the host became unresponsive (but not hung, since we could see messages being logged on serial console). The messages file (messages_11.1) was collected after rebooting the host. We could not collect any command outputs. In the logs we saw the message -

Aug 19 03:34:57 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:0459 Adapter heartbeat failure, taking this port offline. 

The above message comes when a port is offlined.

The verbosity was increased with the lowest possible log level (to prevent the host from going unresponsive, which didn't help) -

SCSI_LOG_ERROR=0
SCSI_LOG_TIMEOUT=0
SCSI_LOG_SCAN=0
SCSI_LOG_MLQUEUE=1
SCSI_LOG_MLCOMPLETE=1
SCSI_LOG_LLQUEUE=0
SCSI_LOG_LLCOMPLETE=0
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0 

Please let us know if the logs were useful. If not, please let us know specific log levels to enable.
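For anyone reproducing this logging setup, the per-facility levels listed above are combined into a single bitmask, with each facility occupying a 3-bit field. The sketch below shows that packing. The shift positions are assumed to follow the kernel's drivers/scsi/scsi_logging.h and should be checked against the kernel in use; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed 3-bit field positions, one per SCSI logging facility. */
enum {
    LOG_ERROR_SHIFT      = 0,
    LOG_TIMEOUT_SHIFT    = 3,
    LOG_SCAN_SHIFT       = 6,
    LOG_MLQUEUE_SHIFT    = 9,
    LOG_MLCOMPLETE_SHIFT = 12,
    LOG_LLQUEUE_SHIFT    = 15,
    LOG_LLCOMPLETE_SHIFT = 18,
    LOG_HLQUEUE_SHIFT    = 21,
    LOG_HLCOMPLETE_SHIFT = 24,
    LOG_IOCTL_SHIFT      = 27
};

/* Hypothetical holder for the ten per-facility levels (0-7 each). */
struct scsi_log_levels {
    unsigned int error, timeout, scan;
    unsigned int mlqueue, mlcomplete;
    unsigned int llqueue, llcomplete;
    unsigned int hlqueue, hlcomplete;
    unsigned int ioctl;
};

/* Pack the levels into the single 32-bit logging-level bitmask. */
static inline uint32_t scsi_logging_level(const struct scsi_log_levels *l)
{
    return ((uint32_t)(l->error      & 7u) << LOG_ERROR_SHIFT) |
           ((uint32_t)(l->timeout    & 7u) << LOG_TIMEOUT_SHIFT) |
           ((uint32_t)(l->scan       & 7u) << LOG_SCAN_SHIFT) |
           ((uint32_t)(l->mlqueue    & 7u) << LOG_MLQUEUE_SHIFT) |
           ((uint32_t)(l->mlcomplete & 7u) << LOG_MLCOMPLETE_SHIFT) |
           ((uint32_t)(l->llqueue    & 7u) << LOG_LLQUEUE_SHIFT) |
           ((uint32_t)(l->llcomplete & 7u) << LOG_LLCOMPLETE_SHIFT) |
           ((uint32_t)(l->hlqueue    & 7u) << LOG_HLQUEUE_SHIFT) |
           ((uint32_t)(l->hlcomplete & 7u) << LOG_HLCOMPLETE_SHIFT) |
           ((uint32_t)(l->ioctl      & 7u) << LOG_IOCTL_SHIFT);
}
```

Under these assumptions, the settings in this comment (MLQUEUE=1, MLCOMPLETE=1, everything else 0) pack to 0x1200, i.e. 4608 decimal.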

Comment 40 Rajashekhar M A 2010-08-20 12:09:38 UTC
Created attachment 439927 [details]
messages files (messages_GA and messages_11.1) with SCSI verbosity enabled

Attaching the log files.

Comment 42 Richard Kennedy 2010-08-25 14:44:24 UTC
Created attachment 440961 [details]
heartbeat timeout patch

Remove the previous patch that I gave you before applying this because the old patch is part of this one. cd to the lpfc directory and patch -p 0 < heartbeat.patch

Comment 43 Tom Coughlan 2010-09-03 14:22:34 UTC
(In reply to comment #42)
> Created attachment 440961 [details]
> heartbeat timeout patch

We are overdue for submitting 5.6 kernel patches. Can the NetApp folks try this patch, to see if it solves the problem with heartbeat enabled?

Comment 44 Martin George 2010-09-06 20:24:15 UTC
(In reply to comment #42)
> Created attachment 440961 [details]
> heartbeat timeout patch
> 
> Remove the previous patch that I gave you before applying this because the old
> patch is part of this one. cd to the lpfc directory and patch -p 0 <
> heartbeat.patch

This new patch also did not help. I hit a kernel panic at lpfc_scsi_cmd_iocb_cmpl on the Emulex host as follows:

lpfc 0000:03:00.0: 0:0310 Mailbox command x5 timeout Data: x0 x700 xffff810058e67c00
lpfc 0000:03:00.0: 0:0345 Resetting board due to mailbox timeout
lpfc 0000:03:00.0: 0:(0):2530 Mailbox command x23 cannot issue Data: xd00 x2
Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP: 
 [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/rport-0:0-2/target0:0:0/0:0:0:1/timeout
CPU 3 
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg floppy i2c_i801 i2c_core ide_cd tg3 cdrom pcspkr e752x_edac edac_mc serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp lpfc scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 17, comm: events/3 Not tainted 2.6.18-194.11.1.el5.lpfc.heartbeat #1
RIP: 0010:[<ffffffff8810052d>]  [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
RSP: 0018:ffff81007ff6b948  EFLAGS: 00010286
RAX: 000000000000001e RBX: ffff81000bbf1500 RCX: 0000000000000000
RDX: ffff81007acf24c0 RSI: 0000000000000220 RDI: ffff81007acf2540
RBP: 0000000000000000 R08: ffffffff80311da8 R09: ffff810078b81188
R10: ffff81007e34fba8 R11: 000000000000000a R12: 0000000000001000
R13: ffff81007acf24c0 R14: 00000000040a0000 R15: 0000000000000016
FS:  0000000000000000(0000) GS:ffff8100026ca6c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 0000000000201000 CR4: 00000000000006e0
Process events/3 (pid: 17, threadinfo ffff8100378de000, task ffff81007ff47080)
Stack:  0000000000000000 000000000000000a ffff81007e34fba8 ffff810000000000
 ffff81007ff6ba78 ffff81007e5e8000 ffffffff80071a88 0000000000000001
 ffffffff882baddb ffff81007e34e400 ffff81007e6994f8 ffff81007e69b600
Call Trace:
 <IRQ>  [<ffffffff80071a88>] nommu_map_single+0x24/0x33
 [<ffffffff882baddb>] :tg3:tg3_start_xmit_dma_bug+0x85d/0x90b
 [<ffffffff880d563d>] :lpfc:lpfc_sli_handle_fast_ring_event+0x40b/0x60f
 [<ffffffff8002f972>] dev_queue_xmit+0x250/0x271
 [<ffffffff80031f5a>] ip_output+0x29a/0x2dd
 [<ffffffff80046e4c>] try_to_wake_up+0x472/0x484
 [<ffffffff8003d9f4>] lock_timer_base+0x1b/0x3c
 [<ffffffff8026ada7>] fn_hash_lookup+0x79/0xb2
 [<ffffffff8015081f>] __next_cpu+0x19/0x28
 [<ffffffff880d58df>] :lpfc:lpfc_sli_fp_intr_handler+0x9e/0x107
 [<ffffffff880d8bbb>] :lpfc:lpfc_sli_intr_handler+0x122/0x15e
 [<ffffffff80010bab>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800bae28>] __do_IRQ+0xa4/0x103
 [<ffffffff8006ca11>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff80064b50>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff800efc7a>] aio_complete+0x1ef/0x1fd
 [<ffffffff800f44c8>] dio_bio_end_aio+0x9f/0xbf
 [<ffffffff8002cc88>] __end_that_request_first+0x23c/0x5bf
 [<ffffffff8005c17b>] blk_run_queue+0x28/0x73
 [<ffffffff88079fe5>] :scsi_mod:scsi_end_request+0x27/0xcd
 [<ffffffff8807a1d9>] :scsi_mod:scsi_io_completion+0x14e/0x324
 [<ffffffff880a7802>] :sd_mod:sd_rw_intr+0x252/0x28c
 [<ffffffff8807a46e>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff800dca7c>] cache_reap+0x0/0x217
 [<ffffffff80037c1d>] blk_done_softirq+0x5f/0x6d
 [<ffffffff800123b4>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb8e>] do_softirq+0x2c/0x85
 [<ffffffff8006ca16>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff800dcb16>] cache_reap+0x9a/0x217
 [<ffffffff8004d624>] run_workqueue+0x94/0xe4
 [<ffffffff80049e5f>] worker_thread+0x0/0x122
 [<ffffffff80049f4f>] worker_thread+0xf0/0x122
 [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 48 8b 45 10 49 89 45 3c 48 8b 45 18 49 89 45 44 8a 83 c2 00 
RIP  [<ffffffff8810052d>] :lpfc:lpfc_scsi_cmd_iocb_cmpl+0x9ed/0x137d
 RSP <ffff81007ff6b948>
CR2: 0000000000000010
 <0>Kernel panic - not syncing: Fatal exception

Comment 45 Martin George 2010-09-06 20:28:49 UTC
Created attachment 443360 [details]
Console logs & /var/log/messages for the above panic scenario

lpfc log verbose was set to 0x1004 in the above logs.

Comment 46 Martin George 2010-09-07 14:30:24 UTC
Given all the open issues in the lpfc heartbeat handling, wouldn't it be better if this could be turned off by default in the upcoming 5.5.z and 5.6 releases? Backed by comment #25, that seems to be a safe suggestion.

Comment 47 Richard Kennedy 2010-09-08 17:23:13 UTC
This not the same crash as before. The previous crash (bugzilla 624394) failed in lpfc_abort_handler because the host_scribble pointer was nullled out so we could not find our IO related structures.

This crash is in lpfc_send_scsi_error_event: we did not check the pnode pointer before dereferencing it.
 memcpy(&fast_path_evt->un.check_cond_evt.scsi_event.wwpn,
                        &pnode->nlp_portname, sizeof(struct lpfc_name));
This should be a separate bugzilla.
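
As a rough illustration of the missing check described above (a minimal sketch, not the actual driver fix): the struct layouts below are simplified, hypothetical stand-ins for the lpfc definitions, and fill_event_wwpn is an invented helper name. The only point is that the copy is skipped when pnode is NULL.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the lpfc structures; the real
 * definitions in the driver headers are much larger. */
struct lpfc_name { uint8_t wwn[8]; };
struct lpfc_nodelist { struct lpfc_name nlp_portname; };
struct scsi_event { struct lpfc_name wwpn; };

/* Copy the port name into the event only when the node pointer is valid.
 * Returns 0 on success, -1 when pnode is NULL and nothing was copied. */
static int fill_event_wwpn(struct scsi_event *evt,
                           const struct lpfc_nodelist *pnode)
{
    if (pnode == NULL)
        return -1;

    memcpy(&evt->wwpn, &pnode->nlp_portname, sizeof(struct lpfc_name));
    return 0;
}
```

How the real code should react to a NULL pnode (skip the event, log it, or fail the command) is a driver design decision that this sketch does not try to answer.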


As for the mailbox timeout, the console shows three stuck CPU 0 warnings prior to the mailbox timeout (see below). I cannot tell why CPU 0 is stuck, but the math works out to 30 seconds, which is our mailbox timeout value.
Would it be possible to turn sysrq on (echo 1 > /proc/sys/kernel/sysrq), then run echo t > /proc/sysrq-trigger, followed by echo d > /proc/sysrq-trigger and echo l > /proc/sysrq-trigger, when you see the first stuck CPU warning?

I think something is locking up cpu 0 and preventing us from completing the mailbox command.
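
The arithmetic behind that estimate can be written out as a small sketch. The 10-second figure comes from the "stuck for 10s" warnings and the 30-second figure from the mailbox timeout value quoted above; the macro and function names are hypothetical, and treating the warnings as back-to-back stalls is an assumption that follows the reasoning in this comment.

```c
/* Values taken from the thread; names are hypothetical. */
#define SOFT_LOCKUP_SECONDS      10  /* each warning: "stuck for 10s" */
#define MAILBOX_TIMEOUT_SECONDS  30  /* mailbox timeout quoted above */

/* Total stall implied by a number of back-to-back soft lockup warnings. */
static unsigned int stalled_seconds(unsigned int warnings)
{
    return warnings * SOFT_LOCKUP_SECONDS;
}

/* Nonzero when the implied stall is at least as long as the mailbox timeout. */
static int stall_covers_mailbox_timeout(unsigned int warnings)
{
    return stalled_seconds(warnings) >= MAILBOX_TIMEOUT_SECONDS;
}
```

With the three warnings in the log, stalled_seconds(3) matches the timeout exactly, which is why a stuck CPU 0 is a plausible explanation for the mailbox command not completing in time.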


end_request: I/O error, dev sdcb, sector 5265384
BUG: soft lockup - CPU#0 stuck for 10s! [multipathd:9266]
CPU 0:
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg floppy i2c_i801 i2c_core ide_cd tg3 cdrom pcspkr e752x_edac edac_mc serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp lpfc scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 9266, comm: multipathd Not tainted 2.6.18-194.11.1.el5.lpfc.heartbeat #1
RIP: 0010:[<ffffffff8007664e>]  [<ffffffff8007664e>] __smp_call_function_many+0x96/0xbc
RSP: 0018:ffff81006fd07bf8  EFLAGS: 00000297
RAX: 0000000000000002 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 00000000000000ff RSI: 00000000000000ff RDI: 00000000000000c0
RBP: ffff81007dec7500 R08: 0000000000000004 R09: 000000000000003c
R10: ffff81006fd07b98 R11: 302041203432323a R12: ffffffff80022176
R13: 0000000000000292 R14: 000000001b001940 R15: ffff810079bc2588
FS:  0000000041ded940(0063) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000001b001940 CR3: 00000000761a0000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076750>] smp_call_function_many+0x38/0x4c
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076841>] smp_call_function+0x4e/0x5e
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8819d4ad>] :dm_mod:dev_wait+0x0/0x83
 [<ffffffff800946f5>] on_each_cpu+0x10/0x22
 [<ffffffff800d1292>] __remove_vm_area+0x2b/0x42
 [<ffffffff800d12c1>] remove_vm_area+0x18/0x25
 [<ffffffff800d1315>] __vunmap+0x47/0xed
 [<ffffffff8819deff>] :dm_mod:ctl_ioctl+0x237/0x25b
 [<ffffffff800420f2>] do_ioctl+0x55/0x6b
 [<ffffffff80030175>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800a411e>] sys_futex+0x10a/0x12b
 [<ffffffff8004c5a4>] sys_ioctl+0x59/0x78
 [<ffffffff8005d116>] system_call+0x7e/0x83

BUG: soft lockup - CPU#0 stuck for 10s! [multipathd:9266]
CPU 0:
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg floppy i2c_i801 i2c_core ide_cd tg3 cdrom pcspkr e752x_edac edac_mc serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp lpfc scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 9266, comm: multipathd Not tainted 2.6.18-194.11.1.el5.lpfc.heartbeat #1
RIP: 0010:[<ffffffff8007664e>]  [<ffffffff8007664e>] __smp_call_function_many+0x96/0xbc
RSP: 0018:ffff81006fd07bf8  EFLAGS: 00000297
RAX: 0000000000000002 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 00000000000000ff RSI: 00000000000000ff RDI: 00000000000000c0
RBP: ffff81007dec7500 R08: 0000000000000004 R09: 000000000000003c
R10: ffff81006fd07b98 R11: 302041203432323a R12: ffffffff80022176
R13: 0000000000000292 R14: 000000001b001940 R15: ffff810079bc2588
FS:  0000000041ded940(0063) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000001b001940 CR3: 00000000761a0000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076750>] smp_call_function_many+0x38/0x4c
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076841>] smp_call_function+0x4e/0x5e
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8819d4ad>] :dm_mod:dev_wait+0x0/0x83
 [<ffffffff800946f5>] on_each_cpu+0x10/0x22
 [<ffffffff800d1292>] __remove_vm_area+0x2b/0x42
 [<ffffffff800d12c1>] remove_vm_area+0x18/0x25
 [<ffffffff800d1315>] __vunmap+0x47/0xed
 [<ffffffff8819deff>] :dm_mod:ctl_ioctl+0x237/0x25b
 [<ffffffff800420f2>] do_ioctl+0x55/0x6b
 [<ffffffff80030175>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800a411e>] sys_futex+0x10a/0x12b
 [<ffffffff8004c5a4>] sys_ioctl+0x59/0x78
 [<ffffffff8005d116>] system_call+0x7e/0x83

lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x2a060000 Data: xa x1000 x16 x0 x0
BUG: soft lockup - CPU#0 stuck for 10s! [multipathd:9266]
CPU 0:
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg floppy i2c_i801 i2c_core ide_cd tg3 cdrom pcspkr e752x_edac edac_mc serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp lpfc scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 9266, comm: multipathd Not tainted 2.6.18-194.11.1.el5.lpfc.heartbeat #1
RIP: 0010:[<ffffffff8007664e>]  [<ffffffff8007664e>] __smp_call_function_many+0x96/0xbc
RSP: 0018:ffff81006fd07bf8  EFLAGS: 00000297
RAX: 0000000000000002 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 00000000000000ff RSI: 00000000000000ff RDI: 00000000000000c0
RBP: ffff81007dec7500 R08: 0000000000000004 R09: 000000000000003c
R10: ffff81006fd07b98 R11: 302041203432323a R12: ffffffff80022176
R13: 0000000000000292 R14: 000000001b001940 R15: ffff810079bc2588
FS:  0000000041ded940(0063) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000001b001940 CR3: 00000000761a0000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076750>] smp_call_function_many+0x38/0x4c
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076841>] smp_call_function+0x4e/0x5e
 [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8819d4ad>] :dm_mod:dev_wait+0x0/0x83
 [<ffffffff800946f5>] on_each_cpu+0x10/0x22
 [<ffffffff800d1292>] __remove_vm_area+0x2b/0x42
 [<ffffffff800d12c1>] remove_vm_area+0x18/0x25
 [<ffffffff800d1315>] __vunmap+0x47/0xed
 [<ffffffff8819deff>] :dm_mod:ctl_ioctl+0x237/0x25b
 [<ffffffff800420f2>] do_ioctl+0x55/0x6b
 [<ffffffff80030175>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800a411e>] sys_futex+0x10a/0x12b
 [<ffffffff8004c5a4>] sys_ioctl+0x59/0x78
 [<ffffffff8005d116>] system_call+0x7e/0x83

lpfc 0000:03:00.0: 0:0310 Mailbox command x31 timeout Data: x20 x700 xffff810058e67c00
lpfc 0000:03:00.0: 0:0345 Resetting board due to mailbox timeout
lpfc 0000:03:00.0: 0:(0):2530 Mailbox command x14 cannot issue Data: xd00 x2

Comment 48 Rob Evers 2010-10-11 15:09:20 UTC
Can Emulex draft a release note for this issue and provide a patch to turn off heartbeats by default, provided this is still required after the recent rhel5.6 lpfc updates?

Thanks, Rob

Comment 49 Rob Evers 2010-10-27 21:05:29 UTC
(In reply to comment #48)
> Can Emulex draft a release note for this issue and provide a patch to turn off
> heartbeats by default provided this is still required after recent rhel5.6 lpfc
> updates
> 
> Thanks, Rob

Vaios,

Can you or someone at Emulex draft a release note for this issue and include it as a comment?  I'll get it into the release notes.

Thanks, Rob

Comment 50 Rob Evers 2010-11-15 20:53:28 UTC
Still looking for a patch from Emulex to turn off heartbeats for rhel5.6.  This will then be backported to rhel5.5 z-stream.

Is this in the works or did I miss something?

Thanks, Rob

Comment 51 Rob Evers 2010-11-16 21:26:49 UTC
(In reply to comment #50)
> Still looking for a patch from Emulex to turn off heartbeats for rhel5.6.  This
> will then be backported to rhel5.5 z-stream.
> 
> Is this in the works or did I miss something?
> 
> Thanks, Rob

Cancelling request for patch.  I drafted a release note to resolve this.

Hopefully a fully baked fix can be found for rhel5.7.

Comment 52 Rob Evers 2010-11-16 21:26:50 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The lpfc_enable_hba_heartbeat parameter should be changed from its default value of enabled to disabled:

Recommend lpfc_enable_hba_heartbeat=0

Not doing so can result in Emulex FC ports being offlined during target controller faults.

Comment 54 Rob Evers 2010-11-16 23:27:15 UTC
Vaios,

Before we go down the release note, rhel5.7 route, any chance you will be providing an update to turn off heartbeats by default?

There isn't much time left at all for a patch to do this for rhel5.6.

Rob

Comment 55 Vaios Papadimitriou 2010-11-17 18:17:14 UTC
Yes, we will be providing an LPFC driver update to turn off the heartbeat timer by default.

We are in the process of building the LPFC driver patches for the corresponding upstream submissions. Once these patches have been pushed upstream, we will submit the corresponding patches for RHEL5.6. I anticipate this will be done by Monday 11/22/10.

Comment 56 Rob Evers 2010-11-18 14:20:47 UTC
Deleted Technical Notes Contents.

Old Contents:
The lpfc_enable_hba_heartbeat parameter should be changed from its default value of enabled to disabled:

Recommend lpfc_enable_hba_heartbeat=0

Not doing so can result in Emulex FC ports being offlined during target controller faults.

Comment 57 Andrius Benokraitis 2010-11-19 19:58:38 UTC
so can this be closed as a dupe of bug 655119 that contains this fix?

Comment 58 Rob Evers 2010-11-24 21:32:56 UTC
(In reply to comment #57)
> so can this be closed as a dupe of bug 655119 that contains this fix?

yes.

Comment 60 Rob Evers 2010-11-24 21:42:11 UTC

*** This bug has been marked as a duplicate of bug 655119 ***

