Bug 516541 - [NetApp 5.5 bug] Emulex FC ports on RHEL 5.4 GA offlined during target controller faults
Summary: [NetApp 5.5 bug] Emulex FC ports on RHEL 5.4 GA offlined during target controller faults
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 5.5
Assignee: Rob Evers
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On: 529244
Blocks: 525215 533192 533941 549906
 
Reported: 2009-08-10 11:10 UTC by Naveen Reddy
Modified: 2023-09-14 01:17 UTC
CC List: 23 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 06:58:13 UTC
Target Upstream Version:
Embargoed:


Attachments
Adding /var/log/messages file (during the port offlining). (50.11 KB, application/octet-stream) - 2009-08-10 11:12 UTC, Naveen Reddy
Attaching the /var/log/messages with more verbosity (1.08 MB, message/gzip) - 2009-08-12 05:56 UTC, Naveen Reddy
LPFC 8.2.0.48.2p to 8.2.0.48.3p patch (2.19 KB, patch) - 2009-11-13 20:45 UTC, Vaios Papadimitriou


Links
Red Hat Product Errata RHSA-2010:0178 (SHIPPED_LIVE): Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update - last updated 2010-03-29 12:18:21 UTC

Description Naveen Reddy 2009-08-10 11:10:01 UTC
Description:

On a RHEL 5.4 Snapshot5 host with an Emulex FC HBA, the HBA ports go offline after I/O faults are run on the NetApp controllers.
Adapter heartbeat failure messages are logged as follows.

Aug 10 13:03:42 IBMx336-200-133 kernel: lpfc 0000:02:00.1: 1:(0):0203 Devloss timeout on WWPN 50:0a:09:83:89:ba:e6:0b NPort x670500 Data: x0 x7 x0
Aug 10 13:03:42 IBMx336-200-133 kernel: lpfc 0000:02:00.1: 1:1303 Link Up Event x1 received Data: x1 x1 x10 x9 x0 x0 0
Aug 10 13:03:42 IBMx336-200-133 kernel: lpfc 0000:02:00.0: 0:(0):0231 RSCN timeout Data: x0 x3
Aug 10 13:03:42 IBMx336-200-133 kernel: lpfc 0000:02:00.0: 0:0459 Adapter heartbeat failure, taking this port offline.

Attaching /var/log/messages captured during the port offline.

How reproducible:
Intermittent
 
Versions:
RHEL5.4 Snapshot5
kernel - 2.6.18-160.el5
lpfc driver version - 8.2.0.48.2p


Steps to reproduce:
1) Map LUNs from the NetApp controller to a RHEL 5.4 host with an Emulex FC HBA card.
2) Configure logical volumes on these LUNs and run I/O (a host-side sketch follows this list).
3) Inject I/O faults on the NetApp controller.
4) After a few faults, the Emulex FC HBA ports go offline.
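
A minimal host-side sketch of step 2, assuming the NetApp LUNs already show up as dm-multipath devices (device names and sizes below are illustrative):

    # Build a logical volume on top of a multipath LUN and put a filesystem on it
    pvcreate /dev/mapper/mpath0
    vgcreate testvg /dev/mapper/mpath0
    lvcreate -n testlv -L 10G testvg
    mkfs.ext3 /dev/testvg/testlv
    mount /dev/testvg/testlv /mnt/test

    # Keep I/O running on the volume while faults are injected on the controller
    while true; do
        dd if=/dev/zero of=/mnt/test/io.dat bs=1M count=1024 oflag=direct
    done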

Actual results:
Emulex FC HBA ports are going offline.
 
Expected Results:
Emulex FC HBA ports should not go offline.

Additional Info: 
This issue is seen in Snapshot5. Not seen in older releases.

Comment 1 Naveen Reddy 2009-08-10 11:12:23 UTC
Created attachment 356875 [details]
Adding /var/log/messages file (during the port offlining).

Comment 3 laurie barry 2009-08-10 14:28:07 UTC
It's possible you have bad hardware.  Are you seeing this only on a single configuration/HBA?

Adding Vaios Papadimitriou from the Emulex Linux support team to investigate.

Laurie

Comment 4 Martin George 2009-08-10 18:07:20 UTC
(In reply to comment #3)
> It's possible you have bad hardware.  Are you seeing this only on a single
> configuration/HBA?

We'll verify whether the issue is reproducible on other hosts.

But it seems that this is a regression of the Emulex driver from snap3 onwards. Have you modified the timeouts in the latest driver?

Comment 5 laurie barry 2009-08-10 18:13:35 UTC
Ok, you may be right, we're looking at it.

Laurie

Comment 6 Bino J Sebastian 2009-08-10 19:57:07 UTC
I need some more information about this issue:

- How do you introduce the I/O faults? Is this done by disabling the HBA link?
- Is this a multipath environment?

Could you reproduce this issue with the lpfc_log_verbose module parameter set to 0xfefbf and send us the log file?
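
For reference, one way to apply this on RHEL 5 (a sketch, assuming the lpfc module can be unloaded, i.e. no boot device sits behind it):

    # /etc/modprobe.conf
    options lpfc lpfc_log_verbose=0xfefbf

    # Quiesce the paths, then reload the driver so the new option takes effect
    modprobe -r lpfc
    modprobe lpfc

    # The running value can also be checked per host via sysfs, if exposed
    cat /sys/class/scsi_host/host*/lpfc_log_verbose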

I reviewed the code changes between these two releases; there are no changes in the timeout values.

Comment 7 Martin George 2009-08-10 20:32:19 UTC
(In reply to comment #6)
> I need some more information about this issue
> 
> - How do you introduce the I/O faults? Is this done by disabling the HBA link?

A clustered NetApp controller pair is used as the target here. The I/O faults mentioned above are actually an individual node taking over and then giving back control of its partner node. This naturally includes the target HBA ports logging out of and back into the fabric.
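
For reference, a sketch of that sequence on a Data ONTAP 7-mode clustered pair, run from the controllers' consoles (commands shown only to illustrate the fault; the actual test procedure may differ):

    # On the partner node's console: take over the faulted node's resources
    cf takeover

    # ... leave host I/O running through the surviving paths for a while ...

    # Return control to the original node
    cf giveback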

> - Is this a multipath environment?

Yes.

> 
> Could you reproduce this issue with the lpfc_log_verbose module parameter set
> to 0xfefbf and send us the log file?

Will do.

Comment 8 Naveen Reddy 2009-08-11 03:36:02 UTC
This issue is seen on other hosts with Emulex FC HBA cards as well. 

And I will provide the logs with verbosity increased.

Comment 9 Martin George 2009-08-11 18:54:03 UTC
So far we've not been able to hit the issue with enhanced verbosity. We'll rerun the tests again to see if this is still reproducible.

Comment 10 Naveen Reddy 2009-08-12 05:56:02 UTC
Created attachment 357118 [details]
Attaching the /var/log/messages with more verbosity

Reproduced the issue by setting the verbosity to 0xfefb.

Comment 11 Martin George 2009-08-18 14:14:15 UTC
Laurie/Bino,

Do you have any updates on this?

Comment 12 Andrius Benokraitis 2009-08-24 13:06:44 UTC
Laurie - Can you acknowledge if this is actually an issue?

Comment 13 laurie barry 2009-08-24 14:41:53 UTC
Martin,

We reviewed the logfile attached to the bugzilla. This looks like a hardware issue. But you've stated you are seeing this on more than one HBA, is that right? We see both ports of the HBA go into a non-responsive state at the same time.

You've also indicated that it is a regression of the Emulex driver from snap3 onwards so we are trying to understand what changed in that time period that could have contributed to this issue.

Laurie

Comment 14 Martin George 2009-08-24 14:57:45 UTC
(In reply to comment #13)
> Martin,
> 
> We reviewed the logfile attached to the bugzilla. This looks like a hardware
> issue. But you've stated you are seeing this on more than one HBA, is that
> right? We see both ports of the HBA go into a non-responsive state at the
> same time.

Yes, we have hit this issue on multiple hosts.

> 
> You've also indicated that it is a regression of the Emulex driver from snap3
> onwards so we are trying to understand what changed in that time period that
> could have contributed to this issue.

We are hitting this issue with the latest lpfc driver v8.2.0.48.2p and not with the previous v8.2.0.48.1p.

Comment 15 laurie barry 2009-08-25 18:15:25 UTC
Martin,

The adapter heartbeat failure messages we see indicate a possible issue with the HBA/firmware that causes the HBA to go into an unresponsive state, which is why we indicated this is possibly a hardware failure. So far, our conclusion is that the driver is behaving as expected.

As to the differences between the 8.2.0.48.1p (Snap3) and 8.2.0.48.2p (RC) driver versions, this is from the driver's ChangeLog:
...
Changes from 20090709 to 20090716

	* Changed version number to 8.2.0.48.2p
	* Fixed panic in menlo sysfs handler
	* Fixed unsolicited CT commands crashing kernel
	* Fixed persistent post state to use config region 23 (CR 91320)
	* Fixed panic/hang when using polling mode for FCP commands
	  (CR 91684)
	* Fix crash when "error" is echoed to board_mode sysfs parameter
...

We reviewed all the changes between the two driver revs and we do not see any change in the driver code that can result in this behavior.

Could you answer the following questions to help us expedite root cause and resolution of this issue:

1. Which HBAs are you using for your testing? You mentioned you saw this behavior on multiple systems; were they all from the same HBA family (LPe11K, LP10K, etc.)?
Also, what is the firmware revision of these HBAs?

2. Is the failure behavior consistent with the 8.2.0.48.2p driver on all tested HBAs/systems, or is it intermittent?

3. What do you use for multipathing?

In the meantime we'll try to reproduce this behavior in our lab, based on our available hardware.

Thanks,
-Vaios-

Comment 16 Naveen Reddy 2009-08-26 05:48:11 UTC
Laurie,

1. HBA - LPe11002-M4.
   We saw the same behavior on another system with the same HBA model (LPe11002-M4).

2. The issue is intermittent.

3. We use Device Mapper Multipathing shipped along with RHEL OS.

Comment 17 Naveen Reddy 2009-08-26 08:51:09 UTC
And the HBA firmware version is 2.82A3.

Comment 18 Rob Evers 2009-09-02 18:13:02 UTC
Has there been any progress in attempting to reproduce this at Emulex?

Is all the required information available to reproduce this or is something else required?

If attempts to reproduce this at Emulex have not succeeded, would it be possible to get an un-bundled patch set to NetApp that could be bisected to see whether a particular patch in the bundle induces the problem? Perhaps this could be posted in this bugzilla?

Comment 19 Andrius Benokraitis 2009-09-14 13:05:47 UTC
Emulex - any status on Rob's query in Comment #18?

Comment 21 Vaios Papadimitriou 2009-10-02 21:56:42 UTC
This issue is currently being investigated by Emulex's firmware team. We will request additional information from NetApp as needed.

Comment 22 Martin George 2009-10-19 13:50:13 UTC
This issue is seen only with the following config - RHEL 5.4 GA + Emulex LPe11k adapters + fw v2.82A3. 

Interestingly, it is not seen with RHEL 5.4 GA + LP11k + fw 2.82A3, RHEL 5.4 GA + LPe11k + fw v2.80, etc.

Still working with Emulex on this.

Comment 23 David Fairbanks 2009-10-20 17:56:38 UTC
Stratus Technologies Inc. is also seeing this bug.

We are using an Emulex lpe1150 HBA. We see this issue only when using Emulex HBA FW revision 2.82a3. We do not see this issue using FW revision 2.80a4.

We are using driver version 8.2.0.48.2p.

We can reproduce this within 15-20 minutes.

Comment 24 Martin George 2009-10-28 08:10:36 UTC
As suggested by Emulex, I disabled Message Signaled Interrupts by setting 'lpfc_use_msi=0' for the LPe11k adapter. And I've not been able to hit the port offline issue after that.

So is this a problem with the MSI handling in fw v2.82a3?
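
For reference, a sketch of how the workaround was applied here, assuming the usual RHEL 5 module-option mechanism:

    # /etc/modprobe.conf
    options lpfc lpfc_use_msi=0

    # Rebuild the initrd so the option is picked up if lpfc loads at boot, then reboot
    mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)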

Comment 25 David Fairbanks 2009-10-28 15:43:33 UTC
Also, as requested by Emulex, I disabled MSI (using the same method as in the previous comment) and re-ran our test.

The test ran for over 24 hours. No failures were seen.

Comment 26 Andrius Benokraitis 2009-11-02 14:13:18 UTC
Emulex - has this fix already been included in another bugzilla or wholesale 5.5 patchset?

Comment 27 Vaios Papadimitriou 2009-11-06 20:01:26 UTC
Yes, a resolution for this issue will be part of the next LPFC driver patch that will be submitted for RHEL5.5.

Comment 28 Martin George 2009-11-09 07:47:15 UTC
Would this make it to the next RHEL 5.4 errata release?

Comment 29 Rob Evers 2009-11-09 14:32:21 UTC
(In reply to comment #27)
> Yes, a resolution for this issue will be part of the next LPFC driver patch
> that will be submitted for RHEL5.5. 

This is not acceptable, because a discrete patch addressing this particular issue for RHEL 5.5 is required to get the fix into the RHEL 5.4 stream.

Comment 30 Vaios Papadimitriou 2009-11-13 20:45:06 UTC
Created attachment 369483 [details]
LPFC 8.2.0.48.2p to 8.2.0.48.3p patch

Comment 31 Vaios Papadimitriou 2009-11-13 20:46:03 UTC
A discrete LPFC driver patch that addresses this issue is attached. This patch also updates the LPFC driver version to 8.2.0.48.3p. It applies on top of the RHEL 5.4 GA LPFC version, 8.2.0.48.2p.

These are the changes included in this patch:
* Changed version number to 8.2.0.48.3p
* Fix for lost MSI interrupt (CR 95404)
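
For anyone who wants to try the fix ahead of the 5.5 errata, a rough sketch of applying the attached patch to a RHEL 5.4 kernel source tree and rebuilding the module (the source path and patch filename below are illustrative):

    # In a prepared RHEL 5.4 (2.6.18-164.el5) kernel source tree
    cd /path/to/linux-2.6.18.x86_64
    patch -p1 --dry-run < lpfc-8.2.0.48.3p.patch   # verify it applies cleanly
    patch -p1 < lpfc-8.2.0.48.3p.patch
    make M=drivers/scsi/lpfc modules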

Comment 32 Andrius Benokraitis 2009-11-13 21:54:10 UTC
The discrete patch in this bugzilla has been rolled up in the wholesale driver update for 5.5 in bug 529244.

Comment 33 Chris Ward 2009-11-16 12:00:12 UTC
@NetApp

We need to confirm that there is third-party commitment to 
test for the resolution of this request during the RHEL 5.5 
Beta Test Phase before we can approve it for acceptance 
into the release.

RHEL 5.5 Beta Test Phase is expected to begin around February
2010.

In order to avoid any unnecessary delays, please post a 
confirmation as soon as possible, including the contact 
information for testing engineers.

Any additional information about alternative testing variations we 
could use to reproduce this issue in-house would be appreciated.

Comment 34 Chris Ward 2009-11-16 12:01:12 UTC
@Emulex, @Stratus. Comment #33 is relevant for each of you as well. Thanks!

Comment 35 laurie barry 2009-11-16 13:46:24 UTC
Agreed.

Laurie

Comment 37 Don Zickus 2009-12-11 19:29:02 UTC
in kernel-2.6.18-179.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 39 Martin George 2009-12-18 20:33:57 UTC
With the updated kernel 2.6.18-179.el5 mentioned above, I/O has been running successfully on my RHEL 5.4 host (with target controller faults) for more than 24 hours now.

Comment 41 Wayne Berthiaume 2010-01-07 16:17:27 UTC
Will the fix be provided in a RHEL 5.4.z release, or is it only expected in RHEL 5.5?

Comment 42 Andrius Benokraitis 2010-01-07 16:24:49 UTC
Both! See bug 549906 for the 5.4.z.

Comment 43 Wayne Berthiaume 2010-01-07 16:59:38 UTC
Thanks Andrius. 

We've replicated the bug here as well, but will regression-test RHEL 5.4 once the errata is available.

Regards,
Wayne.

Comment 47 errata-xmlrpc 2010-03-30 06:58:13 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 48 Red Hat Bugzilla 2023-09-14 01:17:37 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

