Description:
On a RHEL5.4 Snapshot5 host with an Emulex FC HBA, the HBA ports go offline after running I/O faults on the NetApp controllers. Adapter heartbeat failure messages are displayed as follows:

Aug 10 13:03:42 IBMx336-200-133 kernel: lpfc 0000:02:00.1: 1:(0):0203 Devloss timeout on WWPN 50:0a:09:83:89:ba:e6:0b NPort x670500 Data: x0 x7 x0
Aug 10 13:03:42 IBMx336-200-133 kernel: lpfc 0000:02:00.1: 1:1303 Link Up Event x1 received Data: x1 x1 x10 x9 x0 x0 0
Aug 10 13:03:42 IBMx336-200-133 kernel: lpfc 0000:02:00.0: 0:(0):0231 RSCN timeout Data: x0 x3
Aug 10 13:03:42 IBMx336-200-133 kernel: lpfc 0000:02:00.0: 0:0459 Adapter heartbeat failure, taking this port offline.

Attaching /var/log/messages captured during the port offlining.

How reproducible: Intermittent

Versions:
RHEL5.4 Snapshot5 kernel - 2.6.18-160.el5
lpfc driver version - 8.2.0.48.2p

Steps to reproduce:
1) Map LUNs from the NetApp controller to the RHEL5.4 host with the Emulex FC HBA card.
2) Configure logical volumes on these LUNs and run I/O.
3) Inject I/O faults on the NetApp controller.
4) After a few faults, the Emulex FC HBA ports can be seen going offline.

Actual results: Emulex FC HBA ports go offline.
Expected results: Emulex FC HBA ports should not go offline.

Additional info: This issue is seen in Snapshot5; it was not seen in older releases.
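The host-side part of the reproduction can be sketched roughly as below. All device and volume names (mpath0, vg_test, etc.) are invented for illustration, and the fault injection itself (step 3) happens on the NetApp controllers, not on the host. In the real setup the I/O loop runs indefinitely against a file on the logical volume; here it is bounded and defaults to a temp file so the loop itself can run anywhere.

```shell
# Steps 1-2 (host side; requires the mapped LUNs and dm-multipath,
# so shown as comments only):
#   pvcreate /dev/mapper/mpath0
#   vgcreate vg_test /dev/mapper/mpath0
#   lvcreate -L 10G -n lv_test vg_test
#   mkfs.ext3 /dev/vg_test/lv_test
#   mount /dev/vg_test/lv_test /mnt/test

# Step 2's "run I/O": a simple sustained write/read load.
# TARGET would be a file on the mounted LV in the real setup.
TARGET=${TARGET:-$(mktemp)}
for i in 1 2 3; do
    dd if=/dev/zero of="$TARGET" bs=1M count=8 2>/dev/null
    dd if="$TARGET" of=/dev/null bs=1M 2>/dev/null
done
echo "wrote and re-read $(wc -c < "$TARGET") bytes"
```

In the actual test the loop would be `while true`, left running while the controller takeover/giveback faults are injected.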
Created attachment 356875 [details] Adding /var/log/messages file (during the port offlining).
It's possible you have bad hardware. Are you seeing this only on a single configuration/HBA? Adding Vaios Papadimitriou from the Emulex Linux support team to investigate. Laurie
(In reply to comment #3)
> It's possible you have bad hardware. Are you seeing this only on a single
> configuration/HBA?

We'll verify whether the issue is reproducible on other hosts, but it seems to be a regression in the Emulex driver from snap3 onwards. Have you modified the timeouts in the latest driver?
Ok, you may be right, we're looking at it. Laurie
I need some more information about this issue:

- How do you introduce the I/O faults? Is this done by disabling the HBA link?
- Is this a multipath environment?

Could you reproduce this issue with the lpfc_log_verbose module parameter set to 0xfefbf and send us the log file?

I reviewed the code changes between these two releases; there are no changes in the timeout values.
(In reply to comment #6)
> - How do you introduce IO faults. Is this done by disabling HBA link

A clustered NetApp controller pair is used as the target here. The I/O faults mentioned above are each node taking over, and then giving back, control of its partner node. This naturally includes the target HBA ports logging in and out of the fabric.

> - Is this a multipath environment ?

Yes.

> Could you reproduce this issue with lpfc_log_verbose module parameter set to
> 0xfefbf and send us the log file.

Will do.
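For reference, one common way to set the requested verbosity on RHEL 5 is a module option (the parameter name and value are from comment #6; the file location and mechanics below are illustrative and depend on the installed setup):

```
# /etc/modprobe.conf
options lpfc lpfc_log_verbose=0xfefbf
```

After adding the line, reload the lpfc module (or rebuild the initrd with mkinitrd and reboot if the driver loads at boot) so the parameter takes effect; the active value should then be visible under /sys/module/lpfc/parameters/.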
This issue is seen on other hosts with Emulex FC HBA cards as well. And I will provide the logs with verbosity increased.
So far we've not been able to hit the issue with enhanced verbosity. We'll rerun the tests again to see if this is still reproducible.
Created attachment 357118 [details] Attaching the /var/log/messages with more verbosity Reproduced the issue by setting the verbosity to 0xfefb.
Laurie/Bino, Do you have any updates on this?
Laurie - Can you acknowledge if this is actually an issue?
Martin,

We reviewed the logfile attached to the bugzilla. This looks like a hardware issue, but you've stated you are seeing this on more than one HBA, is that right? We are seeing both ports of the HBA go to a non-responsive state at the same time.

You've also indicated that this is a regression of the Emulex driver from snap3 onwards, so we are trying to understand what changed in that time period that could have contributed to this issue.

Laurie
(In reply to comment #13)
> We reviewed the logfile attached to the bugzilla. This looks like a hardware
> issue. But you've stated you are seeing this on more than one HBA, is that
> right? We are seeing both the ports of the HBA goes to a non-responsive state
> at same time.

Yes, we have hit this issue on multiple hosts.

> You've also indicated that it is a regression of the Emulex driver from snap3
> onwards so we are trying to understand what changed in that time period that
> could have contributed to this issue.

We are hitting this issue with the latest lpfc driver, v8.2.0.48.2p, and not with the previous v8.2.0.48.1p.
Martin,

The Adapter heartbeat failure messages we see indicate a possible issue with the HBA/firmware that causes the HBA to go into an unresponsive state; that is why we indicated this is possibly a HW failure. So far, our conclusion is that the driver is behaving as expected.

As to the differences between the 8.2.0.48.1p (Snap3) and 8.2.0.48.2p (RC) driver versions, this is from the driver's ChangeLog:

...
Changes from 20090709 to 20090716
* Changed version number to 8.2.0.48.2p
* Fixed panic in menlo sysfs handler
* Fixed unsolicited CT commands crashing kernel
* Fixed persistent post state to use config region 23 (CR 91320)
* Fixed panic/hang when using polling mode for FCP commands (CR 91684)
* Fix crash when "error" is echoed to board_mode sysfs parameter
...

We reviewed all the changes between the two driver revs and do not see any change in the driver code that could result in this behavior.

Could you answer the following questions to help us expedite root-cause and resolution of this issue:

1. What HBAs are you using for your testing? You mentioned you saw this behavior on multiple systems; were they all the same HBA family (LPe11K, LP10K, etc.)? Also, what is the firmware rev of these HBAs?
2. Is the failure behavior consistent with the 8.2.0.48.2p driver on all tested HBAs/systems, or is it intermittent?
3. What do you use for multipathing?

In the meantime we'll try to reproduce this behavior in our lab, based on our available hardware.

Thanks,
-Vaios-
Laurie,

1. HBA - LPe11002-M4. We saw the same behavior on another system with the same HBA model (LPe11002-M4).
2. The issue is intermittent.
3. We use the Device Mapper multipathing shipped with the RHEL OS.
And the HBA firmware version is 2.82A3.
Has there been any progress in attempting to reproduce this at Emulex? Is all the required information available to reproduce it, or is something else needed? If attempts to reproduce it at Emulex have been made without success, would it be possible to provide NetApp with an unbundled patch set that could be bisected to see whether a particular patch in the bundle induces the problem? Perhaps this could be posted in this bugzilla?
Emulex - any status on Rob's query in Comment #18?
This issue is currently being investigated by Emulex's firmware team. We will request additional information from NetApp as needed.
This issue is seen only with the following config:
- RHEL 5.4 GA + Emulex LPe11k adapters + fw v2.82A3

Interestingly, it is not seen with:
- RHEL 5.4 GA + LP11k + fw v2.82A3
- RHEL 5.4 GA + LPe11k + fw v2.80
etc.

Still working with Emulex on this.
Stratus Technologies Inc. is also seeing this bug. We are using an Emulex LPe1150 HBA. We see this issue only with Emulex HBA FW revision 2.82a3; we do not see it with FW revision 2.80a4. We are using driver version 8.2.0.48.2p. We can reproduce this within 15-20 minutes.
As suggested by Emulex, I disabled Message Signaled Interrupts by setting 'lpfc_use_msi=0' for the LPe11k adapter. And I've not been able to hit the port offline issue after that. So is this a problem with the MSI handling in fw v2.82a3?
Also as requested by Emulex, I have disabled MSI interrupts (same method as previous comment) and re-ran our test. The test ran for over 24 hours. No failures were seen.
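For anyone tracking the workaround used in the two comments above, the MSI disable amounts to a one-line module option (the modprobe.conf entry below is one common way to set it on RHEL 5; the verification hint is illustrative):

```
# /etc/modprobe.conf -- force legacy INTx interrupts instead of MSI
options lpfc lpfc_use_msi=0
```

After reloading the driver (or rebuilding the initrd and rebooting), whether MSI is actually off can be confirmed by checking that the lpfc lines in /proc/interrupts no longer show PCI-MSI.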
Emulex - has this fix already been included in another bugzilla or wholesale 5.5 patchset?
Yes, a resolution for this issue will be part of the next LPFC driver patch that will be submitted for RHEL5.5.
Would this make it to the next RHEL 5.4 errata release?
(In reply to comment #26)
> Yes, a resolution for this issue will be part of the next LPFC driver patch
> that will be submitted for RHEL5.5.

This is not acceptable: a discrete patch addressing this particular issue for RHEL5.5 is required to get the fix into the RHEL5.4 stream.
Created attachment 369483 [details] LPFC 8.2.0.48.2p to 8.2.0.48.3p patch
A discrete LPFC driver patch that addresses this issue is attached. The patch also updates the LPFC driver version to 8.2.0.48.3p, and it applies on top of the RHEL5.4 GA 8.2.0.48.2p LPFC version. These are the changes included in this patch:

* Changed version number to 8.2.0.48.3p
* Fix for lost MSI interrupt (CR 95404)
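For reference, a discrete driver patch like the attached one would normally be applied from the top of the driver source tree with `patch -p1`. The toy tree, file contents, and patch name below are stand-ins so the mechanics can be shown end to end; no real lpfc source is involved, and only the version string from the changelog above is taken from this bug.

```shell
# Demo of applying a unified diff with patch -p1 (hypothetical paths).
set -e
work=$(mktemp -d)
mkdir -p "$work/drivers/scsi/lpfc"
printf '#define LPFC_DRIVER_VERSION "8.2.0.48.2p"\n' \
    > "$work/drivers/scsi/lpfc/lpfc_version.h"
cat > "$work/lpfc.patch" <<'EOF'
--- a/drivers/scsi/lpfc/lpfc_version.h
+++ b/drivers/scsi/lpfc/lpfc_version.h
@@ -1 +1 @@
-#define LPFC_DRIVER_VERSION "8.2.0.48.2p"
+#define LPFC_DRIVER_VERSION "8.2.0.48.3p"
EOF
cd "$work"
# -p1 strips the leading a/ and b/ components from the diff paths
patch -p1 < lpfc.patch
grep LPFC_DRIVER_VERSION drivers/scsi/lpfc/lpfc_version.h
```

In the real workflow the module would then be rebuilt against the running kernel's build tree and reloaded; the exact build invocation depends on how the driver source is packaged.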
The discrete patch in this bugzilla has been rolled up in the wholesale driver update for 5.5 in bug 529244.
@NetApp We need to confirm that there is third-party commitment to test for the resolution of this request during the RHEL 5.5 Beta Test Phase before we can approve it for acceptance into the release. RHEL 5.5 Beta Test Phase is expected to begin around February 2010. In order to avoid any unnecessary delays, please post a confirmation as soon as possible, including the contact information for testing engineers. Any additional information about alternative testing variations we could use to reproduce this issue in-house would be appreciated.
@Emulex, @Stratus: Comment #32 is relevant for each of you as well. Thanks!
Agreed. Laurie
This fix is included in kernel-2.6.18-179.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
With the updated kernel v2.6.18-179.el5 mentioned, IO has been running successfully on my RHEL 5.4 host (with target controller faults) for more than 24 hours now.
Will the fix be provided in a RHEL 5.4.z release or only expected in RHEL 5.5?
Both! See bug 549906 for the 5.4.z.
Thanks Andrius. We've replicated the bug here as well, but will regression-test RHEL 5.4 once the errata is available. Regards, Wayne.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days