Bug 467153

Summary: [QLogic 5.3 bug] latest qlogic driver takes several minutes to find LUNs on older qla2xx controller
Product: Red Hat Enterprise Linux 5
Reporter: Doug Chapman <dchapman>
Component: kernel
Assignee: Marcus Barrow <mbarrow>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: urgent
Docs Contact:
Priority: medium
Version: 5.3
CC: andrew.vasquez, andriusb, berthiaume_wayne, coughlan, cward, dwa, dzickus, kueda, m-ikeda, mikeda, qlogic-redhat-ext, rpacheco
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 20:16:10 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 415811
Attachments:
Description                                                          Flags
messages with timestamps                                             none
logs with ql2xextended_error_logging=1                               none
Don't do NPIV table init for older HBA's                             none
dmesg loading with ql2xextended_error_logging=1 on an NEC machine    none

Description Doug Chapman 2008-10-16 02:33:03 UTC
Description of problem:
On one of my HP ia64 systems with an HP-rebranded QLogic controller, the recent kernels take several minutes to scan the LUNs.  The LUNs are hosted by an HP MSA1000 (I don't know if that is significant).

I have another HP Integrity server with a newer model qlogic card that works OK.  That server is connected to the same MSA1000.


This was introduced in kernel-2.6.18-118.  I am testing that kernel plus some patches from Marcus and Mike Christie that resolve some other issues with the qlogic driver in that kernel.

In kernel-2.6.18-117 and earlier the LUNs are scanned instantly.  In -118 it waits about 4.5 minutes and then I see these kernel messages:

qla2xxx 0000:80:02.0: Performing ISP error recovery - ha= e0000040f1e104f8.
qla2xxx 0000:80:02.0: LIP reset occured (f7f7).
qla2xxx 0000:80:02.0: LOOP UP detected (2 Gbps).
  Vendor: COMPAQ    Model: MSA1000 VOLUME    Rev: 4.48
  Type:   Direct-Access                      ANSI SCSI revision: 04

... then messages from finding the MSA1000 LUNs as expected....


The critical aspect of this bug is that, because of how the initrd handles parallel LUN scanning, the delay makes it impossible to boot from any of the disks behind this controller.


Version-Release number of selected component (if applicable):
kernel-2.6.18-118

How reproducible:
100% on this particular system.



Comment 1 Doug Chapman 2008-10-16 02:39:21 UTC
Created attachment 320509 [details]
messages with timestamps

I found I can reproduce this easily at runtime by rmmod qla2xxx / modprobe qla2xxx.  I captured the logs from /var/log/messages so you can see the timestamps and can see exactly where the long 4+ minute pause is.
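
For anyone who wants to locate the pause in their own /var/log/messages capture, here is a minimal sketch in Python. It assumes the classic syslog prefix ("Mon DD HH:MM:SS") at the start of each line and an English/C locale for month names. The function name `largest_gap` and the sample lines are made up for illustration; they are not taken from the attachment.

```python
from datetime import datetime

def largest_gap(lines, year=2008):
    """Return (gap_seconds, line_before, line_after) for the longest pause.

    Syslog timestamps carry no year, so one is assumed (hypothetical default).
    Returns None if fewer than two lines have a parsable timestamp.
    """
    stamped = []
    for line in lines:
        try:
            # First 15 characters hold the timestamp, e.g. "Oct 15 22:30:01"
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        stamped.append((ts, line.rstrip("\n")))

    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

# Illustrative lines only (not real log output from this bug).
sample = [
    "Oct 15 22:30:00 host kernel: (driver load begins)",
    "Oct 15 22:30:01 host kernel: (last message before the pause)",
    "Oct 15 22:34:31 host kernel: (first message after the pause)",
    "Oct 15 22:34:32 host kernel: (LUN discovery messages follow)",
]

gap, before, after = largest_gap(sample)
print(f"{gap:.0f} s pause between:\n  {before}\n  {after}")
```

To run it against a real capture, pass `open("/var/log/messages")` as `lines`. Captures that span a year boundary would need extra handling that this sketch does not attempt.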

Comment 2 Marcus Barrow 2008-10-16 04:03:24 UTC
If you could try reproducing this with "ql2xextended_error_logging=1" appended
to the modprobe line, that would be a big help.

The log shows that there was a timeout and error recovery was performed.
Perhaps more logging would point to a clue...

Comment 3 Doug Chapman 2008-10-16 12:40:40 UTC
Created attachment 320550 [details]
logs with ql2xextended_error_logging=1

Comment 4 Marcus Barrow 2008-10-16 19:24:15 UTC
Created attachment 320596 [details]
Don't do NPIV table init for older HBA's

Comment 5 RHEL Program Management 2008-10-17 15:27:49 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Don Zickus 2008-10-20 15:13:51 UTC
in kernel-2.6.18-120.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 9 Chris Ward 2008-10-21 13:10:35 UTC
Attention Partners! 

RHEL 5.3 public Beta will be released soon. This URGENT severity bug
should have a fix in place in the recently released Partner Alpha drop,
available at ftp://partners.redhat.com. If you haven't had a chance yet to test
this bug, please do so at your earliest convenience, to ensure the highest
possible quality bits in the upcoming Beta drop.

Thanks, more information about Beta testing to come.

 - Red Hat QE Partner Management

Comment 10 Munehiro IKEDA 2008-10-23 23:48:12 UTC
The situation seems to have gotten a little worse.
kernel-2.6.18-120.el5 can't find all the LUNs connected to the 4G-FC HBAs.
I used a QLA2460 and a QLA2462 on an NEC ia64 machine.
The machine also has 2G-FC HBAs (QCP2340, QLA2342).  kernel-2.6.18-118.el5 took 4-5 minutes to find the LUNs connected to them, but kernel-2.6.18-120.el5 now works well with them.
So a new issue has appeared for the 4G-FC HBAs, even though the issue for the 2G-FC HBAs is solved.

Comment 11 Marcus Barrow 2008-10-24 14:30:42 UTC
On my test bed, this driver has been finding LUNs. Can you re-run your test after loading the driver with "ql2xextended_error_logging=1" set and attach the log?

Also, could you describe your storage setup, including how many LUNs should be found?

Thanks.

Comment 12 Munehiro IKEDA 2008-10-24 19:10:33 UTC
Created attachment 321456 [details]
dmesg loading with ql2xextended_error_logging=1 on an NEC machine

The attachment is the dmesg output after
  # modprobe qla2xxx ql2xextended_error_logging=1
on an NEC machine.
FC storages should be found as sdg--sdaj (30 LUNs) but they are not.

Storage on the machine is set up as follows.

(S1) LSI Logic LSI22320 (Ultra320 SCSI) --> HDD 0-0, 0-1
(S2) LSI Logic LSI22320 (Ultra320 SCSI) --> DAT
(S3) LSI Logic LSI22320 (Ultra320 SCSI) --> HDD 1-0, 1-1
(S4) LSI Logic LSI22320 (Ultra320 SCSI) --> HDD 1-2, 1-3

(F1) Qlogic QLA2460 (4G-FC, single)      --> FC-Storage
(F2) Qlogic QLA2462 (4G-FC, dual): port1 --> FC-Storage
                                 : port2 --> (N/C)

(F1) and (F2) are connected to the same FC-Storage and form a 2-path multipath configuration.
The FC-Storage has 15 LUNs.  So the kernel should recognize 30 LUNs on the FC-Storage.
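
As a quick cross-check of the numbers above, here is a small Python sketch. It assumes the usual Linux sd naming sequence (sda..sdz, then sdaa, sdab, ...). The helper name `sd_index` is made up for illustration.

```python
def sd_index(name):
    """Zero-based position of an sd device name: sda -> 0, sdz -> 25, sdaa -> 26."""
    n = 0
    for ch in name[2:]:              # letters after the "sd" prefix
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1

luns_per_path = 15                   # LUNs exported by the FC-Storage
paths = 2                            # (F1) and (F2) port1
expected = luns_per_path * paths     # block devices the kernel should create

# Number of device names in the range sdg..sdaj, inclusive
span = sd_index("sdaj") - sd_index("sdg") + 1

print(expected, span)
```

Both values come out to 30, so the range sdg--sdaj matches 15 LUNs seen over two paths.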

Comment 13 Chris Ward 2008-11-28 07:15:50 UTC
Partners, this bug should be fixed in the latest RHEL 5.3 Snapshot. We believe that you have some interest in its correct functionality, so we're making a friendly request to send us some testing feedback. 

If you have a chance to test it, please share with us your findings. If you have successfully VERIFIED the fix, please add PartnerVerified to the Bugzilla keywords, along with a description of the results. Thanks!

Comment 14 Marcus Barrow 2008-12-02 16:06:49 UTC
patch verified in 2.6.18-123.el5

The second problem described in this BZ by Munehiro IKEDA is resolved in a newer BZ, BZ 471269.

Comment 16 errata-xmlrpc 2009-01-20 20:16:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html