Red Hat Bugzilla – Bug 467153
[QLogic 5.3 bug] latest qlogic driver takes several minutes to find LUNs on older qla2xx controller
Last modified: 2009-06-20 03:40:30 EDT
Description of problem:
On one of my HP ia64 systems with an HP-rebranded QLogic controller, recent kernels take several minutes to scan the LUNs. The LUNs are hosted by an HP MSA1000 (I don't know if that is significant).
I have another HP Integrity server with a newer model qlogic card that works OK. That server is connected to the same MSA1000.
This was introduced in kernel-2.6.18-118. I am testing that kernel plus some patches from Marcus and Mike Christie that resolve some other issues with the qlogic driver in that kernel.
In kernel-2.6.18-117 and earlier the LUNs are scanned instantly. In -118 it waits about 4.5 minutes and then I see these kernel messages:
qla2xxx 0000:80:02.0: Performing ISP error recovery - ha= e0000040f1e104f8.
qla2xxx 0000:80:02.0: LIP reset occured (f7f7).
qla2xxx 0000:80:02.0: LOOP UP detected (2 Gbps).
Vendor: COMPAQ Model: MSA1000 VOLUME Rev: 4.48
Type: Direct-Access ANSI SCSI revision: 04
... then messages from finding the MSA1000 LUNs as expected....
The critical aspect of this bug is that, because of how the initrd handles parallel LUN scanning, it is impossible to boot from any of the disks attached to this controller.
Version-Release number of selected component (if applicable):
How reproducible:
100% on this particular system.
Steps to Reproduce:
Created attachment 320509
messages with timestamps
I found I can reproduce this easily at runtime by rmmod qla2xxx / modprobe qla2xxx. I captured the logs from /var/log/messages so you can see the timestamps and can see exactly where the long 4+ minute pause is.
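For anyone reading a similar log, the pause can be located mechanically. The following is only a minimal sketch, assuming each line starts with a classic syslog timestamp ("Mon DD HH:MM:SS"). The function name `largest_gap` and the default year are illustrative assumptions and are not taken from the attachment:

```python
from datetime import datetime, timedelta


def largest_gap(lines, year=2008):
    """Return (gap, line_before, line_after) for the longest pause
    between consecutive timestamped lines.

    Syslog timestamps carry no year, so one is assumed (hypothetical
    default: 2008). Lines without a parsable timestamp are skipped.
    """
    best = (timedelta(0), None, None)
    prev_time = prev_line = None
    for line in lines:
        try:
            # The first 15 characters look like "Oct 15 14:02:11".
            stamp = datetime.strptime(f"{year} {line[:15]}",
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if prev_time is not None and stamp - prev_time > best[0]:
            best = (stamp - prev_time, prev_line, line)
        prev_time, prev_line = stamp, line
    return best
```

Calling it as `largest_gap(open("/var/log/messages"))` should return the two lines that bracket the long wait, which in this report would be expected to sit just before the "Performing ISP error recovery" message.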
If you could try reproducing this with "ql2xextended_error_logging=1" appended
to the modprobe line, that would be a big help.
The log shows that there was a timeout and error recovery was performed.
Perhaps more logging would point to a clue...
Created attachment 320550
logs with ql2xextended_error_logging=1
Created attachment 320596
Don't do NPIV table init for older HBA's
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
You can download this test kernel from http://people.redhat.com/dzickus/el5
RHEL 5.3 public Beta will be released soon. This URGENT severity bug
should have a fix in place in the recently released Partner Alpha drop,
available at ftp://partners.redhat.com. If you haven't had a chance yet to test
this bug, please do so at your earliest convenience, to ensure the highest
possible quality bits in the upcoming Beta drop.
Thanks, more information about Beta testing to come.
- Red Hat QE Partner Management
The situation seems to have gotten slightly worse.
kernel-2.6.18-120.el5 can't find all of the LUNs connected to the 4G-FC HBAs.
I used a QLA2460 and a QLA2462 on an NEC ia64 machine.
The machine also has 2G-FC HBAs (QCP2340, QLA2342). kernel-2.6.18-118.el5 took 4-5 minutes to find the LUNs connected to them, but kernel-2.6.18-120.el5 now works well with them.
So a new issue has appeared for the 4G-FC HBAs, even though the issue for the 2G-FC HBAs is solved.
On my test bed, this driver has been finding LUNs. Can you re-run your test after loading the driver with "ql2xextended_error_logging=1" set and attach the log?
Also, could you describe your storage setup, including how many LUNs should be found?
Created attachment 321456
dmesg loading with ql2xextended_error_logging=1 on an NEC machine
The attachment is the dmesg output after
# modprobe qla2xxx ql2xextended_error_logging=1
on an NEC machine.
The FC storage LUNs should be found as sdg--sdaj (30 LUNs), but they are not.
Storage on the machine is set up as follows.
(S1) LSI Logic LSI22320 (Ultra320 SCSI) --> HDD 0-0, 0-1
(S2) LSI Logic LSI22320 (Ultra320 SCSI) --> DAT
(S3) LSI Logic LSI22320 (Ultra320 SCSI) --> HDD 1-0, 1-1
(S4) LSI Logic LSI22320 (Ultra320 SCSI) --> HDD 1-2, 1-3
(F1) Qlogic QLA2460 (4G-FC, single) --> FC-Storage
(F2) Qlogic QLA2462 (4G-FC, dual): port1 --> FC-Storage
                                   port2 --> (N/C)
(F1) and (F2) are connected to the same FC-Storage and form a 2-path multipath configuration.
The FC-Storage has 15 LUNs, so the kernel should recognize 30 LUNs on the FC-Storage.
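The expected device range can be checked with a little arithmetic. This is a rough sketch, assuming the usual Linux sd naming (sda..sdz, then sdaa, sdab, ...) and assuming the six SCSI HDDs listed above occupy sda through sdf; the helper name `sd_name` is made up for this illustration:

```python
def sd_name(index):
    """Map a 0-based disk index to an sd name: 0 -> sda, 25 -> sdz, 26 -> sdaa."""
    suffix = ""
    index += 1  # switch to 1-based for the letter-only (bijective base-26) scheme
    while index > 0:
        index, rem = divmod(index - 1, 26)
        suffix = chr(ord("a") + rem) + suffix
    return "sd" + suffix


LUNS = 15            # LUNs exported by the FC-Storage (from the report)
PATHS = 2            # (F1) and (F2) each see every LUN
FIRST_FC_INDEX = 6   # assumption: sda..sdf are the six SCSI HDDs, so FC starts at sdg

expected = [sd_name(FIRST_FC_INDEX + i) for i in range(LUNS * PATHS)]
print(len(expected), expected[0], expected[-1])  # 30 sdg sdaj
```

The result is consistent with the range stated above: 15 LUNs over 2 paths gives 30 devices, running from sdg to sdaj.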
Partners, this bug should be fixed in the latest RHEL 5.3 Snapshot. We believe that you have some interest in its correct functionality, so we're making a friendly request to send us some testing feedback.
If you have a chance to test it, please share with us your findings. If you have successfully VERIFIED the fix, please add PartnerVerified to the Bugzilla keywords, along with a description of the results. Thanks!
patch verified in 2.6.18-123.el5
The second problem described in this BZ by Munehiro IKEDA is resolved in a newer BZ, BZ 471269.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.