174427 – SCSI errors with latest qlogic driver

Summary: SCSI errors with latest qlogic driver

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.3
Hardware:	ia64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Mike Christie
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	175534 (view as bug list)
Depends On:
Blocks:	168429 175195

Reported:	2005-11-28 22:54 UTC by Doug Chapman
Modified:	2007-11-30 22:07 UTC (History)
CC List:	5 users (show)
Fixed In Version:	RHSA-2006-0132
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-03-07 20:53:20 UTC
Target Upstream Version:
Embargoed:

Attachments
dmesg from booting with scsi errs (27.56 KB, text/plain) 2005-11-29 18:13 UTC, Doug Chapman
qla2xxx log (103.41 KB, text/plain) 2005-12-01 22:34 UTC, Mike Christie
bad patch (2.55 KB, patch) 2005-12-02 01:10 UTC, Mike Christie

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2006:0132	0	qe-ready	SHIPPED_LIVE	Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3	2006-03-09 16:31:00 UTC

Description Doug Chapman 2005-11-28 22:54:07 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050719 Red Hat/1.0.6-1.4.1 Firefox/1.0.6

Description of problem:
Starting with 2.6.9-22.24.EL I am seeing SCSI errors from an HP storage array connected via a qlogic FC on my HP Integrity ia64 system.  Rebooting back to 2.6.9-22.23.EL cleans the problem up.  The later kernels up to kernel-2.6.9-22.27.EL have this problem as well.

The errors are seen during bootup when LVM is looking for volumes:

Scanning logical volumes
  Reading all physical volumes.  This may take a while...
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 0
Buffer I/O error on device sdd, logical block 0
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 32
Buffer I/O error on device sdd, logical block 1
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 64
Buffer I/O error on device sdd, logical block 2
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 96
Buffer I/O error on device sdd, logical block 3
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 0
Buffer I/O error on device sdd, logical block 0
  /dev/sdd: read failed after 0 of 16384 at 0: Input/output error
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 104857472
  /dev/sdd: read failed after 0 of 16384 at 53687025664: Input/output error
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 0
  /dev/sdd: read failed after 0 of 16384 at 0: Input/output error
SCSI error : <2 0 0 2> return code = 0x20000
end_request: I/O error, dev sde, sector 0
Buffer I/O error on device sde, logical block 0
SCSI error : <2 0 0 2> return code = 0x20000
end_request: I/O error, dev sde, sector 32
Buffer I/O error on device sde, logical block 1
SCSI error : <2 0 0 2> return code = 0x20000
end_request: I/O error, dev sde, sector 64


... and so on ...

Once the system has booted, the messages continue to occur every few minutes while it is idle.
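
As a cross-check on the log above, the sector numbers printed by end_request and the byte offsets printed by LVM are consistent with 512-byte sectors, and the sector/logical-block pairs suggest a 16384-byte buffer block size. The sketch below only redoes that arithmetic. The function names are made up for illustration, and the 16384-byte block size is inferred from the log lines rather than stated anywhere in this report.

```python
SECTOR_SIZE = 512  # bytes per sector as counted in the end_request messages


def sector_to_byte_offset(sector):
    """Convert a sector number from end_request into a byte offset."""
    return sector * SECTOR_SIZE


def sector_to_logical_block(sector, block_size):
    """Convert a sector number into a buffer-cache logical block index."""
    return sector_to_byte_offset(sector) // block_size


# Assumption: block size inferred from "sector 32" pairing with "logical block 1".
BLOCK_SIZE = 16384

# "sector 104857472" pairs with "read failed ... at 53687025664" in the log.
print(sector_to_byte_offset(104857472))

# Sector/block pairs copied from the log: (0, 0), (32, 1), (64, 2), (96, 3).
for sector in (0, 32, 64, 96):
    print(sector, sector_to_logical_block(sector, BLOCK_SIZE))
```

Read this way, the failing reads are at the very start of each device and near its end, which is where LVM looks for its labels and metadata.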



Version-Release number of selected component (if applicable):
kernel-2.6.9-22.24.EL

How reproducible:
Always

Steps to Reproduce:
1. Boot an HP Integrity system with qlogic FC connected to an HP VA7100 array

Actual Results:  SCSI errors and very slow booting

Additional info:

Comment 1 Mike Christie 2005-11-29 17:18:32 UTC

Looks like it might be a connection problem. Could you send the rest of the log?
Are there any other qlogic/qla2xxx or scsi messages in the log?

Comment 2 Mike Christie 2005-11-29 17:32:42 UTC

It looks like the larger qlogic update got in at 22.11 here
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=168544

And all that got merged into 22.24 was
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=149294
transport rediscovery. I was going to ask you to back out that patch, but the
patch does not touch the driver's core code. It just adds a sysfs attribute.

Comment 3 Doug Chapman 2005-11-29 18:13:16 UTC

Created attachment 121603 [details]
dmesg from booting with scsi errs

Comment 4 Doug Chapman 2005-11-29 18:14:03 UTC

I have attached the dmesg output from booting with 2.6.9-23.  I don't notice any
other SCSI messages beyond what I had posted; however, I am no storage expert.

If the qlogic driver did not change then perhaps it is a dm issue?  These disks
are dual path.  I will pull one path and reboot to see if the problem goes away.

Comment 5 Doug Chapman 2005-11-29 18:23:29 UTC

The problems go away when I remove the second FC path to the disks.  Looks like
this is probably a device mapper bug, not a qlogic driver bug.  Sorry for my
incorrect assumption on this.

Comment 6 Doug Chapman 2005-11-30 18:33:01 UTC

I see this was set as closed.  Did I accidentally do that?  It is still a
problem, just not a qlogic driver problem.  It needs to be assigned to the dm
maintainer.

Comment 7 Jason Baron 2005-11-30 18:40:36 UTC

ok. i'm not sure if this is a dm problem

Comment 8 Mike Christie 2005-11-30 19:41:21 UTC

Can you verify that, with the setup that works for pre-.24 kernels, everything was
plugged in, or did you have it set up with only one path originally when it
worked? From comment #5 it looks like the storage might be some sort of
Active/Passive if it turned out that IO to one path works but IO to the other
does not. I did not think we would get 0x20000 errors though (0x20000 is
DID_BUS_BUSY, right?).

Is the path on which IO fails configured differently, or set up through more
switches or anything like that?

Could you also just test this with the newest kernel with all the new qlogic
patches?
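
For reference, the 0x20000 return code asked about above can be decoded by splitting the 32-bit SCSI result into its four bytes. This is a minimal sketch, assuming the usual Linux mid-layer layout (status in bits 0-7, message in bits 8-15, host in bits 16-23, driver in bits 24-31) and the DID_* host byte values from the kernel's scsi.h. The function and table names are illustrative only, and the table is not exhaustive.

```python
# Host byte values as defined in the Linux kernel's include/scsi/scsi.h.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(result):
    """Split a SCSI result word into its status, msg, host and driver bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def host_byte_name(result):
    """Return the symbolic DID_* name for the host byte, if it is in the table."""
    host = decode_scsi_result(result)["host"]
    return HOST_BYTE_NAMES.get(host, "unknown (0x%02x)" % host)


print(host_byte_name(0x20000))
```

With only the host byte set, the error is being reported by the host adapter driver rather than as a SCSI status from the target, which fits the suspicion that the driver is involved.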

Comment 9 Doug Chapman 2005-11-30 20:29:56 UTC

Previous to .24 it worked fine and was using both paths as Active-Active.  I
will try with the latest kernel and the older qlogic driver tomorrow.  I am out
of the office today and I had pulled a cable to work around this temporarily.  I
will try this out first thing tomorrow.

Comment 10 Doug Chapman 2005-11-30 22:51:15 UTC

I am having difficulty getting the 22.23 driver to load on 2.6.9-22.27.  Did we
have a kernel abi change for fc_attach_transport?  This is quite possibly user
error on my part so here is what I have done and what I am seeing:

1. copied kernel/drivers/scsi/qla1280.ko and kernel/drivers/scsi/qla2xxx/* from
the 2.6.9-22.23 tree to the 2.6.9-22.27 tree.

2. made a new initrd image via:
 mkinitrd /boot/efi/efi/redhat/initrd-2.6.9-22.27.EL-oldql.img 2.6.9-22.27.EL

3. modified elilo.conf to use this initrd with the 2.6.9-22.27 kernel

4. booted, ensured it was using the right initrd image

During bootup, or when I try loading the modules manually later I get:
ksign: module signed with unknown public key
- signature keyid: 681ed4f9b8c9fe81 ver=3
qla2xxx: disagrees about version of symbol fc_attach_transport
qla2xxx: Unknown symbol fc_attach_transport
ksign: module signed with unknown public key
- signature keyid: 681ed4f9b8c9fe81 ver=3
qla2300: Unknown symbol qla2x00_remove_one
qla2300: Unknown symbol qla2x00_probe_one


I see that fc_attach_transport is in scsi_transport_fc and that module is loaded.
Do we have a kabi change causing a problem here?
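
To separate the symbol-version mismatch from the follow-on "Unknown symbol" failures in the messages above, the log can be filtered mechanically. This is only a sketch: the regular expression is written against the exact message wording quoted in this comment, and the function names are hypothetical.

```python
import re

# Matches "<module>: disagrees about version of symbol <sym>" and
# "<module>: Unknown symbol <sym>" as they appear in the messages above.
SYMBOL_ERROR = re.compile(
    r"^(?P<module>\w+): "
    r"(?P<kind>disagrees about version of symbol|Unknown symbol) "
    r"(?P<symbol>\w+)$"
)


def parse_symbol_errors(log_text):
    """Return (module, kind, symbol) tuples for each symbol-related error line."""
    errors = []
    for line in log_text.splitlines():
        match = SYMBOL_ERROR.match(line.strip())
        if match:
            errors.append(
                (match.group("module"), match.group("kind"), match.group("symbol"))
            )
    return errors


def version_mismatches(log_text):
    """Return the set of symbols whose version checksum did not match."""
    return {
        symbol
        for _module, kind, symbol in parse_symbol_errors(log_text)
        if kind.startswith("disagrees")
    }


LOG = """\
ksign: module signed with unknown public key
- signature keyid: 681ed4f9b8c9fe81 ver=3
qla2xxx: disagrees about version of symbol fc_attach_transport
qla2xxx: Unknown symbol fc_attach_transport
ksign: module signed with unknown public key
- signature keyid: 681ed4f9b8c9fe81 ver=3
qla2300: Unknown symbol qla2x00_remove_one
qla2300: Unknown symbol qla2x00_probe_one
"""

print(version_mismatches(LOG))
```

On this log the only version mismatch is fc_attach_transport; the qla2300 errors look like a consequence of qla2xxx failing to load rather than separate problems.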

Comment 11 Jason Baron 2005-11-30 22:57:17 UTC

that's right. I guess you could try the old scsi_transport_fc module as well. If
that still doesn't work, i guess we could try backing out the qlogic changes.

Comment 12 Mike Christie 2005-12-01 00:12:22 UTC

You need the transport class module scsi_transport_fc too.

But could you also try the newest kernel with the driver that is in there? Just
in case it is a qlogic bug that got fixed.

Comment 13 Doug Chapman 2005-12-01 15:32:11 UTC

More updates:

The latest kernel - 2.6.9-24 exhibits the same problem.

I booted 2.6.9-22.27 with the qlogic driver and scsi_transport_fc from
2.6.9-22.23 and it worked OK.  The version of the qlogic driver there is
8.01.02-d2, whereas the later versions, where it doesn't work, are 8.01.02-d3.

So, I would say this definitely points to a driver regression.

Comment 15 Mike Christie 2005-12-01 18:37:28 UTC

What qlogic card was this with? Could I get access to the machine? The logs have
nothing too useful. I want to just peel off the patches that got merged and see
which one it is.

Comment 16 Doug Chapman 2005-12-01 19:28:48 UTC

HP refers to the card as: "PCI-X dual Channel 2Gb Fibre Channel HBA" A6826A

I assume qlogic has another name for it as well.

I am moving my system out of my private net onto the .lab.boston.redhat.com net
so you can access it.  I will send you that info via email.

Comment 17 Mike Christie 2005-12-01 19:52:05 UTC

I must have looked at the commit messages wrong. Ignore my comment #22.

Comment 18 Mike Christie 2005-12-01 20:09:25 UTC

That should have been comment #2.

Comment 19 Mike Christie 2005-12-01 22:34:15 UTC

Created attachment 121716 [details]
qla2xxx log

Attaching log for qlogic guys. Here is the proc output requested too.



[root@hpcp1 ~]# more /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
   Vendor: HP 36.4G Model: MAU3036NC	    Rev: HPC2
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
   Vendor: HP 36.4G Model: MAU3036NC	    Rev: HPC2
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 00
   Vendor: HP	    Model: A6188A	    Rev: HP14
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 01
   Vendor: HP	    Model: A6188A	    Rev: HP14
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 02
   Vendor: HP	    Model: A6188A	    Rev: HP14
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 03
   Vendor: HP	    Model: A6188A	    Rev: HP14
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
   Vendor: HP	    Model: A6188A	    Rev: HP14
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 01
   Vendor: HP	    Model: A6188A	    Rev: HP14
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 02
   Vendor: HP	    Model: A6188A	    Rev: HP14
   Type:   Direct-Access		    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 03
   Vendor: HP	    Model: A6188A	    Rev: HP14




  /proc/scsi/qla2xxx/<id> output

there are two hosts. IO to host2 is the problem.


[root@hpcp1 ~]# more /proc/scsi/qla2xxx/2
QLogic PCI to Fibre Channel Host Adapter for HP A6826-60001:
	 Firmware version 3.03.18 IPX, Driver version 8.01.02-d3-debug
ISP: ISP2312, Serial# M18661
Request Queue = 0x3cd00000, Response Queue = 0x40424a0000
Request Queue count = 2048, Response Queue count = 512
Total number of active commands = 0
Total number of interrupts = 366
     Device queue depth = 0x20
Number of free request entries = 1822
Number of mailbox timeouts = 0
Number of ISP aborts = 0
Number of loop resyncs = 0
Number of retries for empty slots = 0
Number of reqs in pending_q= 0, retry_q= 0, done_q= 0, scsi_retry_q= 0
Host adapter:loop state = <UPDATE>, flags = 0x1a03
Dpc flags = 0x4000000
MBX flags = 0x0
Link down Timeout = 008
Port down retry = 016
Login retry count = 016
Commands retried with dropped frame(s) = 0
Product ID = 4953 5020 2020 0003


SCSI Device Information:
scsi-qla0-adapter-node=50060b0000326599;
scsi-qla0-adapter-port=50060b0000326598;
scsi-qla0-target-0=50060b00001470a2;

FC Port Information:
scsi-qla0-port-0=50060b000008af41:50060b00001470a2:010000:81;
scsi-qla0-port-1=50060b000032659b:50060b000032659a:010500:82;

SCSI LUN Information:
(Id:Lun)  * - indicates lun is not registered with the OS.
( 0: 0): Total reqs 51, Pending reqs 0, flags 0x0, 0:0:81 00
( 0: 1): Total reqs 57, Pending reqs 0, flags 0x0, 0:0:81 00
( 0: 2): Total reqs 57, Pending reqs 0, flags 0x0, 0:0:81 00
( 0: 3): Total reqs 57, Pending reqs 0, flags 0x0, 0:0:81 00




[root@hpcp1 ~]# more /proc/scsi/qla2xxx/3
QLogic PCI to Fibre Channel Host Adapter for HP A6826-60001:
	 Firmware version 3.03.18 IPX, Driver version 8.01.02-d3-debug
ISP: ISP2312, Serial# M19173
Request Queue = 0x4042700000, Response Queue = 0x40436a0000
Request Queue count = 2048, Response Queue count = 512
Total number of active commands = 0
Total number of interrupts = 355
     Device queue depth = 0x20
Number of free request entries = 1820
Number of mailbox timeouts = 0
Number of ISP aborts = 0
Number of loop resyncs = 0
Number of retries for empty slots = 0
Number of reqs in pending_q= 0, retry_q= 0, done_q= 0, scsi_retry_q= 0
Host adapter:loop state = <READY>, flags = 0x1a03
Dpc flags = 0x4000000
MBX flags = 0x0
Link down Timeout = 008
Port down retry = 016
Login retry count = 016
Commands retried with dropped frame(s) = 0
Product ID = 4953 5020 2020 0003


SCSI Device Information:
scsi-qla1-adapter-node=50060b000032659b;
scsi-qla1-adapter-port=50060b000032659a;
scsi-qla1-target-0=50060b00001470a2;

FC Port Information:
scsi-qla1-port-0=50060b000008af41:50060b00001470a2:010000:81;
scsi-qla1-port-1=50060b0000326599:50060b0000326598:010100:82;

SCSI LUN Information:
(Id:Lun)  * - indicates lun is not registered with the OS.
( 0: 0): Total reqs 55, Pending reqs 0, flags 0x0, 1:0:81 00
( 0: 1): Total reqs 57, Pending reqs 0, flags 0x0, 1:0:81 00
( 0: 2): Total reqs 57, Pending reqs 0, flags 0x0, 1:0:81 00
( 0: 3): Total reqs 57, Pending reqs 0, flags 0x0, 1:0:81 00
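
One visible difference between the two dumps above is the loop state: host 2 reports <UPDATE> while host 3 reports <READY>. The following sketch pulls that field out of the proc text so the hosts can be compared quickly. It assumes the "loop state = <...>" wording shown in the output above, and the function name is made up for illustration.

```python
import re

# Matches the "Host adapter:loop state = <STATE>, flags = ..." line shown above.
LOOP_STATE = re.compile(r"loop state = <(?P<state>\w+)>")


def loop_state(proc_text):
    """Return the loop state string from qla2xxx proc output, or None if absent."""
    match = LOOP_STATE.search(proc_text)
    return match.group("state") if match else None


# Lines copied from the two dumps above.
HOST2 = "Host adapter:loop state = <UPDATE>, flags = 0x1a03"
HOST3 = "Host adapter:loop state = <READY>, flags = 0x1a03"

for name, text in (("host2", HOST2), ("host3", HOST3)):
    print(name, loop_state(text))
```

On a live system the same function could be fed the contents of /proc/scsi/qla2xxx/<id> for each host.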

Comment 20 Mike Christie 2005-12-02 01:10:13 UTC

Created attachment 121726 [details]
bad patch

It looks like the problem is with the update to the update. Specifically it was
this patch.

Comment 27 Tom Coughlan 2005-12-12 15:45:24 UTC

*** Bug 175534 has been marked as a duplicate of this bug. ***

Comment 28 Doug Chapman 2005-12-12 16:22:31 UTC

FYI 2.6.9-24.1.EL does resolve my problem.  thanks.

Comment 29 John Shakshober 2005-12-12 16:59:59 UTC

Confirmed -24.1 worked for basic connectivity to fiber luns

[root@perf3 ~]# uname -a
Linux perf3.lab.boston.redhat.com 2.6.9-24.1.ELsmp #1 SMP Fri Dec 9 14:27:54 EST
2005 x86_64 x86_64 x86_64 GNU/Linux

[root@perf3 ~]# mount -t ext3 /dev/sdl1 /perf1
[root@perf3 ~]# df /perf1
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdl1            122517848     93792 116200480   1% /perf1

Comment 30 Mike Christie 2005-12-12 18:54:39 UTC

John, does your comment mean it only worked in that one setup, or did it
work in the setup you originally reported the bug for?

Comment 31 John Shakshober 2005-12-12 19:23:47 UTC

Just tried it on the 2nd setup x86_64 (em64T)... 2.6.9-24.1.ELsmp works there too.

Performance runs on this one will happen sequentially to the 1st machine.

Comment 33 Red Hat Bugzilla 2006-03-07 20:53:22 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html
