From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050719 Red Hat/1.0.6-1.4.1 Firefox/1.0.6

Description of problem:
Starting with 2.6.9-22.24.EL I am seeing SCSI errors from an HP storage array connected via a qlogic FC on my HP Integrity ia64 system. Rebooting back to 2.6.9-22.23.EL cleans the problem up. The later kernels up to kernel-2.6.9-22.27.EL have this problem as well. The errors are seen during bootup when LVM is looking for volumes:

Scanning logical volumes
Reading all physical volumes. This may take a while...
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 0
Buffer I/O error on device sdd, logical block 0
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 32
Buffer I/O error on device sdd, logical block 1
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 64
Buffer I/O error on device sdd, logical block 2
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 96
Buffer I/O error on device sdd, logical block 3
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 0
Buffer I/O error on device sdd, logical block 0
/dev/sdd: read failed after 0 of 16384 at 0: Input/output error
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 104857472
/dev/sdd: read failed after 0 of 16384 at 53687025664: Input/output error
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 0
/dev/sdd: read failed after 0 of 16384 at 0: Input/output error
SCSI error : <2 0 0 2> return code = 0x20000
end_request: I/O error, dev sde, sector 0
Buffer I/O error on device sde, logical block 0
SCSI error : <2 0 0 2> return code = 0x20000
end_request: I/O error, dev sde, sector 32
Buffer I/O error on device sde, logical block 1
SCSI error : <2 0 0 2> return code = 0x20000
end_request: I/O error, dev sde, sector 64
... and so on ...

Once booted, the messages continue to occur every few minutes while idle.

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.24.EL

How reproducible:
Always

Steps to Reproduce:
1. Boot an HP Integrity system with qlogic FC connected to an HP VA7100 array

Actual Results:
SCSI errors and very slow booting
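For anyone triaging output like the above, here is a minimal Python sketch (not part of the original report) that tallies the end_request errors per device and converts the failing sectors to byte offsets. It assumes 512-byte sectors, which is consistent with the log: sector 104857472 corresponds to the read failure at byte 53687025664. The function name and the sample text are illustrative only.

```python
import re
from collections import Counter

SECTOR_SIZE = 512  # assumption: 512-byte sectors, consistent with the log above

# Matches lines such as: end_request: I/O error, dev sdd, sector 32
END_REQUEST = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize_io_errors(log_text):
    """Return (error count per device, set of failing byte offsets per device)."""
    counts = Counter()
    offsets = {}
    for dev, sector in END_REQUEST.findall(log_text):
        counts[dev] += 1
        offsets.setdefault(dev, set()).add(int(sector) * SECTOR_SIZE)
    return counts, offsets


# A few lines copied from the report, used as sample input.
SAMPLE = """\
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 0
Buffer I/O error on device sdd, logical block 0
SCSI error : <2 0 0 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 104857472
SCSI error : <2 0 0 2> return code = 0x20000
end_request: I/O error, dev sde, sector 0
"""

if __name__ == "__main__":
    counts, offsets = summarize_io_errors(SAMPLE)
    for dev in sorted(counts):
        print(dev, counts[dev], sorted(offsets[dev]))
```

The regular expression works on the text as a whole, so it does not matter whether the log lines are separated by newlines or run together.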
Looks like it might be a connection problem. Could you send the rest of the log? Are there any other qlogic/qla2xxx or scsi messages in the log?
It looks like the larger qlogic update got in at 22.11 here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=168544
And all that got merged into 22.24 was transport rediscovery:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=149294
I was going to ask you to back out that patch, but the patch does not touch the driver's core code. It just adds a sysfs attribute.
Created attachment 121603 [details] dmesg from booting with scsi errs
I have attached the dmesg output from booting with 2.6.9-23. I don't notice any other SCSI messages beyond what I had posted; however, I am no storage expert. If the qlogic driver did not change then perhaps it is a dm issue? These disks are dual path. I will pull one path and reboot to see if the problem goes away.
The problems go away when I remove the second FC path to the disks. Looks like this is probably a device mapper bug, not a qlogic driver bug. Sorry for my incorrect assumption on this.
I see this was set as closed. Did I accidentally do that? It is still a problem, just not a qlogic driver problem. It needs to be assigned to the dm maintainer.
ok. i'm not sure if this is a dm problem
Can you verify that with the setup that works for pre-.24 kernels everything was plugged in, or did you have it set up with only one path originally when it worked? From comment #5 it looks like the storage might be some sort of Active/Passive if it turned out that IO to one path works but IO to the other did not. I did not think we would get 0x20000 errors though (0x20000 is DID_BUS_BUSY, right?). Is the path where we get failed IO configured differently, or set up through more switches, or anything like that? Could you also just test this with the newest kernel with all the new qlogic patches?
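To double-check the 0x20000 reading, here is a small Python sketch (not from the thread) that splits a SCSI result word into its four bytes. It assumes the layout used by the 2.6-era kernel headers: status byte in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, and driver byte in bits 24-31. The host byte table lists only the first few DID_* codes as I understand them from include/scsi/scsi.h; treat it as illustrative rather than complete.

```python
# Host byte codes, assuming the 2.6-era include/scsi/scsi.h values.
# Only a subset is listed; anything else is reported as unknown.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four component bytes."""
    return {
        "status_byte": result & 0xFF,
        "msg_byte": (result >> 8) & 0xFF,
        "host_byte": (result >> 16) & 0xFF,
        "driver_byte": (result >> 24) & 0xFF,
    }


def host_byte_name(result):
    """Return the symbolic name of the host byte, or a hex fallback."""
    host = decode_scsi_result(result)["host_byte"]
    return HOST_BYTE_NAMES.get(host, "unknown(0x%02x)" % host)


if __name__ == "__main__":
    print(decode_scsi_result(0x20000))
    print(host_byte_name(0x20000))
```

With this layout, 0x20000 has a host byte of 0x02 and all other bytes zero, which matches the DID_BUS_BUSY reading above.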
Previous to .24 it worked fine and was using both paths as Active-Active. I will try with the latest kernel and the older qlogic driver tomorrow. I am out of the office today and I had pulled a cable to work around this temporarily. I will try this out first thing tomorrow.
I am having difficulty getting the 22.23 driver to load on 2.6.9-22.27. Did we have a kernel abi change for fc_attach_transport? This is quite possibly user error on my part, so here is what I have done and what I am seeing:

1. copied kernel/drivers/scsi/qla1280.ko and kernel/drivers/scsi/qla2xxx/* from the 2.6.9-22.23 tree to the 2.6.9-22.27 tree.
2. made a new initrd image via: mkinitrd /boot/efi/efi/redhat/initrd-2.6.9-22.27.EL-oldql.img 2.6.9-22.27.EL
3. modified elilo.conf to use this initrd with the 2.6.9-22.27 kernel
4. booted, ensured it was using the right initrd image

During bootup, or when I try loading the modules manually later, I get:

ksign: module signed with unknown public key - signature keyid: 681ed4f9b8c9fe81 ver=3
qla2xxx: disagrees about version of symbol fc_attach_transport
qla2xxx: Unknown symbol fc_attach_transport
ksign: module signed with unknown public key - signature keyid: 681ed4f9b8c9fe81 ver=3
qla2300: Unknown symbol qla2x00_remove_one
qla2300: Unknown symbol qla2x00_probe_one

I see that fc_attach_transport is in scsi_transport_fc and that module is loaded. Do we have a kabi change causing a problem here?
that's right. I guess you could try the old scsi_transport_fc module as well. If that still doesn't work, i guess we could try backing out the qlogic changes.
You need the transport class module scsi_transport_fc too. But could you also try the newest kernel with the driver that is in there? Just in case it is a qlogic bug that got fixed.
More updates: The latest kernel, 2.6.9-24, exhibits the same problem. I booted 2.6.9-22.27 with the qlogic driver and scsi_transport_fc from 2.6.9-22.23 and it worked OK. The version of that qlogic driver is 8.01.02-d2, whereas the later versions, where it doesn't work, are 8.01.02-d3. So, I would say this definitely points to a driver regression.
What qlogic card was this with? Could I get access to the machine? The logs have nothing too useful. I want to just peel off the patches that got merged and see which one it is.
HP refers to the card as: "PCI-X dual Channel 2Gb Fibre Channel HBA" A6826A. I assume qlogic has another name for it as well. I am moving my system out of my private net onto the .lab.boston.redhat.com net so you can access it. I will send you that info via email.
I must have looked at the commit messages wrong. Ignore my comment #22.
That should have been comment #2.
Created attachment 121716 [details] qla2xxx log

Attaching log for qlogic guys. Here is the proc output requested too.

[root@hpcp1 ~]# more /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: HP 36.4G Model: MAU3036NC Rev: HPC2
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: HP 36.4G Model: MAU3036NC Rev: HPC2
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: HP Model: A6188A Rev: HP14
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 01
  Vendor: HP Model: A6188A Rev: HP14
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 02
  Vendor: HP Model: A6188A Rev: HP14
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 03
  Vendor: HP Model: A6188A Rev: HP14
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: HP Model: A6188A Rev: HP14
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 01
  Vendor: HP Model: A6188A Rev: HP14
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 02
  Vendor: HP Model: A6188A Rev: HP14
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 03
  Vendor: HP Model: A6188A Rev: HP14

/proc/scsi/qla2xxx/<id> output: there are two hosts. IO to host2 is the problem.

[root@hpcp1 ~]# more /proc/scsi/qla2xxx/2
QLogic PCI to Fibre Channel Host Adapter for HP A6826-60001:
Firmware version 3.03.18 IPX, Driver version 8.01.02-d3-debug
ISP: ISP2312, Serial# M18661
Request Queue = 0x3cd00000, Response Queue = 0x40424a0000
Request Queue count = 2048, Response Queue count = 512
Total number of active commands = 0
Total number of interrupts = 366
Device queue depth = 0x20
Number of free request entries = 1822
Number of mailbox timeouts = 0
Number of ISP aborts = 0
Number of loop resyncs = 0
Number of retries for empty slots = 0
Number of reqs in pending_q= 0, retry_q= 0, done_q= 0, scsi_retry_q= 0
Host adapter:loop state = <UPDATE>, flags = 0x1a03
Dpc flags = 0x4000000
MBX flags = 0x0
Link down Timeout = 008
Port down retry = 016
Login retry count = 016
Commands retried with dropped frame(s) = 0
Product ID = 4953 5020 2020 0003

SCSI Device Information:
scsi-qla0-adapter-node=50060b0000326599;
scsi-qla0-adapter-port=50060b0000326598;
scsi-qla0-target-0=50060b00001470a2;

FC Port Information:
scsi-qla0-port-0=50060b000008af41:50060b00001470a2:010000:81;
scsi-qla0-port-1=50060b000032659b:50060b000032659a:010500:82;

SCSI LUN Information:
(Id:Lun) * - indicates lun is not registered with the OS.
( 0: 0): Total reqs 51, Pending reqs 0, flags 0x0, 0:0:81 00
( 0: 1): Total reqs 57, Pending reqs 0, flags 0x0, 0:0:81 00
( 0: 2): Total reqs 57, Pending reqs 0, flags 0x0, 0:0:81 00
( 0: 3): Total reqs 57, Pending reqs 0, flags 0x0, 0:0:81 00

[root@hpcp1 ~]# more /proc/scsi/qla2xxx/3
QLogic PCI to Fibre Channel Host Adapter for HP A6826-60001:
Firmware version 3.03.18 IPX, Driver version 8.01.02-d3-debug
ISP: ISP2312, Serial# M19173
Request Queue = 0x4042700000, Response Queue = 0x40436a0000
Request Queue count = 2048, Response Queue count = 512
Total number of active commands = 0
Total number of interrupts = 355
Device queue depth = 0x20
Number of free request entries = 1820
Number of mailbox timeouts = 0
Number of ISP aborts = 0
Number of loop resyncs = 0
Number of retries for empty slots = 0
Number of reqs in pending_q= 0, retry_q= 0, done_q= 0, scsi_retry_q= 0
Host adapter:loop state = <READY>, flags = 0x1a03
Dpc flags = 0x4000000
MBX flags = 0x0
Link down Timeout = 008
Port down retry = 016
Login retry count = 016
Commands retried with dropped frame(s) = 0
Product ID = 4953 5020 2020 0003

SCSI Device Information:
scsi-qla1-adapter-node=50060b000032659b;
scsi-qla1-adapter-port=50060b000032659a;
scsi-qla1-target-0=50060b00001470a2;

FC Port Information:
scsi-qla1-port-0=50060b000008af41:50060b00001470a2:010000:81;
scsi-qla1-port-1=50060b0000326599:50060b0000326598:010100:82;

SCSI LUN Information:
(Id:Lun) * - indicates lun is not registered with the OS.
( 0: 0): Total reqs 55, Pending reqs 0, flags 0x0, 1:0:81 00
( 0: 1): Total reqs 57, Pending reqs 0, flags 0x0, 1:0:81 00
( 0: 2): Total reqs 57, Pending reqs 0, flags 0x0, 1:0:81 00
( 0: 3): Total reqs 57, Pending reqs 0, flags 0x0, 1:0:81 00
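One visible difference between the two dumps above is the loop state: host 2 reports <UPDATE> while host 3 reports <READY>. A minimal Python sketch for pulling that field out of such a dump follows. It is not part of the original thread: the function names are hypothetical, the regular expression is written against the output format shown here, and the directory walk only makes sense on kernels whose qla2xxx driver still exposes /proc/scsi/qla2xxx/<id>.

```python
import re
from pathlib import Path

# Matches the line: Host adapter:loop state = <UPDATE>, flags = 0x1a03
LOOP_STATE = re.compile(r"loop state = <(\w+)>")


def loop_state(proc_text):
    """Return the loop state string from one qla2xxx proc dump, or None."""
    match = LOOP_STATE.search(proc_text)
    return match.group(1) if match else None


def loop_states_from_proc(proc_dir="/proc/scsi/qla2xxx"):
    """Map host id -> loop state for every file under proc_dir.

    Assumption: the driver exposes one file per host in this directory,
    as in the output above. Returns an empty dict if the directory is absent.
    """
    states = {}
    for entry in sorted(Path(proc_dir).glob("*")):
        if entry.is_file():
            states[entry.name] = loop_state(entry.read_text(errors="replace"))
    return states


# Lines copied from the two dumps in this comment.
HOST2_SAMPLE = "Host adapter:loop state = <UPDATE>, flags = 0x1a03"
HOST3_SAMPLE = "Host adapter:loop state = <READY>, flags = 0x1a03"

if __name__ == "__main__":
    print("host2:", loop_state(HOST2_SAMPLE))
    print("host3:", loop_state(HOST3_SAMPLE))
```

Whether the <UPDATE> state is the cause of the failed IO or just a symptom is not something this sketch can tell; it only makes the difference easy to spot across hosts.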
Created attachment 121726 [details] bad patch

It looks like the problem is with the update to the update. Specifically it was this patch.
*** Bug 175534 has been marked as a duplicate of this bug. ***
FYI 2.6.9-24.1.EL does resolve my problem. thanks.
Confirmed -24.1 worked for basic connectivity to fiber luns

[root@perf3 ~]# uname -a
Linux perf3.lab.boston.redhat.com 2.6.9-24.1.ELsmp #1 SMP Fri Dec 9 14:27:54 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
[root@perf3 ~]# mount -t ext3 /dev/sdl1 /perf1
[root@perf3 ~]# df /perf1
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdl1 122517848 93792 116200480 1% /perf1
John, does your comment mean it only worked in that one setup, or did it work in the setup you originally reported the bug for?
Just tried it on the 2nd setup x86_64 (em64T)... 2.6.9-24.1.ELsmp works there too. Performance runs on this one will follow those on the 1st machine.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html