Bug 144360 (IT_53953) - Fibre Channel tape speed regression (qla2200)
Summary: Fibre Channel tape speed regression (qla2200)
Keywords:
Status: CLOSED ERRATA
Alias: IT_53953
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Tom Coughlan
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 132991
TreeView+ depends on / blocked
 
Reported: 2005-01-06 15:18 UTC by Bastien Nocera
Modified: 2007-11-30 22:07 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-18 13:29:04 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
gzipped debug qla2200.o for 2.4.21-20.ELsmp (152.91 KB, application/octet-stream)
2005-01-10 15:42 UTC, Tom Coughlan
no flags Details
two QLogic drivers for testing on 2.4.21-20.ELsmp (298.36 KB, application/octet-stream)
2005-02-06 19:22 UTC, Tom Coughlan
no flags Details
patch to add a parameter to control RIO. (1.29 KB, patch)
2005-02-14 13:50 UTC, Tom Coughlan
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:294 0 normal SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 5 2005-05-18 04:00:00 UTC

Description Bastien Nocera 2005-01-06 15:18:30 UTC
1) On a 2.4.21-15.0.3.ELsmp kernel (qla2200 6.x driver):
$ time dd if=/dev/nst0 of=/dev/null bs=32760
32776+1 records in
32776+1 records out

real    0m33.555s
user    0m0.040s
sys     0m0.830s

2) On a 2.4.21-20.ELsmp kernel (qla2200 7.x driver):
$ time dd if=/dev/nst0 of=/dev/null bs=32760
32776+1 records in
32776+1 records out

real    1m1.070s
user    0m0.020s
sys     0m0.760s

Using the alternative 6.x driver on the -20.RL kernel (U3) gives the
expected performance (as in 1).
Using the 7.x driver in the -25.EL kernel (U4) halves the speed (as in 2).

Comment 8 Tom Coughlan 2005-01-10 15:42:55 UTC
Created attachment 109560 [details]
gzipped debug qla2200.o for 2.4.21-20.ELsmp

QLogic said they are looking into the problem and will give me status this
afternoon.

In the meantime, if the customer is willing to try this debug driver, it may
provide useful information.  The is the stock 2.4.21-20.ELsmp driver with  

#define  DEBUG_QLA2100		       1

in qla_settings.h.

Load the module with the module parameter: extended_error_logging=1

This will put lots of debug messages in /var/log/messages. It may indicate that
there are errors and retries occurring that are impacting performance. Please
post the result.

Comment 9 Tom Coughlan 2005-01-10 16:07:35 UTC
Just to be clear, please have them load the debug driver and run their
performance test. Then have them send the messages that show the driver being
loaded, and any messaages that occur during the test.

Comment 11 Bastien Nocera 2005-01-26 11:32:44 UTC
This is (apparently) all the output:

Jan 17 18:59:14 tpsrv127 kernel: qla2x00_set_info starts at address = f8a54060
Jan 17 18:59:14 tpsrv127 kernel: qla2x00: Found  VID=1077 DID=2200 SSVID=1077
SSDID=2
Jan 17 18:59:14 tpsrv127 kernel: scsi(0): Found a QLA2200  @ bus 1, device 0x2,
irq 26, iobase 0xf8839000
Jan 17 18:59:14 tpsrv127 kernel: scsi(0): 64 Bit PCI Addressing Enabled.
Jan 17 18:59:14 tpsrv127 kernel: scsi(0): Scatter/Gather entries= 32
Jan 17 18:59:15 tpsrv127 kernel: Configure PCI space for adapter...
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Configure NVRAM parameters...
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_nvram_config(0):operating mode=0
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_nvram_config link_down_timeout=0x0
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Verifying loaded RISC code...
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_initialize_adapter(): check if firmware
needs to be loaded
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Verifying chip...
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_setup_chip(): Verifying Check Sum of
loaded RISC code.
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_setup_chip(): CS Ok, Start firmware running
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Chip verified and RISC loaded...
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_init_rings(0): issue init firmware.
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): LIP reset occurred.
Jan 17 18:59:15 tpsrv127 kernel: scsi(0) qla2x00_isr: asynchronous MBA_LIP_RESET.
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_init_rings OK, call qla2x00_fw_ready...
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Waiting for LIP to complete...
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): LIP occurred.
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_isr(): asynchronous MBA_LIP_OCCURRED.
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): LOOP UP detected.
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_isr(): asynchronous MBA_LOOP_UP.
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Port database changed.
Jan 17 18:59:15 tpsrv127 kernel: scsi0 qla2x00_isr: asynchronous MBA_PORT_UPDATE.
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_get_retry_cnt(0): done. mb3=40 ratov=2.
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_fw_ready(0): F/W Ready - OK
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_fw_ready(0): fw_state=3 curr time=12a69.
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Topology - (Loop), Host Loop address 0x0
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_configure_loop(): dpc flags =0x1000100
Jan 17 18:59:15 tpsrv127 kernel: lun_discovery: Lun Inquiry - fcport=f787e380,
lun (0)
Jan 17 18:59:15 tpsrv127 kernel: lun_discovery(0): fcport lun count=1, fcport=
f787e380
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Enter
qla2x00_configure_fabric:hba=f7694100
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_mailbox_command(0): **** FAILED.
mbx0=4006, mbx1=7e, mbx2=b5d2, cmd=6a ****
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_get_port_name(0): failed=1.
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_configure_fabric(): MBC_GET_PORT_NAME
Failed, No FL Port
Jan 17 18:59:15 tpsrv127 kernel: Alloc Target 0 @ c3b98800
Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Assigning target ID=00 @ c3b98800 to
loop id=0x007d, port state=0x3, port down retry=60
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_configure_loop(0): LOOP READY
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_do_dpc(): Wake up parent -1
Jan 17 18:59:15 tpsrv127 kernel: DEBUG: detect hba 0 at address = f7694100 -
adding to hba list
Jan 17 18:59:15 tpsrv127 kernel: scsi-qla0-adapter-node=200000e08b0ce034;
Jan 17 18:59:15 tpsrv127 kernel: scsi-qla0-adapter-port=210000e08b0ce034;
Jan 17 18:59:15 tpsrv127 kernel: scsi-qla0-tgt-0-di-0-port=500104f0007766de;
Jan 17 18:59:15 tpsrv127 kernel: qla2x00_detect num_hosts=0
Jan 17 18:59:15 tpsrv127 kernel: apidev_init(): Creating (qla2200)
qla2200/HbaApiNode major=254
Jan 17 18:59:15 tpsrv127 kernel: scsi0 : QLogic QLA2200 PCI to Fibre Channel
Host Adapter: bus 1 device 2 irq 26
Jan 17 18:59:15 tpsrv127 kernel:         Firmware version:  2.02.06, Driver
version 7.00.03-RH1-debug
Jan 17 18:59:15 tpsrv127 kernel:
Jan 17 18:59:15 tpsrv127 kernel: blk: queue f6fa5c18, I/O limit 4294967295Mb
(mask 0xffffffffffffffff)
Jan 17 18:59:15 tpsrv127 kernel: qla0:0:0 UNDERRUN status detected 0x15-0x800.
Jan 17 18:59:15 tpsrv127 kernel:   Vendor: STK       Model: T9940B           
Rev: 1.34
Jan 17 18:59:15 tpsrv127 kernel:   Type:   Sequential-Access                 
ANSI SCSI revision: 03
Jan 17 18:59:15 tpsrv127 kernel: blk: queue f6fa5e18, I/O limit 4294967295Mb
(mask 0xffffffffffffffff)
Jan 17 18:59:15 tpsrv127 kernel: Attached scsi tape st0 at scsi0, channel 0, id
0, lun 0
Jan 17 18:59:30 tpsrv127 kernel: st0: Block limits 1 - 262144 bytes.


Comment 12 Tom Coughlan 2005-01-29 19:20:58 UTC
These messages show the driver being loaded. There is no indication 
of a problem here.  

Where there any subsequent messages, during the time that the tape 
I/O test was being run? 

Comment 15 Tom Coughlan 2005-02-06 19:22:44 UTC
Created attachment 110704 [details]
two QLogic drivers for testing on 2.4.21-20.ELsmp

QLogic has tried to reproduce the performance degradation with the v7 driver,
and they have not seen it. I don't have access to a FC tape, so I'd like to ask
the customer to try some test drivers for us. The attached tarball has two
drivers:

1) the latest update from QLogic, v7.04.00. I have no reason to believe this
will fix the problem, but it may, since it has all the latest fixes. 

2) the 7.00.03-RH1 driver from U3, with one change backed out:

--- qla2200/qla2x00.c~	2005-02-05 10:16:10.000000000 -0500
+++ qla2200/qla2x00.c	2005-02-05 12:34:27.000000000 -0500
@@ -7966,7 +7966,7 @@
		additional_firmware_options.enable_fc_confirm = 1;
 #endif
 #if defined(ISP2200)
-		additional_firmware_options.operation_mode = 4;
+/*		additional_firmware_options.operation_mode = 4; */
		icb->response_accum_timer = 3;
		icb->interrupt_delay_timer = 5;
 #endif

The v6 driver did not set this operation mode, so I'd like to see if it is
responsible for the performance problem.

Please run the tape performace test with these drivers and post the results.

Tom

Comment 19 Tom Coughlan 2005-02-11 20:09:11 UTC
Just let me know if you want 2.4.21-27 drivers.

Comment 22 Tom Coughlan 2005-02-14 13:43:23 UTC
The QLogic firmware implements a feature called Reduced Interrupt Operation
(RIO). This feature has several modes, controlled by the operation_mode setting. 

The V6 driver uses the NVRAM setting for operation_mode. The v7 driver forces
mode 4 for the qla2200. I would like to know which mode your adapter NVRAM is
in, since this mode is getting much better tape performance than mode 4. 

Please use the QLogic BIOS utility to determine which mode your adapter is set
for.  Press alt-Q duting power up self-test, select "configuration options" then
"extended firmware options" then "RIO Operation Mode". The adapter will be in
one of five modes. Please post the mode that your adapter is using. 

Comment 23 Tom Coughlan 2005-02-14 13:50:05 UTC
Created attachment 111052 [details]
patch to add a parameter to control RIO.

This is the patch I am currently testing for the 7.01.01 driver in RHEL 3 U5.

The customer must add this module parameter to modules.conf to restore the v6
RIO behavior:

ql2xriofromnvram=1

Comment 24 Ernie Petrides 2005-02-17 13:50:37 UTC
A fix for this problem has just been committed to the RHEL3 U5
patch pool this evening (in kernel version 2.4.21-27.15.EL).


Comment 25 Tim Powers 2005-05-18 13:29:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html


Comment 26 KELEMEN Péter 2005-05-20 08:10:40 UTC
Okay, finally I got to be around when a downtime was scheduled and could poke in
Fast!UTIL.  Apparently it uses RIO mode 0 (zero) with response timer / interrupt
delay timer both set to zero.

HTH,
Peter

Comment 27 KELEMEN Péter 2005-05-27 10:42:06 UTC
RHEL3 U5 drivers work nicely with the suggested bypass.  Mode 0 (64-bit by
default, as reported by the driver) works fine.

Thanks,
Peter


Note You need to log in before you can comment on or make changes to this bug.