1) On a 2.4.21-15.0.3.ELsmp kernel (qla2200 6.x driver): $ time dd if=/dev/nst0 of=/dev/null bs=32760 32776+1 records in 32776+1 records out real 0m33.555s user 0m0.040s sys 0m0.830s 2) On a 2.4.21-20.ELsmp kernel (qla2200 7.x driver): $ time dd if=/dev/nst0 of=/dev/null bs=32760 32776+1 records in 32776+1 records out real 1m1.070s user 0m0.020s sys 0m0.760s Using the alternative 6.x driver on the -20.RL kernel (U3) gives the expected performance (as in 1). Using the 7.x driver in the -25.EL kernel (U4) halves the speed (as in 2).
Created attachment 109560 [details] gzipped debug qla2200.o for 2.4.21-20.ELsmp QLogic said they are looking into the problem and will give me status this afternoon. In the meantime, if the customer is willing to try this debug driver, it may provide useful information. The is the stock 2.4.21-20.ELsmp driver with #define DEBUG_QLA2100 1 in qla_settings.h. Load the module with the module parameter: extended_error_logging=1 This will put lots of debug messages in /var/log/messages. It may indicate that there are errors and retries occurring that are impacting performance. Please post the result.
Just to be clear, please have them load the debug driver and run their performance test. Then have them send the messages that show the driver being loaded, and any messaages that occur during the test.
This is (apparently) all the output: Jan 17 18:59:14 tpsrv127 kernel: qla2x00_set_info starts at address = f8a54060 Jan 17 18:59:14 tpsrv127 kernel: qla2x00: Found VID=1077 DID=2200 SSVID=1077 SSDID=2 Jan 17 18:59:14 tpsrv127 kernel: scsi(0): Found a QLA2200 @ bus 1, device 0x2, irq 26, iobase 0xf8839000 Jan 17 18:59:14 tpsrv127 kernel: scsi(0): 64 Bit PCI Addressing Enabled. Jan 17 18:59:14 tpsrv127 kernel: scsi(0): Scatter/Gather entries= 32 Jan 17 18:59:15 tpsrv127 kernel: Configure PCI space for adapter... Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Configure NVRAM parameters... Jan 17 18:59:15 tpsrv127 kernel: qla2x00_nvram_config(0):operating mode=0 Jan 17 18:59:15 tpsrv127 kernel: qla2x00_nvram_config link_down_timeout=0x0 Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Verifying loaded RISC code... Jan 17 18:59:15 tpsrv127 kernel: qla2x00_initialize_adapter(): check if firmware needs to be loaded Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Verifying chip... Jan 17 18:59:15 tpsrv127 kernel: qla2x00_setup_chip(): Verifying Check Sum of loaded RISC code. Jan 17 18:59:15 tpsrv127 kernel: qla2x00_setup_chip(): CS Ok, Start firmware running Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Chip verified and RISC loaded... Jan 17 18:59:15 tpsrv127 kernel: qla2x00_init_rings(0): issue init firmware. Jan 17 18:59:15 tpsrv127 kernel: scsi(0): LIP reset occurred. Jan 17 18:59:15 tpsrv127 kernel: scsi(0) qla2x00_isr: asynchronous MBA_LIP_RESET. Jan 17 18:59:15 tpsrv127 kernel: qla2x00_init_rings OK, call qla2x00_fw_ready... Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Waiting for LIP to complete... Jan 17 18:59:15 tpsrv127 kernel: scsi(0): LIP occurred. Jan 17 18:59:15 tpsrv127 kernel: qla2x00_isr(): asynchronous MBA_LIP_OCCURRED. Jan 17 18:59:15 tpsrv127 kernel: scsi(0): LOOP UP detected. Jan 17 18:59:15 tpsrv127 kernel: qla2x00_isr(): asynchronous MBA_LOOP_UP. Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Port database changed. Jan 17 18:59:15 tpsrv127 kernel: scsi0 qla2x00_isr: asynchronous MBA_PORT_UPDATE. Jan 17 18:59:15 tpsrv127 kernel: qla2x00_get_retry_cnt(0): done. mb3=40 ratov=2. Jan 17 18:59:15 tpsrv127 kernel: qla2x00_fw_ready(0): F/W Ready - OK Jan 17 18:59:15 tpsrv127 kernel: qla2x00_fw_ready(0): fw_state=3 curr time=12a69. Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Topology - (Loop), Host Loop address 0x0 Jan 17 18:59:15 tpsrv127 kernel: qla2x00_configure_loop(): dpc flags =0x1000100 Jan 17 18:59:15 tpsrv127 kernel: lun_discovery: Lun Inquiry - fcport=f787e380, lun (0) Jan 17 18:59:15 tpsrv127 kernel: lun_discovery(0): fcport lun count=1, fcport= f787e380 Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Enter qla2x00_configure_fabric:hba=f7694100 Jan 17 18:59:15 tpsrv127 kernel: qla2x00_mailbox_command(0): **** FAILED. mbx0=4006, mbx1=7e, mbx2=b5d2, cmd=6a **** Jan 17 18:59:15 tpsrv127 kernel: qla2x00_get_port_name(0): failed=1. Jan 17 18:59:15 tpsrv127 kernel: qla2x00_configure_fabric(): MBC_GET_PORT_NAME Failed, No FL Port Jan 17 18:59:15 tpsrv127 kernel: Alloc Target 0 @ c3b98800 Jan 17 18:59:15 tpsrv127 kernel: scsi(0): Assigning target ID=00 @ c3b98800 to loop id=0x007d, port state=0x3, port down retry=60 Jan 17 18:59:15 tpsrv127 kernel: qla2x00_configure_loop(0): LOOP READY Jan 17 18:59:15 tpsrv127 kernel: qla2x00_do_dpc(): Wake up parent -1 Jan 17 18:59:15 tpsrv127 kernel: DEBUG: detect hba 0 at address = f7694100 - adding to hba list Jan 17 18:59:15 tpsrv127 kernel: scsi-qla0-adapter-node=200000e08b0ce034; Jan 17 18:59:15 tpsrv127 kernel: scsi-qla0-adapter-port=210000e08b0ce034; Jan 17 18:59:15 tpsrv127 kernel: scsi-qla0-tgt-0-di-0-port=500104f0007766de; Jan 17 18:59:15 tpsrv127 kernel: qla2x00_detect num_hosts=0 Jan 17 18:59:15 tpsrv127 kernel: apidev_init(): Creating (qla2200) qla2200/HbaApiNode major=254 Jan 17 18:59:15 tpsrv127 kernel: scsi0 : QLogic QLA2200 PCI to Fibre Channel Host Adapter: bus 1 device 2 irq 26 Jan 17 18:59:15 tpsrv127 kernel: Firmware version: 2.02.06, Driver version 7.00.03-RH1-debug Jan 17 18:59:15 tpsrv127 kernel: Jan 17 18:59:15 tpsrv127 kernel: blk: queue f6fa5c18, I/O limit 4294967295Mb (mask 0xffffffffffffffff) Jan 17 18:59:15 tpsrv127 kernel: qla0:0:0 UNDERRUN status detected 0x15-0x800. Jan 17 18:59:15 tpsrv127 kernel: Vendor: STK Model: T9940B Rev: 1.34 Jan 17 18:59:15 tpsrv127 kernel: Type: Sequential-Access ANSI SCSI revision: 03 Jan 17 18:59:15 tpsrv127 kernel: blk: queue f6fa5e18, I/O limit 4294967295Mb (mask 0xffffffffffffffff) Jan 17 18:59:15 tpsrv127 kernel: Attached scsi tape st0 at scsi0, channel 0, id 0, lun 0 Jan 17 18:59:30 tpsrv127 kernel: st0: Block limits 1 - 262144 bytes.
These messages show the driver being loaded. There is no indication of a problem here. Where there any subsequent messages, during the time that the tape I/O test was being run?
Created attachment 110704 [details] two QLogic drivers for testing on 2.4.21-20.ELsmp QLogic has tried to reproduce the performance degradation with the v7 driver, and they have not seen it. I don't have access to a FC tape, so I'd like to ask the customer to try some test drivers for us. The attached tarball has two drivers: 1) the latest update from QLogic, v7.04.00. I have no reason to believe this will fix the problem, but it may, since it has all the latest fixes. 2) the 7.00.03-RH1 driver from U3, with one change backed out: --- qla2200/qla2x00.c~ 2005-02-05 10:16:10.000000000 -0500 +++ qla2200/qla2x00.c 2005-02-05 12:34:27.000000000 -0500 @@ -7966,7 +7966,7 @@ additional_firmware_options.enable_fc_confirm = 1; #endif #if defined(ISP2200) - additional_firmware_options.operation_mode = 4; +/* additional_firmware_options.operation_mode = 4; */ icb->response_accum_timer = 3; icb->interrupt_delay_timer = 5; #endif The v6 driver did not set this operation mode, so I'd like to see if it is responsible for the performance problem. Please run the tape performace test with these drivers and post the results. Tom
Just let me know if you want 2.4.21-27 drivers.
The QLogic firmware implements a feature called Reduced Interrupt Operation (RIO). This feature has several modes, controlled by the operation_mode setting. The V6 driver uses the NVRAM setting for operation_mode. The v7 driver forces mode 4 for the qla2200. I would like to know which mode your adapter NVRAM is in, since this mode is getting much better tape performance than mode 4. Please use the QLogic BIOS utility to determine which mode your adapter is set for. Press alt-Q duting power up self-test, select "configuration options" then "extended firmware options" then "RIO Operation Mode". The adapter will be in one of five modes. Please post the mode that your adapter is using.
Created attachment 111052 [details] patch to add a parameter to control RIO. This is the patch I am currently testing for the 7.01.01 driver in RHEL 3 U5. The customer must add this module parameter to modules.conf to restore the v6 RIO behavior: ql2xriofromnvram=1
A fix for this problem has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.15.EL).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html
Okay, finally I got to be around when a downtime was scheduled and could poke in Fast!UTIL. Apparently it uses RIO mode 0 (zero) with response timer / interrupt delay timer both set to zero. HTH, Peter
RHEL3 U5 drivers work nicely with the suggested bypass. Mode 0 (64-bit by default, as reported by the driver) works fine. Thanks, Peter