Bug 316691
Summary: | qla2xxx driver failing w/ realtime kernel | |
---|---|---|---
Product: | Red Hat Enterprise MRG | Reporter: | Luis Claudio R. Goncalves <lgoncalv>
Component: | realtime-kernel | Assignee: | Marcus Barrow <mbarrow>
Status: | CLOSED WORKSFORME | QA Contact: |
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 1.0 | CC: | jburke, mbarrow
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2008-04-23 20:17:50 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |

Description
Luis Claudio R. Goncalves
2007-10-03 11:54:32 UTC
The fixed kernel survived on PERF5 last night ... but not on et-virt03/4

[root@perf5 aio]# ./aio-stress -s 128m -r 64k -d 1 -O -t 1 /perf1/t1
dropping io_iter to 1
file size 128MB, record size 64KB, depth 1, ios per iteration 1
max io_submit 1, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /perf1/t1 thread 0
write on /perf1/t1 (111.29 MB/s) 128.00 MB in 1.15s
thread 0 write totals (109.84 MB/s) 128.00 MB in 1.17s
read on /perf1/t1 (123.23 MB/s) 128.00 MB in 1.04s
thread 0 read totals (122.24 MB/s) 128.00 MB in 1.05s
random write on /perf1/t1 (115.67 MB/s) 128.00 MB in 1.11s
thread 0 random write totals (114.80 MB/s) 128.00 MB in 1.11s
random read on /perf1/t1 (116.95 MB/s) 128.00 MB in 1.09s
thread 0 random read totals (116.06 MB/s) 128.00 MB in 1.10s

[root@perf5 aio]# ./aio-stress -s 128m -r 64k -d 1 -t 1 /perf1/t1
dropping io_iter to 1
file size 128MB, record size 64KB, depth 1, ios per iteration 1
max io_submit 1, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /perf1/t1 thread 0
write on /perf1/t1 (626.91 MB/s) 128.00 MB in 0.20s
thread 0 write totals (148.20 MB/s) 128.00 MB in 0.86s
read on /perf1/t1 (1174.57 MB/s) 128.00 MB in 0.11s
thread 0 read totals (1166.62 MB/s) 128.00 MB in 0.11s
random write on /perf1/t1 (862.16 MB/s) 128.00 MB in 0.15s
thread 0 random write totals (226.48 MB/s) 128.00 MB in 0.57s
random read on /perf1/t1 (1214.31 MB/s) 128.00 MB in 0.11s
thread 0 random read totals (1205.89 MB/s) 128.00 MB in 0.11s

[root@perf5 aio]# ./aio-stress -s 128m -r 64k -d 8 -t 1 /perf1/t1
file size 128MB, record size 64KB, depth 8, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /perf1/t1 thread 0
write on /perf1/t1 (888.01 MB/s) 128.00 MB in 0.14s
thread 0 write totals (158.88 MB/s) 128.00 MB in 0.81s
read on /perf1/t1 (1122.06 MB/s) 128.00 MB in 0.11s
thread 0 read totals (1115.04 MB/s) 128.00 MB in 0.11s
random write on /perf1/t1 (952.16 MB/s) 128.00 MB in 0.13s
thread 0 random write totals (231.62 MB/s) 128.00 MB in 0.55s
random read on /perf1/t1 (1117.60 MB/s) 128.00 MB in 0.11s
thread 0 random read totals (1117.16 MB/s) 128.00 MB in 0.11s

[root@perf5 aio]# ./aio-stress -s 128m -r 64k -d 8 -O -t 1 /perf1/t1
file size 128MB, record size 64KB, depth 8, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /perf1/t1 thread 0
write on /perf1/t1 (174.99 MB/s) 128.00 MB in 0.73s
thread 0 write totals (172.77 MB/s) 128.00 MB in 0.74s
read on /perf1/t1 (158.71 MB/s) 128.00 MB in 0.81s
thread 0 read totals (157.04 MB/s) 128.00 MB in 0.82s
random write on /perf1/t1 (168.78 MB/s) 128.00 MB in 0.76s
thread 0 random write totals (166.89 MB/s) 128.00 MB in 0.77s
random read on /perf1/t1 (150.40 MB/s) 128.00 MB in 0.85s
thread 0 random read totals (149.02 MB/s) 128.00 MB in 0.86s

[root@perf5 aio]# uname -a
Linux perf5.lab.boston.redhat.com 2.6.21-38_fix_.fc7 #1 SMP PREEMPT RT Mon Sep 24 13:41:17 BRT 2007 x86_64 x86_64 x86_64 GNU/Linux

Differences

perf5 | et-virt03/4
---|---
AMD64 8-cpu 2.8 GHz | Intel Xeon 3.4 GHz
4 GB memory | 4 GB memory
IOmmu | no-IOmmu
lpfc (emulex FC card) | qla2462 (qlogic FC card)
msa1000 8disk R0 - 15k rpm | (same)
clustered GFS 5.1 | Standalone 5.1

upgraded many times from FC-to-5.0-5.1-RT

The fixed kernel Shak talks about is 2.6.21-38.el5rt + aio fix. No changes regarding Fibre Channel.

I have run the infamous dd-of-death on three different systems and only one misbehaved. One of them had the same qla2462 FC card but didn't present any problem.

rt-amd-01 - OK
dell-pe1950-05 - OK
et-virt04 - died horribly when I tried both dd-o'-death versions simultaneously.

dd-of-death:
1: dd if=/dev/sdN of=/dev/null -> sdN points to the fc lun
2: dd if=/dev/sda of=/mnt/fc_lun_mountpoint/BIG_testfile.dat -> sda is !FC lun. In other words, copy something big.

Created attachment 230391
Dmesg log for et-virt04 hanging on dd-of-death
A compilation of valuable data Jeff Burke observed in his tests. The highlights are:
* The issue seems to be qla2xxx specific. Other controllers showed no problems in this test.
* Jeff got some cool stack backtraces.

<jburke> Now running and I got this message
Oct 24 16:40:18 et-virt04 kernel: qla2xxx_eh_abort(1): aborting sp ffff8100667ba500 from RISC. pid=49030.
<jburke> here is the top output
<jburke> 4012 root 15 0 12716 1136 796 R 0 0.1 0:04.71 top
<jburke> 509 root 15 -5 0 0 0 D 0 0.0 0:00.00 scsi_eh_1
<jburke> 4108 root 18 0 1780 444 276 D 0 0.0 0:00.39 dt
<jburke> 4109 root 18 0 1780 444 276 D 0 0.0 0:00.35 dt
<jburke> 4110 root 18 0 1780 444 276 D 0 0.0 0:00.34 dt
<jburke> 4111 root 18 0 1780 444 276 D 0 0.0 0:00.33 dt
<jburke> 4112 root 18 0 1780 444 276 D 0 0.0 0:00.36 dt
<jburke> 4113 root 18 0 1780 444 276 D 0 0.0 0:00.33 dt
<jburke> 4114 root 18 0 1780 444 276 D 0 0.0 0:00.33 dt
<jburke> 4115 root 18 0 1780 444 276 D 0 0.0 0:00.36 dt
<jburke> 4116 root 18 0 1780 444 276 D 0 0.0 0:00.34 dt
<jburke> As soon as that message came in "qla2xxx_eh_abort(1): aborting sp ffff8100667ba500 from RISC. pid=49030."
<lclaudio> jburke, there should be another message right after this one with the scsi command aborted...
<jburke> Nope not coming out
<jburke> As soon as that message came in "qla2xxx_eh_abort(1): aborting sp ffff8100667ba500 from RISC. pid=49030."
<jburke> scsi_eh_1 process jumped into top
<lclaudio> jburke, drivers/scsi/qla2xxx/qla_dbg.h -> debuglevels are defined at compilation time.
<lclaudio> oops
<lclaudio> that's the code:
<lclaudio> DEBUG2(printk("%s(%ld): aborting sp %p from RISC. pid=%ld.\n",
<lclaudio> __func__, ha->host_no, sp, serial));
<lclaudio> DEBUG3(qla2x00_print_scsi_cmd(cmd));
<lclaudio> qla_os.c : 674
<lclaudio> anyway, DEBUG3 is void as of now.
<lclaudio> jburke, you can uncomment the defines in qla_dbg.h - recompile the module - and get more info.
<lclaudio> lots more.
<jburke> __lc, It looks like it is a scsi error path issue
<jburke> __lc, All of the dt tasks are blocked behind the scsi_eh_1
<jburke> SysRq : Show Blocked State
<jburke> free sibling
<jburke> task PC stack pid father child younger older
<jburke> scsi_eh_1 D [ffff8100041ff040] ffff81007e4eeb70 0 509 59 (L-TLB)
<jburke> ffff81007e4f1b70 0000000000000046 0000000000000003 ffff81007f57b810
<jburke> ffff810003d95820 ffff8100041ff290 ffff8100041ff040 ffff81007f57b810
<jburke> 000002315e371e70 00000000000074d9 0000000100000003 0000000300000003
<jburke> Call Trace:
<jburke> [<ffffffff80264714>] schedule+0xe2/0x102
<jburke> [<ffffffff80265d84>] __compat_down+0xda/0xef
<jburke> [<ffffffff802659da>] __compat_down_failed+0x35/0x3a
<jburke> [<ffffffff8812188c>] :qla2xxx:qla2x00_mailbox_command+0x1e8/0x51d
<jburke> [<ffffffff88122b8b>] :qla2xxx:qla2x00_issue_iocb+0x5f/0xc9
<jburke> [<ffffffff88123426>] :qla2xxx:qla24xx_abort_command+0xfd/0x1b7
<jburke> [<ffffffff8811ab37>] :qla2xxx:qla2xxx_eh_abort+0xba/0x1fa
<jburke> [<ffffffff8807db89>] :scsi_mod:__scsi_try_to_abort_cmd+0x21/0x23
<jburke> [<ffffffff8807f2e2>] :scsi_mod:scsi_error_handler+0x2a1/0x4cf
<jburke> [<ffffffff80233db6>] kthread+0xf5/0x128
<jburke> [<ffffffff8025ff68>] child_rip+0xa/0x12
<jburke> __lc, This is a driver issue with this card -> Fibre Channel: QLogic Corp. ISP2422-based 4Gb Fibre Channel to PCI-X HBA (rev 02)

Marcus & friends back at QLogic are pursuing this issue currently.

Attempts to reproduce at QLogic have failed, so far. Could people provide information on the hardware systems that are experiencing failure?

et-virt04 PCI-X 133 MHz fails
dell-pe1950-05 PCI-e good

thanks, being able to reproduce at the factory would be a big help...
Marcus

Since we haven't seen any further reports of FC failures in the 2.6.24.x series of kernels, we'll close this for now.
Clark
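
For anyone trying to reproduce this, the observation in the log above (scsi_eh_1 and the dt workers all sitting in state D) can be checked mechanically rather than by eye. The following is only a minimal sketch in Python, not something used in this bug: the helper name blocked_tasks and the SAMPLE text are made up for illustration, and it assumes the default top column layout (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND) seen in the pasted output.

```python
def blocked_tasks(top_output):
    """Return (pid, command) pairs for tasks whose state column is 'D'.

    Assumes the default `top` layout:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    (hypothetical helper, written for illustration only).
    """
    blocked = []
    for line in top_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines and anything that is not a task row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        # Column 8 (index 7) is the process state; 'D' is uninterruptible sleep.
        if fields[7] == "D":
            blocked.append((int(fields[0]), " ".join(fields[11:])))
    return blocked


# Three rows copied from the top output quoted in the IRC log.
SAMPLE = """\
 4012 root 15 0 12716 1136 796 R 0 0.1 0:04.71 top
  509 root 15 -5 0 0 0 D 0 0.0 0:00.00 scsi_eh_1
 4108 root 18 0 1780 444 276 D 0 0.0 0:00.39 dt
"""

if __name__ == "__main__":
    for pid, command in blocked_tasks(SAMPLE):
        print(pid, command)
```

Feeding it the output of a batch-mode top run taken while the dd or dt load is active should list the SCSI error handler thread alongside the stalled I/O tasks if the hang described here is occurring; an empty result would suggest the error path is not stuck.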