Bug 316691

Summary: qla2xxx driver failing w/ realtime kernel
Product: Red Hat Enterprise MRG
Reporter: Luis Claudio R. Goncalves <lgoncalv>
Component: realtime-kernel
Assignee: Marcus Barrow <mbarrow>
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Version: 1.0
CC: jburke, mbarrow
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-04-23 20:17:50 UTC
Attachments:
Dmesg log for et-virt04 hanging on dd-of-death (flags: none)

Description Luis Claudio R. Goncalves 2007-10-03 11:54:32 UTC
Observed with:
2.6.21-39.el5rt
2.6.21-38.el5rt + aio fix from BZ#263461

Problem:
When using one of the above RT kernels and accessing a Fiber Channel LUN,
disk operations eventually fail and the filesystem ends up corrupted.
One of the tests was a mkfs on a fiber LUN, on a freshly booted machine, and the
process hung. In some cases even an ls on the fiber LUN would hang. After some of
the disk stress tests, the filesystem was corrupted. We were able to hang the
machine by issuing acme's uber-simple-killer:

    dd if=/dev/sdc4 of=/dev/null

The tests were conducted on three of Shak's boxes, using two different Fiber
Channel SANs. The boxes are: et-virt03, et-virt04 and perf5. The fiber LUN used
was /dev/sdc4, mounted on /perf1.

NOTE: this problem showed up while we were chasing an aio bug. So we have to
perform specific tests to get more information on the subject.

Comment 1 John Shakshober 2007-10-04 21:41:25 UTC
The fixed kernel survived on PERF5 last night ... but not on et-virt03/4

[root@perf5 aio]# ./aio-stress -s 128m -r 64k -d 1 -O -t 1 /perf1/t1
dropping io_iter to 1
file size 128MB, record size 64KB, depth 1, ios per iteration 1
max io_submit 1, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /perf1/t1 thread 0
write on /perf1/t1 (111.29 MB/s) 128.00 MB in 1.15s
thread 0 write totals (109.84 MB/s) 128.00 MB in 1.17s
read on /perf1/t1 (123.23 MB/s) 128.00 MB in 1.04s
thread 0 read totals (122.24 MB/s) 128.00 MB in 1.05s
random write on /perf1/t1 (115.67 MB/s) 128.00 MB in 1.11s
thread 0 random write totals (114.80 MB/s) 128.00 MB in 1.11s
random read on /perf1/t1 (116.95 MB/s) 128.00 MB in 1.09s
thread 0 random read totals (116.06 MB/s) 128.00 MB in 1.10s
[root@perf5 aio]# ./aio-stress -s 128m -r 64k -d 1  -t 1 /perf1/t1
dropping io_iter to 1
file size 128MB, record size 64KB, depth 1, ios per iteration 1
max io_submit 1, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /perf1/t1 thread 0
write on /perf1/t1 (626.91 MB/s) 128.00 MB in 0.20s
thread 0 write totals (148.20 MB/s) 128.00 MB in 0.86s
read on /perf1/t1 (1174.57 MB/s) 128.00 MB in 0.11s
thread 0 read totals (1166.62 MB/s) 128.00 MB in 0.11s
random write on /perf1/t1 (862.16 MB/s) 128.00 MB in 0.15s
thread 0 random write totals (226.48 MB/s) 128.00 MB in 0.57s
random read on /perf1/t1 (1214.31 MB/s) 128.00 MB in 0.11s
thread 0 random read totals (1205.89 MB/s) 128.00 MB in 0.11s
[root@perf5 aio]# ./aio-stress -s 128m -r 64k -d 8 -t 1 /perf1/t1
file size 128MB, record size 64KB, depth 8, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /perf1/t1 thread 0
write on /perf1/t1 (888.01 MB/s) 128.00 MB in 0.14s
thread 0 write totals (158.88 MB/s) 128.00 MB in 0.81s
read on /perf1/t1 (1122.06 MB/s) 128.00 MB in 0.11s
thread 0 read totals (1115.04 MB/s) 128.00 MB in 0.11s
random write on /perf1/t1 (952.16 MB/s) 128.00 MB in 0.13s
thread 0 random write totals (231.62 MB/s) 128.00 MB in 0.55s
random read on /perf1/t1 (1117.60 MB/s) 128.00 MB in 0.11s
thread 0 random read totals (1117.16 MB/s) 128.00 MB in 0.11s
[root@perf5 aio]# ./aio-stress -s 128m -r 64k -d 8 -O  -t 1 /perf1/t1
file size 128MB, record size 64KB, depth 8, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /perf1/t1 thread 0
write on /perf1/t1 (174.99 MB/s) 128.00 MB in 0.73s
thread 0 write totals (172.77 MB/s) 128.00 MB in 0.74s
read on /perf1/t1 (158.71 MB/s) 128.00 MB in 0.81s
thread 0 read totals (157.04 MB/s) 128.00 MB in 0.82s
random write on /perf1/t1 (168.78 MB/s) 128.00 MB in 0.76s
thread 0 random write totals (166.89 MB/s) 128.00 MB in 0.77s
random read on /perf1/t1 (150.40 MB/s) 128.00 MB in 0.85s
thread 0 random read totals (149.02 MB/s) 128.00 MB in 0.86s
[root@perf5 aio]# uname -a
Linux perf5.lab.boston.redhat.com 2.6.21-38_fix_.fc7 #1 SMP PREEMPT RT Mon Sep
24 13:41:17 BRT 2007 x86_64 x86_64 x86_64 GNU/Linux

Differences:

             perf5                           et-virt03/4
  CPU        AMD64, 8 CPUs, 2.8 GHz          Intel Xeon, 3.4 GHz
  Memory     4 GB                            4 GB
  IOMMU      IOmmu                           no-IOmmu
  FC HBA     lpfc (Emulex FC card)           qla2462 (QLogic FC card)
  Storage    msa1000, 8 disks, R0, 15k rpm   (same)
  Install    clustered GFS 5.1               standalone 5.1, upgraded many times
                                             from FC-to-5.0-5.1-RT
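
Note: to compare the four aio-stress runs above without reading the numbers by eye, the per-phase lines can be pulled out with a small parser. This is only a sketch and not part of the original test setup: the name `parse_phases` and the regular expression are illustrative, inferred solely from the output format pasted in this comment. The runs differ by queue depth and by the `-O` flag, which as far as I know asks aio-stress for O_DIRECT.

```python
import re

# Matches per-phase lines such as:
#   write on /perf1/t1 (111.29 MB/s) 128.00 MB in 1.15s
# Lines starting with "thread N ..." are per-thread totals and do not match.
PHASE_RE = re.compile(
    r"^(?P<phase>(?:random )?(?:read|write)) on (?P<path>\S+) "
    r"\((?P<rate>[\d.]+) MB/s\) (?P<size>[\d.]+) MB in (?P<secs>[\d.]+)s$"
)


def parse_phases(text):
    """Return {phase name: throughput in MB/s} for one aio-stress run."""
    rates = {}
    for line in text.splitlines():
        match = PHASE_RE.match(line.strip())
        if match:
            rates[match.group("phase")] = float(match.group("rate"))
    return rates
```

Feeding each run's output to `parse_phases` gives one dictionary per run, which makes it easy to line up the same phase across the `-O` and non-`-O` runs.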


Comment 2 Luis Claudio R. Goncalves 2007-10-04 21:57:16 UTC
The fixed kernel Shak talks about is 2.6.21-38.el5rt + aio fix. No changes
regarding Fiber Channel.

Comment 3 Luis Claudio R. Goncalves 2007-10-17 20:51:12 UTC
I have run the infamous dd-of-death on three different systems and only one
misbehaved... One of them had the same qla2462 FC card but didn't show any
problem.

rt-amd-01 - OK
dell-pe1950-05 - OK
et-virt04 - died horribly when I tried both dd-o'-death versions simultaneously.

dd-of-death:
1:  dd if=/dev/sdN of=/dev/null   -> sdN points to the FC LUN
2:  dd if=/dev/sda of=/mnt/fc_lun_mountpoint/BIG_testfile.dat  -> sda is not an FC
LUN. In other words, copy something big.
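
Note: the two dd variants above can be run simultaneously under a simple watchdog that reports which copies did not finish in time. The following is a rough sketch, not the procedure actually used in these tests; `copy_stream` and `copies_finish` are made-up names, and the device paths in the trailing comment are the placeholders from this comment, not real devices.

```python
import os
import threading
import time


def copy_stream(src, dst, done, chunk=1 << 20):
    """Copy src to dst in 1 MiB chunks (roughly dd if=src of=dst), then flag completion."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(chunk)
            if not block:
                break
            fout.write(block)
    done.set()


def copies_finish(jobs, timeout):
    """Run all (src, dst) jobs at once; return {job: finished_within_timeout}.

    A job that fails to open its files is also reported as not finished.
    """
    events = {}
    for job in jobs:
        done = threading.Event()
        threading.Thread(target=copy_stream, args=(*job, done), daemon=True).start()
        events[job] = done
    deadline = time.monotonic() + timeout
    return {job: done.wait(max(0.0, deadline - time.monotonic()))
            for job, done in events.items()}


# Hypothetical usage, with the placeholder paths from the comment above:
#   copies_finish([("/dev/sdN", os.devnull),
#                  ("/dev/sda", "/mnt/fc_lun_mountpoint/BIG_testfile.dat")],
#                 timeout=600)
```

The watchdog can only report. A task stuck in uninterruptible sleep cannot be killed, so a hung copy stays around until the I/O completes or the machine is rebooted.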



Comment 4 Luis Claudio R. Goncalves 2007-10-17 21:10:59 UTC
Created attachment 230391 [details]
Dmesg log for et-virt04 hanging on dd-of-death

Comment 5 Luis Claudio R. Goncalves 2007-10-25 11:34:52 UTC
A compilation of valuable data Jeff Burke observed in his tests:

The highlights are:
* The issue seems to be qla2xxx specific. Other controllers showed no problems
in this test.
* Jeff got some cool stack backtraces.

<jburke> Now running and I got this message Oct 24 16:40:18 et-virt04 kernel:
qla2xxx_eh_abort(1): aborting sp ffff8100667ba500 from RISC. pid=49030.
<jburke> here is the top output
<jburke>  4012 root      15   0 12716 1136  796 R    0  0.1   0:04.71 top
<jburke>   509 root      15  -5     0    0    0 D    0  0.0   0:00.00 scsi_eh_1
<jburke>  4108 root      18   0  1780  444  276 D    0  0.0   0:00.39 dt
<jburke>  4109 root      18   0  1780  444  276 D    0  0.0   0:00.35 dt
<jburke>  4110 root      18   0  1780  444  276 D    0  0.0   0:00.34 dt
<jburke>  4111 root      18   0  1780  444  276 D    0  0.0   0:00.33 dt
<jburke>  4112 root      18   0  1780  444  276 D    0  0.0   0:00.36 dt
<jburke>  4113 root      18   0  1780  444  276 D    0  0.0   0:00.33 dt
<jburke>  4114 root      18   0  1780  444  276 D    0  0.0   0:00.33 dt
<jburke>  4115 root      18   0  1780  444  276 D    0  0.0   0:00.36 dt
<jburke>  4116 root      18   0  1780  444  276 D    0  0.0   0:00.34 dt
<jburke> As soon as that message came in "qla2xxx_eh_abort(1): aborting sp
ffff8100667ba500 from RISC. pid=49030."
<lclaudio> jburke, there should be another message right after this one with the
scsi command aborted...
<jburke> Nope not coming out
<jburke> As soon as that message came in "qla2xxx_eh_abort(1): aborting sp
ffff8100667ba500 from RISC. pid=49030."
<jburke> scsi_eh_1 process jumped into top
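
Note: the D (uninterruptible sleep) column that top shows above can also be read straight from /proc, which is handy when the box is too wedged for an interactive top. This is a minimal sketch, assuming a Linux /proc layout; `task_state` and `blocked_tasks` are illustrative names, not an existing tool.

```python
import os


def task_state(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def blocked_tasks(proc="/proc"):
    """Return a sorted list of (pid, comm) for tasks currently in state D.

    Only the top-level /proc/<pid> entries are scanned, so threads other
    than thread-group leaders are not included.
    """
    blocked = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = task_state(f.read())
        except OSError:
            continue  # the task exited between listdir() and open()
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

On a system in the state shown above, the result would be expected to list scsi_eh_1 together with the dt processes.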

<lclaudio> jburke, drivers/scsi/qla2xxx/qla_dbg.h  -> debuglevels are defined at
compilation time.
<lclaudio> oops
<lclaudio> that's the code:
<lclaudio>                 DEBUG2(printk("%s(%ld): aborting sp %p from RISC. pid=%ld.\n",
<lclaudio>                     __func__, ha->host_no, sp, serial));
<lclaudio>                 DEBUG3(qla2x00_print_scsi_cmd(cmd));
<lclaudio> qla_os.c : 674
<lclaudio> anyway, DEBUG3 is void as of now.
<lclaudio> jburke, you can uncomment the defines in qla_dbg.h - recompile the
module - and get more info.
<lclaudio> lots more.

<jburke> __lc, It looks like it is a scsi error path issue
<jburke> __lc, All of the dt tasks are blocked behind the scsi_eh_1
<jburke> SysRq : Show Blocked State
<jburke>                                  free                        sibling
<jburke>   task                 PC        stack   pid father child younger older
<jburke> scsi_eh_1     D [ffff8100041ff040] ffff81007e4eeb70     0   509     59
(L-TLB)
<jburke>  ffff81007e4f1b70 0000000000000046 0000000000000003 ffff81007f57b810
<jburke>  ffff810003d95820 ffff8100041ff290 ffff8100041ff040 ffff81007f57b810
<jburke>  000002315e371e70 00000000000074d9 0000000100000003 0000000300000003
<jburke> Call Trace:
<jburke>  [<ffffffff80264714>] schedule+0xe2/0x102
<jburke>  [<ffffffff80265d84>] __compat_down+0xda/0xef
<jburke>  [<ffffffff802659da>] __compat_down_failed+0x35/0x3a
<jburke>  [<ffffffff8812188c>] :qla2xxx:qla2x00_mailbox_command+0x1e8/0x51d
<jburke>  [<ffffffff88122b8b>] :qla2xxx:qla2x00_issue_iocb+0x5f/0xc9
<jburke>  [<ffffffff88123426>] :qla2xxx:qla24xx_abort_command+0xfd/0x1b7
<jburke>  [<ffffffff8811ab37>] :qla2xxx:qla2xxx_eh_abort+0xba/0x1fa
<jburke>  [<ffffffff8807db89>] :scsi_mod:__scsi_try_to_abort_cmd+0x21/0x23
<jburke>  [<ffffffff8807f2e2>] :scsi_mod:scsi_error_handler+0x2a1/0x4cf
<jburke>  [<ffffffff80233db6>] kthread+0xf5/0x128
<jburke>  [<ffffffff8025ff68>] child_rip+0xa/0x12
<jburke> __lc, This is a driver issue with this card -> Fibre Channel: QLogic
Corp. ISP2422-based 4Gb Fibre Channel to PCI-X HBA (rev 02)
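
Note: when several such backtraces have to be compared, it helps to reduce each one to its (module, symbol) frames and to find the innermost frame that lives in a module. The sketch below is based only on the trace format pasted above; `parse_frames` and `first_module_frame` are hypothetical helper names. For this trace the innermost module frame is qla2x00_mailbox_command in qla2xxx, which fits the reading that scsi_eh_1 is waiting inside the driver's mailbox path.

```python
import re

# Matches frames such as:
#   [<ffffffff80264714>] schedule+0xe2/0x102
#   [<ffffffff8812188c>] :qla2xxx:qla2x00_mailbox_command+0x1e8/0x51d
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?::(?P<module>\w+):)?"
    r"(?P<symbol>\w+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(text):
    """Return [(module or None, symbol)] in trace order, innermost frame first."""
    return [(m.group("module"), m.group("symbol")) for m in FRAME_RE.finditer(text)]


def first_module_frame(frames):
    """Return the innermost frame that belongs to a loadable module, or None."""
    for module, symbol in frames:
        if module is not None:
            return module, symbol
    return None
```

Frames without a `:module:` prefix are reported with `None` as the module, meaning the symbol is built into the kernel image.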



Comment 6 Tim Burke 2007-11-08 20:21:28 UTC
Marcus & friends back at QLogic are currently pursuing this issue.


Comment 7 Marcus Barrow 2007-11-08 20:54:17 UTC
Attempts to reproduce at QLogic have failed so far. Could people provide information on the
hardware systems that are experiencing failures?

et-virt04        PCI-X 133 MHz    fails
dell-pe1950-05   PCI-e            good

Thanks. Being able to reproduce at the factory would be a big help...

Marcus


Comment 9 Clark Williams 2008-04-23 20:17:50 UTC
Since we haven't seen any further reports of FC failures in the 2.6.24.x series
of kernels, we'll close this for now.

Clark