Bug 628583 - [NetApp/QLogic 5.6 bug] Kernel panic hit on RHEL 5.5 QLogic FC host at qla2x00_abort_fcport_cmds
Summary: [NetApp/QLogic 5.6 bug] Kernel panic hit on RHEL 5.5 QLogic FC host at qla2x0...
Keywords:
Status: CLOSED DUPLICATE of bug 643135
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5.z
Hardware: All
OS: Linux
high
high
Target Milestone: rc
: 5.6
Assignee: Chad Dupuis (Cavium)
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks: 557597
TreeView+ depends on / blocked
 
Reported: 2010-08-30 12:54 UTC by Martin George
Modified: 2010-10-14 18:28 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-14 18:18:24 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
qla2xxx-Correct-use-after-free-issue-in-terminate-rport-io.patch (3.15 KB, patch)
2010-09-13 17:06 UTC, Chad Dupuis (Cavium)
no flags Details | Diff

Description Martin George 2010-08-30 12:54:48 UTC
Description of problem:
Hit a kernel panic on a RHEL 5.5 FC host with a QLogic 8G FC adapter (QLE2562) while running IO with target controller faults:

Pid: 439, comm: fc_wq_1 Not tainted 2.6.18-194.11.1.el5 #1
RIP: 0010:[<ffffffff880cc72b>]  [<ffffffff880cc72b>] :qla2xxx:qla2x00_abort_fcport_cmds+0x8c/0x124
RSP: 0018:ffff81007e379dc0  EFLAGS: 00010202
RAX: 302e302020202020 RBX: ffff810020e15a00 RCX: 0000000000000040
RDX: 0000000000000003 RSI: 0000000000000246 RDI: ffff81007f976870
RBP: ffff81007f976800 R08: ffff81007e1784f8 R09: 000000007e175300
R10: 0000000000000000 R11: 0000000000000000 R12: ffff81007e1784f8
R13: ffff81007e1784f8 R14: ffff81007e178578 R15: 00000000000001b4
FS:  0000000000000000(0000) GS:ffff81007ff0c840(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaabfd2000 CR3: 0000000070f2c000 CR4: 00000000000006e0
Process fc_wq_1 (pid: 439, threadinfo ffff81007e378000, task ffff810037c17860)
Stack:  ffff81007e54b880 ffff81007e54b880 ffff81007e178000 ffff81007dc45000
 0000000000000282 ffff81007dc45000 ffffffff880b7172 ffffffff880e87f1
 ffff81007e178000 ffff81007dc45000 ffff81007e6db2c0 ffffffff880b71c5
Call Trace:
 [<ffffffff880b7172>] :scsi_transport_fc:fc_starget_delete+0x0/0xf5
 [<ffffffff880e87f1>] :qla2xxx:qla2x00_dev_loss_tmo_callbk+0xc2/0xef
 [<ffffffff880b71c5>] :scsi_transport_fc:fc_starget_delete+0x53/0xf5
 [<ffffffff8004d624>] run_workqueue+0x94/0xe4
 [<ffffffff80049e5f>] worker_thread+0x0/0x122
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049f4f>] worker_thread+0xf0/0x122
 [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 48 8b 58 38 e8 45 92 0f f8 48 8d 95 c8 01 00 00 48 89 c6 48 
RIP  [<ffffffff880cc72b>] :qla2xxx:qla2x00_abort_fcport_cmds+0x8c/0x124
 RSP <ffff81007e379dc0>
 <0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
RHEL 5.5 Errata (2.6.18-194.11.1.el5)
QLE2562 FW:v5.03.02 DVR:v8.03.01.04.05.05-k

How reproducible:
Hit it once so far.

Comment 1 RHEL Program Management 2010-08-30 18:49:26 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Rob Evers 2010-09-01 15:29:10 UTC
From

RIP: 0010:[<ffffffff880cc72b>]  [<ffffffff880cc72b>]
:qla2xxx:qla2x00_abort_fcport_cmds+0x8c/0x124

Looks like the line of code below is where the panic is induced:

qla2x00_abort_fcport_cmds(fc_port_t *fcport)
{
        int cnt;
        unsigned long flags;
        srb_t *sp;
        scsi_qla_host_t *ha = fcport->ha;
        scsi_qla_host_t *pha = to_qla_parent(ha);

        spin_lock_irqsave(&pha->hardware_lock, flags);
        for (cnt = 1; cnt < MAX_OUTSTANDING_COMMANDS; cnt++) {
                sp = pha->outstanding_cmds[cnt];
                if (!sp)
                        continue;
                if (sp->fcport != fcport)
                        continue;

                spin_unlock_irqrestore(&pha->hardware_lock, flags);
                if (ha->isp_ops->abort_command(ha, sp)) {
                        DEBUG2(qla_printk(KERN_WARNING, ha,
                            "Abort failed --  %lx\n", sp->cmd->serial_number));

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In particular, the sp->cmd is invalid.  Can't tell for sure because the part of the stack trace that indicates what caused the problem isn't included.

Is it possible to configure kdump (and test the kdump configuration under load esp. if boot from SAN) while running the scenario that induced this problem?

A dump file would provide much information valuable to diagnosing this problem.

Comment 3 Chad Dupuis (Cavium) 2010-09-01 15:44:08 UTC
Another data point is that the function, qla2x00_abort_fcport_cmd() is not present in the RHEL 5.6 beta kernels.  Instead qla2x00_dev_loss_tmo_callbk() will simply make the driver remember any references to the rport and let the transport and midlayer deal with the outstanding commands.

static void
qla2x00_dev_loss_tmo_callbk(struct fc_rport *rport)
{
        scsi_qla_host_t *ha = NULL;
        struct Scsi_Host *host = rport_to_shost(rport);
        fc_port_t *fcport = *(fc_port_t **)rport->dd_data;

        if (!fcport)
                return;

        ha = to_qla_parent(fcport->ha);
        if (test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags)) {
                return ;
        }

        if (unlikely(pci_channel_offline(fcport->ha->pdev))) {
                qla2x00_abort_all_cmds(fcport->ha, DID_NO_CONNECT << 16);
                return;
        }

        /*
         * At this point all fcport's software-states are cleared.  Perform any
         * final cleanup of firmware resources (PCBs and XCBs).
         */
        if (fcport->loop_id != FC_NO_LOOP_ID &&
            !test_bit(UNLOADING, &fcport->ha->dpc_flags)) {
                fcport->ha->isp_ops->fabric_logout(fcport->ha,
                        fcport->loop_id, fcport->d_id.b.domain,
                        fcport->d_id.b.area, fcport->d_id.b.al_pa);
                fcport->loop_id = FC_NO_LOOP_ID;
        }

        /*
         * Transport has effectively 'deleted' the rport, clear
         * all local references.
         */
        spin_lock_irq(host->host_lock);
        fcport->rport = NULL;
        *((fc_port_t **)rport->dd_data) = NULL;
        spin_unlock_irq(host->host_lock);
}

It's possible that no fix may need to be made to RHEL 5.6 as the semantics of the dev_loss_tmo processing may be different.

Comment 4 Martin George 2010-09-07 17:59:58 UTC
(In reply to comment #2)
> 
> Is it possible to configure kdump (and test the kdump configuration under load
> esp. if boot from SAN) while running the scenario that induced this problem?
> 
> A dump file would provide much information valuable to diagnosing this problem.

This is extremely difficult to reproduce. As mentioned in comment #0, I've hit this only once in my tests so far (and that too on a root on dm-multipath SANBoot setup). So I doubt if I'll be able to provide you with the requisite dumps as requested. Was hoping this could be root caused by analyzing the above stack trace snippet itself.

Comment 5 Chad Dupuis (Cavium) 2010-09-08 15:36:58 UTC
After talking internally about it, it's possible that we're running into an issue that is fixed here:

[SCSI] qla2xxx: Correct use-after-free oops seen during EH-abort.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=083a469db4ecf3b286a96b5b722c37fc1affe0be

Essentially, we implemented referencing counting for sp's to make sure that some sp's weren't invalidated while the hardware_lock was dropped.

Comment 6 Martin George 2010-09-10 19:53:06 UTC
(In reply to comment #5)
> After talking internally about it, it's possible that we're running into an
> issue that is fixed here:
> 
> [SCSI] qla2xxx: Correct use-after-free oops seen during EH-abort.
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=083a469db4ecf3b286a96b5b722c37fc1affe0be
> 

This patch does not apply cleanly to the above RHEL 5.5 Errata (2.6.18-194.11.1.el5). So could you send me an updated patch for the same?

Comment 7 Martin George 2010-09-13 08:15:31 UTC
I hit the same kernel panic again. And this was with a reverted block state patch given by Mike Christie for fixing the regression described in 5.5 bug 632195 .

I've attached the complete stack trace below:

qla2x00_mailbox_command(0): cmd access timeout. Exiting.
qla2x00_issue_iocb(0): failed rval 0x100
qla2x00_issue_iocb(0): failed rval 0x100
qla24xx_abort_command(0): failed to issue IOCB (100).
Unable to handle kernel paging request at 00000000bfebfc37 RIP:
 [<ffffffff880cc72b>] :qla2xxx:qla2x00_abort_fcport_cmds+0x8c/0x124
PGD 7b0ca067 PUD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:02:00.0/host0/rport-0:0-0/target0:0:0/0:0:0:4/timeout
CPU 0
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa dPid: 434, comm: fc_wq_0 Not tainted 2.6.18-194.11.1.el5.unblock #1
RIP: 0010:[<ffffffff880cc72b>]  [<ffffffff880cc72b>] :qla2xxx:qla2x00_abort_fcport_cmds+0x8c/0x124
RSP: 0018:ffff81007e0f3dc0  EFLAGS: 00010202
RAX: 00000000bfebfbff RBX: ffff81001f929c00 RCX: 0000000000000040
RDX: 0000000000000002 RSI: 0000000000000246 RDI: ffff81007f97b070
RBP: ffff81007f97b000 R08: ffff81007e5ac4f8 R09: 000000007e6da200
R10: 0000000000000000 R11: 0000000000000000 R12: ffff81007e5ac4f8
R13: ffff81007e5ac4f8 R14: ffff81007e5ac578 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000bfebfc37 CR3: 00000000700da000 CR4: 00000000000006e0
Process fc_wq_0 (pid: 434, threadinfo ffff81007e0f2000, task ffff810037fe17e0)
Stack:  ffff810037a0b1c0 ffff810037a0b1c0 ffff81007e5ac000 ffff810037b6c800
 0000000000000282 ffff810037b6c800 ffffffff880b7172 ffffffff880e87f1
 ffff81007e5ac000 ffff810037b6c800 ffff810037a09bc0 ffffffff880b71c5
Call Trace:
 [<ffffffff880b7172>] :scsi_transport_fc:fc_starget_delete+0x0/0xf5
 [<ffffffff880e87f1>] :qla2xxx:qla2x00_dev_loss_tmo_callbk+0xc2/0xef
 [<ffffffff880b71c5>] :scsi_transport_fc:fc_starget_delete+0x53/0xf5
 [<ffffffff8004d624>] run_workqueue+0x94/0xe4
 [<ffffffff80049e5f>] worker_thread+0x0/0x122
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049f4f>] worker_thread+0xf0/0x122
 [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 48 8b 58 38 e8 45 92 0f f8 48 8d 95 c8 01 00 00 48 89 c6 48
RIP  [<ffffffff880cc72b>] :qla2xxx:qla2x00_abort_fcport_cmds+0x8c/0x124
 RSP <ffff81007e0f3dc0>
CR2: 00000000bfebfc37
 <0>Kernel panic - not syncing: Fatal exception

So if you could send me the backported 5.5 patch which should address this, I could check whether this actually resolves the issue.

Comment 8 Andrius Benokraitis 2010-09-13 13:15:12 UTC
Chad - looks like we need a patch that applies cleanly to RHEL 5.5 for this ASAP.

Comment 9 Chad Dupuis (Cavium) 2010-09-13 17:06:33 UTC
Created attachment 446997 [details]
qla2xxx-Correct-use-after-free-issue-in-terminate-rport-io.patch

I was looked through the patches that have been applied to RHEL 5.6 and I believe we are hitting this condition.  Please try the attached patch to see if this fixes the race condition we are seeing.  I tested applying and compiling this patch against the RHEL 5.5. stock kernel.

Comment 10 Martin George 2010-09-14 17:17:30 UTC
(In reply to comment #9)
> Created attachment 446997 [details]
> qla2xxx-Correct-use-after-free-issue-in-terminate-rport-io.patch
> 
> I was looked through the patches that have been applied to RHEL 5.6 and I
> believe we are hitting this condition.  Please try the attached patch to see if
> this fixes the race condition we are seeing.  

Preliminary tests with this patch look good. Will run longer tests now and keep you posted on the results.

Comment 12 Chad Dupuis (Cavium) 2010-09-24 14:21:33 UTC
(In reply to comment #10)
> 
> Preliminary tests with this patch look good. Will run longer tests now and keep
> you posted on the results.

Just curious if this crash has been seen again.  If it has not, I'd recommend that this point fix be a candidate for a 5.5.z kernel.

Comment 14 Martin George 2010-10-07 19:34:10 UTC
(In reply to comment #12)
> Just curious if this crash has been seen again.  If it has not, I'd recommend
> that this point fix be a candidate for a 5.5.z kernel.

Not hit the crash so far with this patch. So please include it for the next 5.5.z release.

Comment 15 Martin George 2010-10-13 20:33:23 UTC
Has this been queued for the next 5.5.z release?

Comment 16 Chad Dupuis (Cavium) 2010-10-13 20:44:25 UTC
(In reply to comment #15)
> Has this been queued for the next 5.5.z release?

It will be queued up tomorrow.

Comment 17 Andrius Benokraitis 2010-10-14 15:32:39 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > Has this been queued for the next 5.5.z release?
> 
> It will be queued up tomorrow.

You mean, "queued up for 5.6" tomorrow, not 5.5.z.

Comment 18 Chad Dupuis (Cavium) 2010-10-14 16:44:29 UTC
> 
> You mean, "queued up for 5.6" tomorrow, not 5.5.z.

The patch has actually already been accepted for RHEL 5.6.  It was part of the commit "qla2xxx: update to 8.03.01.05.05.06-k".  For the z-stream I would presume we'd only want to back port the specific point fix for this issue.

Comment 19 Andrius Benokraitis 2010-10-14 16:53:06 UTC
(In reply to comment #18)
> > 
> > You mean, "queued up for 5.6" tomorrow, not 5.5.z.
> 
> The patch has actually already been accepted for RHEL 5.6.  It was part of the
> commit "qla2xxx: update to 8.03.01.05.05.06-k".  For the z-stream I would
> presume we'd only want to back port the specific point fix for this issue.

Chad, we need to follow the correct z-stream process here. Which bugzilla had this patch?

Comment 20 Chad Dupuis (Cavium) 2010-10-14 17:10:33 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > > 
> > > You mean, "queued up for 5.6" tomorrow, not 5.5.z.
> > 
> > The patch has actually already been accepted for RHEL 5.6.  It was part of the
> > commit "qla2xxx: update to 8.03.01.05.05.06-k".  For the z-stream I would
> > presume we'd only want to back port the specific point fix for this issue.
> 
> Chad, we need to follow the correct z-stream process here. Which bugzilla had
> this patch?

The fix was contained in this bz: Bug 567428  - [QLogic 5.6 FEAT] Update qla2xxx driver to version x.y.z

Comment 21 Andrius Benokraitis 2010-10-14 17:16:57 UTC
Chad, please post the discrete patch that fixes this issue in the actual z-stream bugzilla, bug 627216.

Comment 22 Andrius Benokraitis 2010-10-14 17:18:36 UTC

*** This bug has been marked as a duplicate of bug 627216 ***

Comment 23 Andrius Benokraitis 2010-10-14 17:47:28 UTC
(In reply to comment #22)
> 
> *** This bug has been marked as a duplicate of bug 627216 ***

This was an incorrect duplicate-to. A new z-stream will be created for this specific issue.

Comment 24 Jiri Pirko 2010-10-14 18:18:24 UTC

*** This bug has been marked as a duplicate of bug 643135 ***

Comment 25 Andrius Benokraitis 2010-10-14 18:26:22 UTC
Chad, can you please POST the discrete patch against this bug 628583 ASAP - we will correct the bugzilla when a new 5.5.z bugzilla is created from bug 567428. This new bugzilla will be duped from this.

Comment 26 Andrius Benokraitis 2010-10-14 18:28:11 UTC
New bugzilla created. Please post discrete patch to:

https://bugzilla.redhat.com/show_bug.cgi?id=643135


Note You need to log in before you can comment on or make changes to this bug.