Bug 644863 - [NetApp 5.6 bug] qla2xxx: Kernel panic on qla24xx_queuecommand
Summary: [NetApp 5.6 bug] qla2xxx: Kernel panic on qla24xx_queuecommand
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5.z
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 5.6
Assignee: Chad Dupuis (Cavium)
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 657029
Reported: 2010-10-20 13:34 UTC by Martin George
Modified: 2013-01-11 05:24 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 21:57:57 UTC
Target Upstream Version:
Embargoed:


Attachments
0001-qla2xxx-Clear-local-references-of-rport-on-device-lo.patch (1.79 KB, patch)
2010-10-21 18:47 UTC, Chad Dupuis (Cavium)
0001-qla2xxx-Clear-local-references-of-rport-on-device-l.patch version 2 (1.64 KB, patch)
2010-10-22 17:44 UTC, Chad Dupuis (Cavium)
/var/log/messages with QLogic verbose logging (312.83 KB, application/octet-stream)
2010-10-26 14:44 UTC, Martin George
qla2xxx-Add-check-for-null-fcport-in-qla24xx_queuec.patch (965 bytes, application/octet-stream)
2010-10-27 18:01 UTC, Chad Dupuis (Cavium)
0001-qla2xxx-Add-check-for-null-fcport-in-queuecommand-f.patch (1.18 KB, application/octet-stream)
2010-11-17 21:35 UTC, Chad Dupuis (Cavium)


Links
System ID: Red Hat Product Errata RHSA-2011:0017
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Description Martin George 2010-10-20 13:34:02 UTC
Description of problem:
Hit a kernel panic on a RHEL 5.5.z QLogic FC host during IO with controller faults, due to a NULL pointer dereference at qla24xx_queuecommand:

Unable to handle kernel NULL pointer dereference at 0000000000000060 RIP: 
 [<ffffffff880ce477>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dd
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: /class/fc_remote_ports/rport-1:0-1/scsi_target_id
CPU 2 
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth
lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr
iscsi_tcp bnx2i cnic ipv6 xfrm_
Pid: 433, comm: scsi_wq_0 Not tainted 2.6.18-194.11.1.el5.oct14.unblock.ver3 #1
RIP: 0010:[<ffffffff880ce477>]  [<ffffffff880ce477>]
:qla2xxx:qla24xx_queuecommand+0x1be/0x1dd
RSP: 0000:ffff81007e0eda50  EFLAGS: 00010002
RAX: 0000000000000002 RBX: ffff8100056ee080 RCX: 0000000000000190
RDX: ffff81007e0d8000 RSI: ffffffff880755a6 RDI: ffff81007e0d8060
RBP: ffff81007e5984f8 R08: 0000000000000286 R09: 0000000000000000
R10: ffff8100056ee140 R11: 0000000000000060 R12: ffff8100056ee080
R13: ffff81007e5984f8 R14: 0000000000000000 R15: ffffffff880755a6
FS:  0000000000000000(0000) GS:ffff81007ff1dec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000060 CR3: 0000000030267000 CR4: 00000000000006e0
Process scsi_wq_0 (pid: 433, threadinfo ffff81007e0ec000, task
ffff810037c1a100)
Stack:  ffff8100763f6048 ffff8100056ee080 ffff81007e598000 0000000000000287
 ffff8100763f6048 ffff810074b94178 ffff8100763f6048 ffffffff88075c61
 ffff810027f8e1d8 ffff8100056ee080 ffff810027f8e000 ffff81007e598000
Call Trace:
 [<ffffffff88075c61>] :scsi_mod:scsi_dispatch_cmd+0x26e/0x2ff
 [<ffffffff8807b260>] :scsi_mod:scsi_request_fn+0x2c1/0x390
 [<ffffffff80144fb3>] blk_execute_rq_nowait+0x86/0x9a
 [<ffffffff80145057>] blk_execute_rq+0x90/0xc0
 [<ffffffff8807aca5>] :scsi_mod:scsi_execute+0xd1/0xea
 [<ffffffff8807ad64>] :scsi_mod:scsi_execute_req+0xa6/0xcf
 [<ffffffff8807c05a>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9
 [<ffffffff8807ad37>] :scsi_mod:scsi_execute_req+0x79/0xcf
 [<ffffffff8807d275>] :scsi_mod:__scsi_scan_target+0x58a/0x5c7
 [<ffffffff8008c78b>] dequeue_task+0x18/0x37
 [<ffffffff8807d55b>] :

Version-Release number of selected component (if applicable):
RHEL 5.5.z (kernel-2.6.18-194.11.1.el5)
QLogic driver v8.03.01.04.05.05-k

How reproducible:
Hit it once so far

Additional info:
The 5.5.z kernel was patched with the following 2 fixes:
1) Bug 643135 - qla2xxx-Correct-use-after-free-issue-in-terminate-rport-io.patch
2) Bug 632195 - Mike Christie's reverted block state patch

Comment 1 Chad Dupuis (Cavium) 2010-10-20 14:28:21 UTC
A couple of additional questions:

1. Were there any perturbations happening at the time of the crash?

2. Did this crash happen during initial discovery or was this hit during some sort of reconfiguration?

Comment 2 Martin George 2010-10-21 08:57:00 UTC
(In reply to comment #1)
> A couple of additional questions:
> 
> 1. Were there any perturbations happening at the time of the crash?

No, nothing in particular.

> 
> 2. Did this crash happen during initial discovery or was this hit during some
> sort of reconfiguration?

No, not during initial discovery. This crash was hit during IO on dm-multipath devices configured on NetApp LUNs with target controller faults (where one target controller head takes over the partner head and then later relinquishes control back). This translates to target ports logging out/in to the fabric corresponding to paths getting offlined/onlined on the host.

Comment 3 Martin George 2010-10-21 09:17:31 UTC
And I've hit this panic again on the QLogic FC host. So that's the 2nd time I'm hitting it.

Comment 4 Chad Dupuis (Cavium) 2010-10-21 18:46:43 UTC
The offending code is here:

(gdb) l *qla24xx_queuecommand+0x1be
0x2477 is in qla24xx_queuecommand (/root/rhel5.5.z.bz644863/kernel/drivers/scsi/qla2xxx/qla_os.c:498).
493			cmd->result = rval;
494			goto qc24_fail_command;
495		}
496	
497		/* Close window on fcport/rport state-transitioning. */
498		if (fcport->drport) {
499			cmd->result = DID_IMM_RETRY << 16;
500			goto qc24_fail_command;
501		}
502	

So it is possible you may be running into this issue:
qla2xxx: Clear local references of rport on device loss timeout notification from FC transport.

I've attached the patch. Could you rerun the test with this patch applied?

Also, if you do hit this again could you enable extended error logging (ql2xextended_error_logging=1) and attach the messages file?
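
For illustration, a minimal userspace sketch of why a NULL fcport would fault at address 0x60 on the line-498 check. The struct below is hypothetical: its padding is chosen so that drport sits at offset 0x60, matching CR2 in the oops. The real fc_port_t layout is not shown in this report, so treat the offset as an assumption inferred from the fault address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for fc_port_t. Only the position of drport matters;
 * the 0x60 bytes of padding are an assumption taken from CR2 in the oops. */
struct sketch_fcport {
    char other_fields[0x60];
    void *drport;
};

/* Sanity check that the sketch really places drport at offset 0x60. */
static void check_sketch_layout(void)
{
    assert(offsetof(struct sketch_fcport, drport) == 0x60);
}

/* Address the CPU would read when evaluating base->drport. */
static unsigned long drport_read_address(unsigned long fcport_base)
{
    return fcport_base + offsetof(struct sketch_fcport, drport);
}
```

With fcport_base equal to 0 (a NULL fcport), the read lands on 0x60, which is the address reported in "Unable to handle kernel NULL pointer dereference at 0000000000000060".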

Comment 5 Chad Dupuis (Cavium) 2010-10-21 18:47:31 UTC
Created attachment 454909 [details]
0001-qla2xxx-Clear-local-references-of-rport-on-device-lo.patch

Comment 6 Martin George 2010-10-22 09:24:00 UTC
(In reply to comment #5)
> Created attachment 454909 [details]
> 0001-qla2xxx-Clear-local-references-of-rport-on-device-lo.patch

This patch does not apply cleanly to the RHEL 5.5.z (2.6.18-194.11.1.el5) kernel. 1 out of the 2 hunks failed with the qla_attr.c.rej showing the following:

***************
*** 2233,2241 ****
         * all local references.
         */
        spin_lock_irq(host->host_lock);
-       fcport->rport = NULL;
        *((fc_port_t **)rport->dd_data) = NULL;
        spin_unlock_irq(host->host_lock);
  }

  static void
--- 2225,2242 ----
         * all local references.
         */
        spin_lock_irq(host->host_lock);
+       fcport->rport = fcport->drport = NULL;
        *((fc_port_t **)rport->dd_data) = NULL;
        spin_unlock_irq(host->host_lock);
+
+       if (test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags)) {
+               return ;
+       }
+
+       if (unlikely(pci_channel_offline(fcport->ha->pdev))) {
+               qla2x00_abort_all_cmds(fcport->ha, DID_NO_CONNECT << 16);
+               return;
+       }
  }

  static void

Comment 7 Chad Dupuis (Cavium) 2010-10-22 17:44:02 UTC
Created attachment 455150 [details]
0001-qla2xxx-Clear-local-references-of-rport-on-device-l.patch version 2

Patch that applies using git am against 2.6.18-194.11.1.el5.

Comment 8 Martin George 2010-10-26 13:58:31 UTC
(In reply to comment #7)
> Created attachment 455150 [details]
> 0001-qla2xxx-Clear-local-references-of-rport-on-device-l.patch version 2
> 
> Patch that applies using git am against 2.6.18-194.11.1.el5.

Patch did not help. Again hit the panic during FC switch port block/unblock test when running IO on the QLogic FC host:

Code: 49 83 7e 60 00 0f 85 10 ff ff ff e9 1c ff ff ff 41 5c 5b 5d
RIP  [<ffffffff880ce477>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dd
 RSP <ffff81007e109a50>
CR2: 0000000000000060
 <0>Kernel panic - not syncing: Fatal exception
 <1>Unable to handle kernel NULL pointer dereference at 0000000000000005 RIP:
 [<0000000000000005>]
PGD 0
Oops: 0000 [2] SMP
last sysfs file: /block/dm-7/dev
CPU 3
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa
ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2
scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi
acpi_memhotplug ac parport_pc lp parport joydev sg i2c_i801 e752x_edac i2c_core edac_mc pcspkr tg3 serio_raw ide_cd cdrom dm_raid45
dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero
dm_mirror dm_log dm_mod ata_piix libata shpchp qla2xxx scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 433, comm: scsi_wq_0 Not tainted 2.6.18-194.11.1.el5.oct22.unblock.ver3 #1
RIP: 0010:[<0000000000000005>]  [<0000000000000005>]
RSP: 0000:ffff81003783ff90  EFLAGS: 00010006
RAX: ffff81007e109fd8 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 00000000000001b1 RSI: 00000000000000ff RDI: 00000000000000ff
RBP: ffff81007e109700 R08: 0000000000000003 R09: 000000000000003d
R10: ffff81007e1096d8 R11: 0000000000000000 R12: 0000000000000005
R13: 00000000ffffff03 R14: ffff81007e1099a8 R15: ffff810037e06080
FS:  0000000000000000(0000) GS:ffff81007ff1d6c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000005 CR3: 0000000000201000 CR4: 00000000000006e0
Process scsi_wq_0 (pid: 433, threadinfo ffff81007e108000, task ffff810037e06080)
Stack:  ffffffff80022fec ffffffff802a643f 0000000000000060 0000000000000000
 ffffffff8005dc22 ffff81007e109700 <EOI>  0000000000000000 0000000000000000
 ffff81007e1096d8 000000000000003d 0000000000000003 00000000000000ff
Call Trace:
 <IRQ>  [<ffffffff80022fec>] smp_call_function_interrupt+0x57/0x75
 [<ffffffff8005dc22>] call_function_interrupt+0x66/0x6c
 <EOI>  [<ffffffff80076712>] smp_send_stop+0x9e/0xa4
 [<ffffffff800766e0>] smp_send_stop+0x6c/0xa4
 [<ffffffff80091a61>] panic+0x94/0x1eb
 [<ffffffff80065157>] __die+0xf6/0xff
 [<ffffffff80064ffa>] oops_end+0x51/0x53
 [<ffffffff80066df0>] do_page_fault+0x766/0x874
 [<ffffffff8001a5e5>] vsnprintf+0x400/0x62f
 [<ffffffff8001724b>] release_console_sem+0x1ba/0x20e
 [<ffffffff880755a6>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff880755a6>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff880755a6>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff880ce477>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dd
 [<ffffffff88075c61>] :scsi_mod:scsi_dispatch_cmd+0x26e/0x2ff
 [<ffffffff8807b260>] :scsi_mod:scsi_request_fn+0x2c1/0x390
 [<ffffffff80144fb3>] blk_execute_rq_nowait+0x86/0x9a
 [<ffffffff80145057>] blk_execute_rq+0x90/0xc0
 [<ffffffff8807aca5>] :scsi_mod:scsi_execute+0xd1/0xea
 [<ffffffff8807ad64>] :scsi_mod:scsi_execute_req+0xa6/0xcf
 [<ffffffff8807c05a>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9
 [<ffffffff8807ad37>] :scsi_mod:scsi_execute_req+0x79/0xcf
 [<ffffffff8807d275>] :scsi_mod:__scsi_scan_target+0x58a/0x5c7
 [<ffffffff8807d55b>] :scsi_mod:scsi_scan_target+0x6c/0x83
 [<ffffffff880b7267>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x85
 [<ffffffff880b72cc>] :scsi_transport_fc:fc_scsi_scan_rport+0x65/0x85
 [<ffffffff8004d624>] run_workqueue+0x94/0xe4
 [<ffffffff80049e5f>] worker_thread+0x0/0x122
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049f4f>] worker_thread+0xf0/0x122
 [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code:  Bad RIP value.
RIP  [<0000000000000005>]
 RSP <ffff81003783ff90>
CR2: 0000000000000005
 <0>Kernel panic - not syncing: Fatal exception
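
As a cross-check on the qla24xx_queuecommand oops above, the first five bytes of its Code: line (49 83 7e 60 00) can be decoded by hand. The sketch below assumes the Code: bytes begin at the faulting RIP, and it handles only the single x86-64 encoding needed here (REX prefix, opcode 0x83 /7, one-byte displacement, no SIB byte). The struct and function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of decoding "cmp $imm8, disp8(%reg)". */
struct cmp_mem_imm8 {
    int is_64bit;   /* 1 if REX.W is set (64-bit operand) */
    int base_reg;   /* register number 0..15; 14 means R14 */
    int8_t disp;    /* signed 8-bit displacement */
    int8_t imm;     /* signed 8-bit immediate */
};

/* Decode REX + 0x83 /7 with mod=01 and no SIB byte.
 * Returns 0 on success, -1 if the bytes are any other form. */
static int decode_cmp_disp8_imm8(const uint8_t *b, struct cmp_mem_imm8 *out)
{
    assert(b != NULL && out != NULL);

    uint8_t rex = b[0], opcode = b[1], modrm = b[2];
    int mod = modrm >> 6;
    int reg = (modrm >> 3) & 7;
    int rm = modrm & 7;

    if ((rex & 0xf0) != 0x40 || opcode != 0x83)
        return -1;
    if (mod != 1 || reg != 7 || rm == 4)   /* rm == 4 would mean a SIB byte */
        return -1;

    out->is_64bit = (rex >> 3) & 1;          /* REX.W */
    out->base_reg = rm | ((rex & 1) << 3);   /* REX.B extends the base register */
    out->disp = (int8_t)b[3];
    out->imm = (int8_t)b[4];
    return 0;
}
```

For these bytes the decode gives a 64-bit compare of the memory at 0x60(%r14) against 0. The register dump in the Description shows R14 = 0 and CR2 = 0x60, which is consistent with the `if (fcport->drport)` test at qla_os.c:498 being evaluated with a NULL fcport.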

Comment 9 Martin George 2010-10-26 14:44:23 UTC
Created attachment 455781 [details]
/var/log/messages with QLogic verbose logging

Comment 10 Chad Dupuis (Cavium) 2010-10-27 18:01:26 UTC
Created attachment 456041 [details]
qla2xxx-Add-check-for-null-fcport-in-qla24xx_queuec.patch

Looking at other bz's, this issue matches the following one almost exactly:
https://bugzilla.redhat.com/show_bug.cgi?id=604134.

Please try the attached patch that checks for a NULL fcport before actually queuing the command to the firmware.
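
The attachment itself is not reproduced in this report, so the following is only a rough userspace sketch of what "check for a NULL fcport before queuing" could look like. All sketch_* names are hypothetical, how fcport comes to be NULL is not modelled, and the use of DID_NO_CONNECT for the NULL case is an assumption; the actual patch may set a different result. The DID_* values are the ones defined in the Linux SCSI headers.

```c
#include <assert.h>
#include <stddef.h>

#define DID_NO_CONNECT 0x01   /* host byte: could not connect */
#define DID_IMM_RETRY  0x0c   /* host byte: retry immediately */

/* Hypothetical stand-ins for the driver's fc_port_t and struct scsi_cmnd. */
struct sketch_fcport { void *drport; };
struct sketch_cmd    { int result; struct sketch_fcport *fcport; };

/* Returns 0 if the command would be handed to the firmware,
 * 1 if it is completed immediately with a failure result. */
static int sketch_queuecommand(struct sketch_cmd *cmd)
{
    assert(cmd != NULL);

    /* New guard: bail out before fcport is dereferenced.
     * Result code is an assumption, see the note above. */
    if (cmd->fcport == NULL) {
        cmd->result = DID_NO_CONNECT << 16;
        return 1;
    }

    /* Existing check from qla_os.c:498, now safe to evaluate. */
    if (cmd->fcport->drport) {
        cmd->result = DID_IMM_RETRY << 16;
        return 1;
    }

    cmd->result = 0;
    return 0;
}
```

The point of the ordering is that the NULL test has to come before the first use of fcport; placing it after the drport check would leave the original fault in place.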

Comment 11 Martin George 2010-10-27 18:13:01 UTC
Should I discard the 1st patch? Or use both together?

Comment 12 Chad Dupuis (Cavium) 2010-10-27 18:44:04 UTC
(In reply to comment #11)
> Should I discard the 1st patch? Or use both together?

Please discard the first one.

Comment 13 Martin George 2010-11-04 10:51:41 UTC
Patch looks good. Not hit the panic so far. 

Hope this is being queued for inclusion in RHEL 5.5.z & 5.6.

Comment 15 RHEL Program Management 2010-11-04 14:19:33 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 Martin George 2010-11-09 04:54:12 UTC
(In reply to comment #13)
> Patch looks good. Not hit the panic so far. 
> 
> Hope this is being queued for inclusion in RHEL 5.5.z & 5.6.

Chad,

Do you have any updates on this?

Comment 17 Chad Dupuis (Cavium) 2010-11-09 15:37:36 UTC
(In reply to comment #16)
> (In reply to comment #13)
> > Patch looks good. Not hit the panic so far. 
> > 
> > Hope this is being queued for inclusion in RHEL 5.5.z & 5.6.
> 
> Chad,
> 
> Do you have any updates on this?

Yes, our recommendation would be to apply this to RHEL 5.6.  I'm going to post this patch internally for Red Hat's review.

Comment 18 Andrius Benokraitis 2010-11-15 14:23:35 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > (In reply to comment #13)
> > > Patch looks good. Not hit the panic so far. 
> > > 
> > > Hope this is being queued for inclusion in RHEL 5.5.z & 5.6.
> > 
> > Chad,
> > 
> > Do you have any updates on this?
> 
> Yes, our recommendation would be to apply this to RHEL 5.6.  I'm going to post
> this patch internally for Red Hat's review.

Chad, this will go in this bugzilla, correct?

Comment 19 Chad Dupuis (Cavium) 2010-11-15 15:37:04 UTC
> Chad, this will go in this bugzilla, correct?

Yes.

Comment 20 Chad Dupuis (Cavium) 2010-11-17 21:35:49 UTC
Created attachment 461166 [details]
0001-qla2xxx-Add-check-for-null-fcport-in-queuecommand-f.patch

Add the check for null fcport to qla2x00_queuecommand() in addition to qla24xx_queuecommand().

Comment 21 Martin George 2010-11-22 12:18:37 UTC
(In reply to comment #20)
> Created attachment 461166 [details]
> 0001-qla2xxx-Add-check-for-null-fcport-in-queuecommand-f.patch
> 
> Add the check for null fcport to qla2x00_queuecommand() in addition to
> qla24xx_queuecommand().

With this updated patch, the host has so far survived 12 hours of sequential FC switch initiator port block/unblock tests with IO running.

Comment 22 Andrius Benokraitis 2010-11-22 14:41:19 UTC
(In reply to comment #20)
> Created attachment 461166 [details]
> 0001-qla2xxx-Add-check-for-null-fcport-in-queuecommand-f.patch
> 
> Add the check for null fcport to qla2x00_queuecommand() in addition to
> qla24xx_queuecommand().

Chad, was this the patch that was POSTed?

Comment 23 Chad Dupuis (Cavium) 2010-11-22 15:34:52 UTC
> 
> Chad, was this the patch that was POSTed?

Yes, http://post-office.corp.redhat.com/archives/rhkernel-list/2010-November/msg00970.html.

Comment 25 Chris Ward 2010-11-23 09:53:33 UTC
Thanks NetApp for the test feedback! In the future, it would help us out if when informing us of successful test verification, you'd also add 'NetApp' to the Verified field above. Thanks! Very much appreciated.

Comment 26 Jarod Wilson 2010-11-23 17:05:36 UTC
The fix is in kernel-2.6.18-233.el5.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 30 errata-xmlrpc 2011-01-13 21:57:57 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

