Description of problem:

Hit a kernel panic on a RHEL 5.5.z QLogic FC host during IO with controller faults, due to a NULL pointer dereference at qla24xx_queuecommand:

Unable to handle kernel NULL pointer dereference at 0000000000000060 RIP:
[<ffffffff880ce477>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dd
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /class/fc_remote_ports/rport-1:0-1/scsi_target_id
CPU 2
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_
Pid: 433, comm: scsi_wq_0 Not tainted 2.6.18-194.11.1.el5.oct14.unblock.ver3 #1
RIP: 0010:[<ffffffff880ce477>] [<ffffffff880ce477>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dd
RSP: 0000:ffff81007e0eda50 EFLAGS: 00010002
RAX: 0000000000000002 RBX: ffff8100056ee080 RCX: 0000000000000190
RDX: ffff81007e0d8000 RSI: ffffffff880755a6 RDI: ffff81007e0d8060
RBP: ffff81007e5984f8 R08: 0000000000000286 R09: 0000000000000000
R10: ffff8100056ee140 R11: 0000000000000060 R12: ffff8100056ee080
R13: ffff81007e5984f8 R14: 0000000000000000 R15: ffffffff880755a6
FS: 0000000000000000(0000) GS:ffff81007ff1dec0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000060 CR3: 0000000030267000 CR4: 00000000000006e0
Process scsi_wq_0 (pid: 433, threadinfo ffff81007e0ec000, task ffff810037c1a100)
Stack: ffff8100763f6048 ffff8100056ee080 ffff81007e598000 0000000000000287
 ffff8100763f6048 ffff810074b94178 ffff8100763f6048 ffffffff88075c61
 ffff810027f8e1d8 ffff8100056ee080 ffff810027f8e000 ffff81007e598000
Call Trace:
 [<ffffffff88075c61>] :scsi_mod:scsi_dispatch_cmd+0x26e/0x2ff
 [<ffffffff8807b260>] :scsi_mod:scsi_request_fn+0x2c1/0x390
 [<ffffffff80144fb3>] blk_execute_rq_nowait+0x86/0x9a
 [<ffffffff80145057>] blk_execute_rq+0x90/0xc0
 [<ffffffff8807aca5>] :scsi_mod:scsi_execute+0xd1/0xea
 [<ffffffff8807ad64>] :scsi_mod:scsi_execute_req+0xa6/0xcf
 [<ffffffff8807c05a>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9
 [<ffffffff8807ad37>] :scsi_mod:scsi_execute_req+0x79/0xcf
 [<ffffffff8807d275>] :scsi_mod:__scsi_scan_target+0x58a/0x5c7
 [<ffffffff8008c78b>] dequeue_task+0x18/0x37
 [<ffffffff8807d55b>] :

Version-Release number of selected component (if applicable):
RHEL 5.5.z (kernel-2.6.18-194.11.1.el5)
QLogic driver v8.03.01.04.05.05-k

How reproducible:
Hit it once so far

Additional info:
The 5.5.z kernel was patched with the following 2 fixes:
1) Bug 643135 - qla2xxx-Correct-use-after-free-issue-in-terminate-rport-io.patch
2) Bug 632195 - Mike Christie's reverted block state patch
A couple of additional questions:

1. Were there any perturbations happening at the time of the crash?

2. Did this crash happen during initial discovery or was this hit during some sort of reconfiguration?
(In reply to comment #1)
> A couple of additional questions:
>
> 1. Were there any perturbations happening at the time of the crash?

No, nothing in particular.

> 2. Did this crash happen during initial discovery or was this hit during some
> sort of reconfiguration?

No, not during initial discovery. This crash was hit during IO on dm-multipath devices configured on NetApp LUNs with target controller faults (where one target controller head takes over the partner head and then later relinquishes control back). This translates to target ports logging out of and back into the fabric, with the corresponding paths getting offlined/onlined on the host.
And I've hit this panic again on the QLogic FC host. So that's the 2nd time I'm hitting it.
The offending code is here:

(gdb) l *qla24xx_queuecommand+0x1be
0x2477 is in qla24xx_queuecommand (/root/rhel5.5.z.bz644863/kernel/drivers/scsi/qla2xxx/qla_os.c:498).
493                     cmd->result = rval;
494                     goto qc24_fail_command;
495             }
496
497             /* Close window on fcport/rport state-transitioning. */
498             if (fcport->drport) {
499                     cmd->result = DID_IMM_RETRY << 16;
500                     goto qc24_fail_command;
501             }
502

So it is possible that you are running into this issue:

qla2xxx: Clear local references of rport on device loss timeout notification from FC transport.

I've attached the patch. Could you rerun the test with this patch applied? Also, if you do hit this again, could you enable extended error logging (ql2xextended_error_logging=1) and attach the messages file?
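For what it's worth, the register dump in the original report is consistent with fcport itself being NULL at line 498: R14 is 0 and CR2 is 0x60, which is the address you would expect if drport sat 0x60 bytes into the structure and the base pointer were zero. The sketch below only illustrates that arithmetic. The struct fake_fcport and its padding are made up to reproduce the 0x60 offset; it is not the driver's real fc_port_t layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the driver's fc_port_t. Only the 0x60 offset of
 * drport is taken from the oops (CR2); the padding member is invented so
 * that this toy layout reproduces that offset.
 */
struct fake_fcport {
    unsigned char earlier_members[0x60]; /* placeholder for whatever precedes drport */
    void *drport;
};

/*
 * Address the CPU would touch when evaluating "fcport->drport" for a given
 * fcport base address. Computed arithmetically; nothing is dereferenced.
 */
uintptr_t drport_access_address(uintptr_t fcport_base)
{
    return fcport_base + offsetof(struct fake_fcport, drport);
}

/* With a NULL (zero) base, the access lands on 0x60, matching CR2 in the oops. */
void check_against_oops(void)
{
    assert(drport_access_address(0) == 0x60);
}
```

Calling check_against_oops() passes silently, i.e. a NULL fcport plus a 0x60 member offset gives exactly the faulting address reported in CR2.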
Created attachment 454909 [details] 0001-qla2xxx-Clear-local-references-of-rport-on-device-lo.patch
(In reply to comment #5)
> Created attachment 454909 [details]
> 0001-qla2xxx-Clear-local-references-of-rport-on-device-lo.patch

This patch does not apply cleanly to the RHEL 5.5.z (2.6.18-194.11.1.el5) kernel. 1 out of the 2 hunks failed with the qla_attr.c.rej showing the following:

***************
*** 2233,2241 ****
           * all local references.
           */
          spin_lock_irq(host->host_lock);
-         fcport->rport = NULL;
          *((fc_port_t **)rport->dd_data) = NULL;
          spin_unlock_irq(host->host_lock);
  }

  static void
--- 2225,2242 ----
           * all local references.
           */
          spin_lock_irq(host->host_lock);
+         fcport->rport = fcport->drport = NULL;
          *((fc_port_t **)rport->dd_data) = NULL;
          spin_unlock_irq(host->host_lock);
+
+         if (test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags)) {
+                 return ;
+         }
+
+         if (unlikely(pci_channel_offline(fcport->ha->pdev))) {
+                 qla2x00_abort_all_cmds(fcport->ha, DID_NO_CONNECT << 16);
+                 return;
+         }
  }

  static void
Created attachment 455150 [details] 0001-qla2xxx-Clear-local-references-of-rport-on-device-l.patch version 2 Patch that applies using git am against 2.6.18-194.11.1.el5.
(In reply to comment #7)
> Created attachment 455150 [details]
> 0001-qla2xxx-Clear-local-references-of-rport-on-device-l.patch version 2
>
> Patch that applies using git am against 2.6.18-194.11.1.el5.

The patch did not help. I hit the panic again during an FC switch port block/unblock test while running IO on the QLogic FC host:

Code: 49 83 7e 60 00 0f 85 10 ff ff ff e9 1c ff ff ff 41 5c 5b 5d
RIP [<ffffffff880ce477>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dd
RSP <ffff81007e109a50>
CR2: 0000000000000060
<0>Kernel panic - not syncing: Fatal exception
<1>Unable to handle kernel NULL pointer dereference at 0000000000000005 RIP:
[<0000000000000005>]
PGD 0
Oops: 0000 [2] SMP
last sysfs file: /block/dm-7/dev
CPU 3
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg i2c_i801 e752x_edac i2c_core edac_mc pcspkr tg3 serio_raw ide_cd cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp qla2xxx scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 433, comm: scsi_wq_0 Not tainted 2.6.18-194.11.1.el5.oct22.unblock.ver3 #1
RIP: 0010:[<0000000000000005>] [<0000000000000005>]
RSP: 0000:ffff81003783ff90 EFLAGS: 00010006
RAX: ffff81007e109fd8 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 00000000000001b1 RSI: 00000000000000ff RDI: 00000000000000ff
RBP: ffff81007e109700 R08: 0000000000000003 R09: 000000000000003d
R10: ffff81007e1096d8 R11: 0000000000000000 R12: 0000000000000005
R13: 00000000ffffff03 R14: ffff81007e1099a8 R15: ffff810037e06080
FS: 0000000000000000(0000) GS:ffff81007ff1d6c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000005 CR3: 0000000000201000 CR4: 00000000000006e0
Process scsi_wq_0 (pid: 433, threadinfo ffff81007e108000, task ffff810037e06080)
Stack: ffffffff80022fec ffffffff802a643f 0000000000000060 0000000000000000
 ffffffff8005dc22 ffff81007e109700 <EOI> 0000000000000000 0000000000000000
 ffff81007e1096d8 000000000000003d 0000000000000003 00000000000000ff
Call Trace:
 <IRQ> [<ffffffff80022fec>] smp_call_function_interrupt+0x57/0x75
 [<ffffffff8005dc22>] call_function_interrupt+0x66/0x6c
 <EOI> [<ffffffff80076712>] smp_send_stop+0x9e/0xa4
 [<ffffffff800766e0>] smp_send_stop+0x6c/0xa4
 [<ffffffff80091a61>] panic+0x94/0x1eb
 [<ffffffff80065157>] __die+0xf6/0xff
 [<ffffffff80064ffa>] oops_end+0x51/0x53
 [<ffffffff80066df0>] do_page_fault+0x766/0x874
 [<ffffffff8001a5e5>] vsnprintf+0x400/0x62f
 [<ffffffff8001724b>] release_console_sem+0x1ba/0x20e
 [<ffffffff880755a6>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff880755a6>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff880755a6>] :scsi_mod:scsi_done+0x0/0x18
 [<ffffffff880ce477>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dd
 [<ffffffff88075c61>] :scsi_mod:scsi_dispatch_cmd+0x26e/0x2ff
 [<ffffffff8807b260>] :scsi_mod:scsi_request_fn+0x2c1/0x390
 [<ffffffff80144fb3>] blk_execute_rq_nowait+0x86/0x9a
 [<ffffffff80145057>] blk_execute_rq+0x90/0xc0
 [<ffffffff8807aca5>] :scsi_mod:scsi_execute+0xd1/0xea
 [<ffffffff8807ad64>] :scsi_mod:scsi_execute_req+0xa6/0xcf
 [<ffffffff8807c05a>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9
 [<ffffffff8807ad37>] :scsi_mod:scsi_execute_req+0x79/0xcf
 [<ffffffff8807d275>] :scsi_mod:__scsi_scan_target+0x58a/0x5c7
 [<ffffffff8807d55b>] :scsi_mod:scsi_scan_target+0x6c/0x83
 [<ffffffff880b7267>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x85
 [<ffffffff880b72cc>] :scsi_transport_fc:fc_scsi_scan_rport+0x65/0x85
 [<ffffffff8004d624>] run_workqueue+0x94/0xe4
 [<ffffffff80049e5f>] worker_thread+0x0/0x122
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80049f4f>] worker_thread+0xf0/0x122
 [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: Bad RIP value.
RIP [<0000000000000005>]
RSP <ffff81003783ff90>
CR2: 0000000000000005
<0>Kernel panic - not syncing: Fatal exception
Created attachment 455781 [details] /var/log/messages with QLogic verbose logging
Created attachment 456041 [details]
qla2xxx-Add-check-for-null-fcport-in-qla24xx_queuec.patch

Looking at other BZs, this issue matches the following one almost exactly: https://bugzilla.redhat.com/show_bug.cgi?id=604134.

Please try the attached patch, which checks for a NULL fcport before actually queuing the command to the firmware.
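To make the idea of the check concrete, here is a minimal user-space sketch of a queuecommand-style entry point that tests fcport for NULL before touching fcport->drport. It is not the attached patch. All sketch_* names are hypothetical, the types are heavily simplified stand-ins for the kernel structures, and completing the command with DID_NO_CONNECT on a NULL fcport is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative host-byte codes for this sketch; see the kernel's scsi headers for the real ones. */
#define DID_NO_CONNECT 0x01
#define DID_IMM_RETRY  0x0c

/* Hypothetical, simplified stand-ins for fc_port_t and struct scsi_cmnd. */
struct sketch_fcport {
    void *drport;                  /* non-NULL while the rport is being torn down */
};

struct sketch_cmd {
    int result;                    /* host byte lives in bits 16..23, as in the driver */
    struct sketch_fcport *fcport;  /* may have been cleared by a concurrent rport teardown */
};

enum { SKETCH_QUEUED = 0, SKETCH_FAILED_EARLY = 1 };

/*
 * Sketch of the guard: bail out before any fcport member is read.
 * The real driver completes the command through its done callback and
 * returns 0; a status code is returned here only so the outcome can be checked.
 */
int sketch_queuecommand(struct sketch_cmd *cmd)
{
    assert(cmd != NULL);
    struct sketch_fcport *fcport = cmd->fcport;

    if (fcport == NULL) {
        /* Assumption: fail the command as "no connection" instead of oopsing. */
        cmd->result = DID_NO_CONNECT << 16;
        return SKETCH_FAILED_EARLY;
    }

    /* Existing check from qla_os.c:498, now safe because fcport is known non-NULL. */
    if (fcport->drport != NULL) {
        cmd->result = DID_IMM_RETRY << 16;
        return SKETCH_FAILED_EARLY;
    }

    cmd->result = 0;               /* would be handed to the firmware here */
    return SKETCH_QUEUED;
}
```

The ordering is the point: the NULL test has to come before the drport test, since the drport test is the statement that faulted in the oops.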
Should I discard the 1st patch? Or use both together?
(In reply to comment #11) > Should I discard the 1st patch? Or use both together? Please discard the first one.
Patch looks good. Not hit the panic so far. Hope this is being queued for inclusion in RHEL 5.5.z & 5.6.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #13) > Patch looks good. Not hit the panic so far. > > Hope this is being queued for inclusion in RHEL 5.5.z & 5.6. Chad, Do you have any updates on this?
(In reply to comment #16)
> (In reply to comment #13)
> > Patch looks good. Not hit the panic so far.
> >
> > Hope this is being queued for inclusion in RHEL 5.5.z & 5.6.
>
> Chad,
>
> Do you have any updates on this?

Yes, our recommendation would be to apply this to RHEL 5.6. I'm going to post this patch internally for Red Hat's review.
(In reply to comment #17)
> (In reply to comment #16)
> > (In reply to comment #13)
> > > Patch looks good. Not hit the panic so far.
> > >
> > > Hope this is being queued for inclusion in RHEL 5.5.z & 5.6.
> >
> > Chad,
> >
> > Do you have any updates on this?
>
> Yes, our recommendation would be to apply this to RHEL 5.6. I'm going to post
> this patch internally for Red Hat's review.

Chad, this will go in this bugzilla, correct?
> Chad, this will go in this bugzilla, correct? Yes.
Created attachment 461166 [details] 0001-qla2xxx-Add-check-for-null-fcport-in-queuecommand-f.patch Add the check for null fcport to qla2x00_queuecommand() in addition to qla24xx_queuecommand().
(In reply to comment #20) > Created attachment 461166 [details] > 0001-qla2xxx-Add-check-for-null-fcport-in-queuecommand-f.patch > > Add the check for null fcport to qla2x00_queuecommand() in addition to > qla24xx_queuecommand(). With this updated patch, the host has survived 12 hour sequential FC switch initiator port block/unblock tests with IO running so far.
(In reply to comment #20) > Created attachment 461166 [details] > 0001-qla2xxx-Add-check-for-null-fcport-in-queuecommand-f.patch > > Add the check for null fcport to qla2x00_queuecommand() in addition to > qla24xx_queuecommand(). Chad, was this the patch that was POSTed?
> > Chad, was this the patch that was POSTed? Yes, http://post-office.corp.redhat.com/archives/rhkernel-list/2010-November/msg00970.html.
Thanks NetApp for the test feedback! In the future, it would help us out if when informing us of successful test verification, you'd also add 'NetApp' to the Verified field above. Thanks! Very much appreciated.
in kernel-2.6.18-233.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html