Bug 604134

Summary: [NetApp 5.5 bug] Kernel panic hit on RHEL 5.5 FC host with QLogic external driver
Product: Red Hat Enterprise Linux 5 Reporter: Martin George <marting>
Component: kernel Assignee: Chad Dupuis (Cavium) <cdupuis>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.5.z CC: andrew.vasquez, andriusb, coughlan, lalit.chandivade, xdl-redhat-bugzilla
Target Milestone: rc Keywords: OtherQA
Target Release: 5.6   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2010-06-15 18:31:31 UTC Type: ---
Bug Blocks: 557597    

Description Martin George 2010-06-15 13:37:43 UTC
Description of problem:
RHEL 5.5 host with QLogic FC adapter & external driver (due to bug 598946) panics during FC switch port enable/disable as shown below:

sd 0:0:1:49: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdgt, sector 6310576
sd 0:0:1:49: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdgt, sector 504256
sd 0:0:1:49: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdgt, sector 505088
Unable to handle kernel NULL pointer dereference at 0000000000000060 RIP:
 [<ffffffff880ce45d>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dc
PGD 5f386067 PUD 5ee62067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /block/dm-21/dev
CPU 0
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd
 sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2
i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libisc
si_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs
power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug a
c parport_pc lp parport floppy sg ide_cd e752x_edac edac_mc cdrom pcspkr i2c_i80
1 i2c_core tg3 serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_rou
nd_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot
dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp qla2xxx(U) scsi_transport
_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 433, comm: scsi_wq_0 Tainted: G      2.6.18-194.3.1.el5 #1
RIP: 0010:[<ffffffff880ce45d>]  [<ffffffff880ce45d>] :qla2xxx:qla24xx_queuecomma
nd+0x1be/0x1dc
RSP: 0018:ffff81007e10ba50  EFLAGS: 00010002
RAX: 0000000000000002 RBX: ffff81001f3f2680 RCX: 0000000000000190
RDX: ffff81007e38f000 RSI: ffffffff880755a6 RDI: ffff81007e38f060
RBP: ffff81007ff504f8 R08: 0000000000000282 R09: 0000000000000000
R10: ffff81001f3f2740 R11: 0000000000000060 R12: ffff81001f3f2680
R13: ffff81007ff504f8 R14: 0000000000000000 R15: ffffffff880755a6
FS:  0000000000000000(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000060 CR3: 000000005e7fe000 CR4: 00000000000006e0
Process scsi_wq_0 (pid: 433, threadinfo ffff81007e10a000, task ffff810037fe1080)

Stack:  ffff810076e8ec98 ffff81001f3f2680 ffff81007ff50000 0000000000000287
 ffff810076e8ec98 ffff810023264e38 ffff810076e8ec98 ffffffff88075c61
 ffff81007bc991d8 ffff81001f3f2680 ffff81007bc99000 ffff81007ff50000
Call Trace:
 [<ffffffff88075c61>] :scsi_mod:scsi_dispatch_cmd+0x26e/0x2ff
 [<ffffffff8807b174>] :scsi_mod:scsi_request_fn+0x2c1/0x390
 [<ffffffff80144be6>] blk_execute_rq_nowait+0x86/0x9a
 [<ffffffff80144c8a>] blk_execute_rq+0x90/0xc0
 [<ffffffff8807abbb>] :scsi_mod:scsi_execute+0xd1/0xeb
 [<ffffffff8807ac7a>] :scsi_mod:scsi_execute_req+0xa5/0xce
 [<ffffffff8807bf6e>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9
 [<ffffffff8807ac4d>] :scsi_mod:scsi_execute_req+0x78/0xce
 [<ffffffff8807d189>] :scsi_mod:__scsi_scan_target+0x58a/0x5c7
 [<ffffffff8008c871>] dequeue_task+0x18/0x37
 [<ffffffff8807d46f>] :scsi_mod:scsi_scan_target+0x6c/0x83
 [<ffffffff880b7267>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x85
 [<ffffffff880b72cc>] :scsi_transport_fc:fc_scsi_scan_rport+0x65/0x85
 [<ffffffff8004d8f0>] run_workqueue+0x94/0xe4
 [<ffffffff8004a12b>] worker_thread+0x0/0x122
 [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a21b>] worker_thread+0xf0/0x122
 [<ffffffff8008d087>] default_wake_function+0x0/0xe
 [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032894>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032796>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 49 83 7e 60 00 0f 85 10 ff ff ff e9 1c ff ff ff 5e 5b 5d 41
RIP  [<ffffffff880ce45d>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dc
 RSP <ffff81007e10ba50>
CR2: 0000000000000060
 <0>Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
RHEL 5.5 Errata v2.6.18-194.3.1.el5
QLE2562 FW:v5.03.02 DVR:v8.03.01.06.05.06-k

How reproducible:
Intermittent.

Comment 1 Andrius Benokraitis 2010-06-15 14:31:40 UTC
Red Hat has no means to test on external drivers. If another bugzilla already reported this with the inbox driver, this can be closed.

Comment 2 Andrius Benokraitis 2010-06-15 18:31:31 UTC
Let's have QLogic look into the inbox driver issue first.

*** This bug has been marked as a duplicate of bug 598946 ***

Comment 3 Andrew Vasquez 2010-06-30 18:16:52 UTC
Martin,

The test driver Lalit sent to you has the following
single-line change:

> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 15f1f79..08de61d 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -510,7 +510,7 @@ qla24xx_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
>  	}
>  
>  	/* Close window on fcport/rport state-transitioning. */
> -	if (fcport->drport) {
> +	if (!fcport || fcport->drport) {
>  		cmd->result = DID_IMM_RETRY << 16;
>  		goto qc24_fail_command;
>  	}

but...that really just works around a larger problem, as the
fcport is derived from the scsi-device's hostdata scratchpad:

	static int
	qla2x00_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
	{
		scsi_qla_host_t *ha = to_qla_host(cmd->device->host);
		fc_port_t *fcport = (struct fc_port *) cmd->device->hostdata;
		struct fc_rport *rport = starget_to_rport(scsi_target(cmd->device));
		srb_t *sp;
		int rval;

hostdata is cleared only when slave_destroy() is called by the
midlayer:

	static void
	qla2xxx_slave_destroy(struct scsi_device *sdev)
	{
		sdev->hostdata = NULL;
	}

I wouldn't expect the midlayer to send down (via queuecommand())
requests for a reaped scsi-device.  We can add the workaround
code, but we'd need to understand why the midlayer is sending
these scsi-commands down in the first place.

Comment 4 Chad Dupuis (Cavium) 2010-06-30 20:08:21 UTC
This bz should probably be reopened as it's actually not a duplicate of 598946.

Comment 5 Andrius Benokraitis 2010-06-30 20:19:12 UTC
(In reply to comment #4)
> This bz should probably be reopened as it's actually not a duplicate of 598946.    

We don't usually troubleshoot out-of-box drivers, so although this is CLOSED as a dupe, it should really be CLOSED WONTFIX. The only reason this was closed as a dupe was because we were under the impression a firmware update would clear up both inbox and out-of-box drivers.

Comment 6 Chad Dupuis (Cavium) 2010-06-30 21:40:22 UTC
> We don't usually troubleshoot out-of-box drivers, so although this is CLOSED as
> a dupe, it should really be CLOSED WONTFIX. The only reason this was closed as
> a dupe was because we were under the impression a firmware update would clear
> up both inbox and out-of-box drivers.    

Our concern here is that this could also affect the RHEL 5.6 inbox driver.  Would it be more appropriate to open another bz for RHEL 5.6?

Comment 7 Andrius Benokraitis 2010-06-30 21:51:29 UTC
> Our concern here is that could also affect RHEL 5.6 inbox.  Would it be more
> appropriate to open another bz for RHEL 5.6?    

I'm still confused how an out-of-box driver would affect an inbox driver.

Comment 8 Chad Dupuis (Cavium) 2010-07-01 14:35:57 UTC
> I'm still confused how an out-of-box driver would affect an inbox driver.    

Even though the other driver is out of box, they both share the same queuecommand behavior.  The one-line patch listed above would also apply to the RHEL 5 inbox driver:

        /* Close window on fcport/rport state-transitioning. */
        if (fcport->drport) {
                cmd->result = DID_IMM_RETRY << 16;
                goto qc_fail_command;
        }

Also, the FC transport behavior would be the same in both instances.

Comment 9 Martin George 2010-07-07 11:46:30 UTC
(In reply to comment #3)
> Martin,
> 
> the test driver lalit sent to you has the following single
> line change:
> 

Andrew,

We've not hit the kernel panic with the external test driver (DVR:v8.03.01.07.05.06-k-test FW:v5.03.02) so far.

Comment 10 Lalit Chandivade 2010-07-07 12:05:47 UTC
(In reply to comment #9)
> (In reply to comment #3)
> > Martin,
> > 
> > the test driver lalit sent to you has the following single
> > line change:
> > 
> Andrew,
> We've not hit the kernel panic with the external test driver
> (DVR:v8.03.01.07.05.06-k-test FW:v5.03.02) so far.    

(In reply to comment #3)
> Martin,
> the test driver lalit sent to you has the following single
> line change:
> > diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> > index 15f1f79..08de61d 100644
> > --- a/drivers/scsi/qla2xxx/qla_os.c
> > +++ b/drivers/scsi/qla2xxx/qla_os.c
> > @@ -510,7 +510,7 @@ qla24xx_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
> >  	}
> >  
> >  	/* Close window on fcport/rport state-transitioning. */
> > -	if (fcport->drport) {
> > +	if (!fcport || fcport->drport) {
> >  		cmd->result = DID_IMM_RETRY << 16;
> >  		goto qc24_fail_command;
> >  	}
> but...that really just works around a larger problem, as the

Actually, the fix I provided earlier could lead to a system hang, because the command is retried immediately whenever fcport is NULL.

The correct workaround would be:

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 15f1f79..60f16b6 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -510,6 +510,11 @@ qla24xx_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
        }
 
        /* Close window on fcport/rport state-transitioning. */
+       if (!fcport) {
+               cmd->result = DID_NO_CONNECT << 16;
+               goto qc24_fail_command;
+       }
+
        if (fcport->drport) {
                cmd->result = DID_IMM_RETRY << 16;
                goto qc24_fail_command;