Description of problem:
RHEL 5.5 host with QLogic FC adapter & external driver (due to bug 598946) panics during FC switch port enable/disable as shown below:

sd 0:0:1:49: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdgt, sector 6310576
sd 0:0:1:49: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdgt, sector 504256
sd 0:0:1:49: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdgt, sector 505088
Unable to handle kernel NULL pointer dereference at 0000000000000060 RIP:
 [<ffffffff880ce45d>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dc
PGD 5f386067 PUD 5ee62067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /block/dm-21/dev
CPU 0
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy sg ide_cd e752x_edac edac_mc cdrom pcspkr i2c_i801 i2c_core tg3 serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp qla2xxx(U) scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 433, comm: scsi_wq_0 Tainted: G 2.6.18-194.3.1.el5 #1
RIP: 0010:[<ffffffff880ce45d>] [<ffffffff880ce45d>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dc
RSP: 0018:ffff81007e10ba50 EFLAGS: 00010002
RAX: 0000000000000002 RBX: ffff81001f3f2680 RCX: 0000000000000190
RDX: ffff81007e38f000 RSI: ffffffff880755a6 RDI: ffff81007e38f060
RBP: ffff81007ff504f8 R08: 0000000000000282 R09: 0000000000000000
R10: ffff81001f3f2740 R11: 0000000000000060 R12: ffff81001f3f2680
R13: ffff81007ff504f8 R14: 0000000000000000 R15: ffffffff880755a6
FS: 0000000000000000(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000060 CR3: 000000005e7fe000 CR4: 00000000000006e0
Process scsi_wq_0 (pid: 433, threadinfo ffff81007e10a000, task ffff810037fe1080)
Stack: ffff810076e8ec98 ffff81001f3f2680 ffff81007ff50000 0000000000000287
 ffff810076e8ec98 ffff810023264e38 ffff810076e8ec98 ffffffff88075c61
 ffff81007bc991d8 ffff81001f3f2680 ffff81007bc99000 ffff81007ff50000
Call Trace:
 [<ffffffff88075c61>] :scsi_mod:scsi_dispatch_cmd+0x26e/0x2ff
 [<ffffffff8807b174>] :scsi_mod:scsi_request_fn+0x2c1/0x390
 [<ffffffff80144be6>] blk_execute_rq_nowait+0x86/0x9a
 [<ffffffff80144c8a>] blk_execute_rq+0x90/0xc0
 [<ffffffff8807abbb>] :scsi_mod:scsi_execute+0xd1/0xeb
 [<ffffffff8807ac7a>] :scsi_mod:scsi_execute_req+0xa5/0xce
 [<ffffffff8807bf6e>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9
 [<ffffffff8807ac4d>] :scsi_mod:scsi_execute_req+0x78/0xce
 [<ffffffff8807d189>] :scsi_mod:__scsi_scan_target+0x58a/0x5c7
 [<ffffffff8008c871>] dequeue_task+0x18/0x37
 [<ffffffff8807d46f>] :scsi_mod:scsi_scan_target+0x6c/0x83
 [<ffffffff880b7267>] :scsi_transport_fc:fc_scsi_scan_rport+0x0/0x85
 [<ffffffff880b72cc>] :scsi_transport_fc:fc_scsi_scan_rport+0x65/0x85
 [<ffffffff8004d8f0>] run_workqueue+0x94/0xe4
 [<ffffffff8004a12b>] worker_thread+0x0/0x122
 [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a21b>] worker_thread+0xf0/0x122
 [<ffffffff8008d087>] default_wake_function+0x0/0xe
 [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032894>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032796>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 49 83 7e 60 00 0f 85 10 ff ff ff e9 1c ff ff ff 5e 5b 5d 41
RIP [<ffffffff880ce45d>] :qla2xxx:qla24xx_queuecommand+0x1be/0x1dc
 RSP <ffff81007e10ba50>
CR2: 0000000000000060
 <0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
RHEL 5.5 Errata v2.6.18-194.3.1.el5
QLE2562 FW:v5.03.02 DVR:v8.03.01.06.05.06-k

How reproducible:
Intermittent.

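The `return code = 0x00010000` messages logged just before the oops can be decoded by hand. The sketch below is not part of the original report; it mirrors the kernel's `host_byte()` macro, which takes the host byte from bits 16..23 of the command result. Read that way, 0x00010000 carries host byte 0x01, which is DID_NO_CONNECT in the Linux SCSI headers; the same packing is why the patches in this thread write `DID_IMM_RETRY << 16`. Only the three DID_* values shown are taken from those headers; `host_byte_name()` and `print_result()` are hypothetical helpers added for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Host-byte values as defined in the Linux SCSI headers (subset only). */
enum {
    DID_OK         = 0x00,
    DID_NO_CONNECT = 0x01,
    DID_IMM_RETRY  = 0x0c,
};

/* Same extraction as the kernel's host_byte() macro: the host byte lives
 * in bits 16..23 of the 32-bit command result. */
unsigned host_byte(uint32_t result)
{
    return (result >> 16) & 0xff;
}

/* Hypothetical helper: symbolic name for the few codes listed above. */
const char *host_byte_name(unsigned hb)
{
    switch (hb) {
    case DID_OK:         return "DID_OK";
    case DID_NO_CONNECT: return "DID_NO_CONNECT";
    case DID_IMM_RETRY:  return "DID_IMM_RETRY";
    default:             return "other";
    }
}

/* Hypothetical helper: decode a logged value such as 0x00010000. */
void print_result(uint32_t result)
{
    unsigned hb = host_byte(result);
    printf("result 0x%08x: host byte 0x%02x (%s)\n",
           (unsigned)result, hb, host_byte_name(hb));
}
```

For example, `print_result(0x00010000)` prints `result 0x00010000: host byte 0x01 (DID_NO_CONNECT)`.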
Red Hat has no means to test on external drivers. If another bugzilla already reported this with the inbox driver, this can be closed.

Let's have QLogic look into the inbox driver issue first.

*** This bug has been marked as a duplicate of bug 598946 ***

Martin,

the test driver lalit sent to you has the following single line change:

> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 15f1f79..08de61d 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -510,7 +510,7 @@ qla24xx_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
>      }
>
>      /* Close window on fcport/rport state-transitioning. */
> -    if (fcport->drport) {
> +    if (!fcport || fcport->drport) {
>          cmd->result = DID_IMM_RETRY << 16;
>          goto qc24_fail_command;
>      }

but...that really just works around a larger problem, as the fcport is derived from the scsi-device's hostdata scratchpad:

static int
qla2x00_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
{
    scsi_qla_host_t *ha = to_qla_host(cmd->device->host);
    fc_port_t *fcport = (struct fc_port *) cmd->device->hostdata;
    struct fc_rport *rport = starget_to_rport(scsi_target(cmd->device));
    srb_t *sp;
    int rval;

hostdata is cleared only when slave_destroy() is called by the midlayer:

static void
qla2xxx_slave_destroy(struct scsi_device *sdev)
{
    sdev->hostdata = NULL;
}

I wouldn't expect the midlayer to send down (via queuecommand()) requests for a reaped scsi-device. We can add the workaround code, but we'd need to understand why the midlayer is sending these scsi-commands down in the first place.

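The hostdata lifecycle described in the comment above can be modelled in user space. This is a minimal sketch with heavily simplified stand-in structures (not the real kernel definitions), and it assumes the driver stores its fcport pointer in `hostdata` when the device is set up, a step the comment does not show. `model_slave_alloc()`, `model_slave_destroy()` and `model_fcport_of()` are hypothetical names. In this model the `fcport` that queuecommand() reads becomes NULL only once the teardown hook has run, i.e. only for a device the midlayer has already reaped.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct fc_port { int id; };               /* placeholder field only */
struct scsi_device { void *hostdata; };   /* driver scratchpad */
struct scsi_cmnd { struct scsi_device *device; };

/* Assumed setup step: the driver stashes its fcport in hostdata. */
void model_slave_alloc(struct scsi_device *sdev, struct fc_port *fcport)
{
    sdev->hostdata = fcport;
}

/* Models qla2xxx_slave_destroy(): clears the scratchpad on teardown. */
void model_slave_destroy(struct scsi_device *sdev)
{
    sdev->hostdata = NULL;
}

/* Models the first lines of queuecommand(): fcport is read straight from
 * hostdata, with no other source of truth. */
struct fc_port *model_fcport_of(const struct scsi_cmnd *cmd)
{
    return (struct fc_port *)cmd->device->hostdata;
}
```

A command dispatched after `model_slave_destroy()` gets NULL back, and any field read through it faults. That fits the oops, where the faulting address is 0x60 (CR2) and R14 is 0, although the sketch does not prove which field lives at that offset.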
This bz should probably be reopened as it's actually not a duplicate of 598946.
(In reply to comment #4)
> This bz should probably be reopened as it's actually not a duplicate of 598946.

We don't usually troubleshoot out-of-box drivers, so although this is CLOSED as a dupe, it should really be CLOSED WONTFIX. The only reason this was closed as a dupe was because we were under the impression a firmware update would clear up both inbox and out-of-box drivers.

> We don't usually troubleshoot out-of-box drivers, so although this is CLOSED as
> a dupe, it should really be CLOSED WONTFIX. The only reason this was closed as
> a dupe was because we were under the impression a firmware update would clear
> up both inbox and out-of-box drivers.

Our concern is that this could also affect the RHEL 5.6 inbox driver. Would it be more appropriate to open another bz for RHEL 5.6?

> Our concern is that this could also affect the RHEL 5.6 inbox driver. Would it
> be more appropriate to open another bz for RHEL 5.6?

I'm still confused about how an out-of-box driver would affect an inbox driver.

> I'm still confused about how an out-of-box driver would affect an inbox driver.

Even though the other driver is out-of-box, both drivers share the same queuecommand() behavior. The one-line patch listed above would also apply to the RHEL 5 inbox driver:

    /* Close window on fcport/rport state-transitioning. */
    if (fcport->drport) {
        cmd->result = DID_IMM_RETRY << 16;
        goto qc_fail_command;
    }

Also, the FC transport behavior would be the same in both instances.

(In reply to comment #3)
> Martin,
>
> the test driver lalit sent to you has the following single
> line change:
>

Andrew,
We've not hit the kernel panic with the external test driver (DVR:v8.03.01.07.05.06-k-test FW:v5.03.02) so far.

(In reply to comment #9)
> (In reply to comment #3)
> > Martin,
> >
> > the test driver lalit sent to you has the following single
> > line change:
> >
> Andrew,
> We've not hit the kernel panic with the external test driver
> (DVR:v8.03.01.07.05.06-k-test FW:v5.03.02) so far.

(In reply to comment #3)
> Martin,
> the test driver lalit sent to you has the following single
> line change:
> > diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> > index 15f1f79..08de61d 100644
> > --- a/drivers/scsi/qla2xxx/qla_os.c
> > +++ b/drivers/scsi/qla2xxx/qla_os.c
> > @@ -510,7 +510,7 @@ qla24xx_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
> >      }
> >
> >      /* Close window on fcport/rport state-transitioning. */
> > -    if (fcport->drport) {
> > +    if (!fcport || fcport->drport) {
> >          cmd->result = DID_IMM_RETRY << 16;
> >          goto qc24_fail_command;
> >      }
> but...that really just works around a larger problem, as the

Actually, the fix I provided earlier could lead to a system hang, because with it the command is immediately retried (DID_IMM_RETRY) whenever fcport is NULL. The correct workaround would be:

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 15f1f79..60f16b6 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -510,6 +510,11 @@ qla24xx_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
     }
 
     /* Close window on fcport/rport state-transitioning. */
+    if (!fcport) {
+        cmd->result = DID_NO_CONNECT << 16;
+        goto qc24_fail_command;
+    }
+
     if (fcport->drport) {
         cmd->result = DID_IMM_RETRY << 16;
         goto qc24_fail_command;
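Why DID_NO_CONNECT is the safer answer for a NULL fcport can be illustrated with a toy model. This is an illustration only, not the SCSI midlayer's actual retry logic: a hypothetical dispatcher keeps requeueing a command while its host byte is DID_IMM_RETRY. Because a reaped device never gets its fcport back, the IMM_RETRY variant never makes progress and only stops at the artificial cap, while DID_NO_CONNECT completes the command on the first dispatch. The `fc_port` stand-in, the `enum patch` selector and the `max_dispatches` cap are assumptions for the sketch; only the DID_* values come from the Linux SCSI headers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Host-byte values as in the Linux SCSI headers. */
#define DID_NO_CONNECT 0x01
#define DID_IMM_RETRY  0x0c

/* Hypothetical stand-in for the driver's per-port state. */
struct fc_port { bool drport; };

/* Which NULL-fcport handling to model: first patch vs. corrected patch. */
enum patch { PATCH_IMM_RETRY, PATCH_NO_CONNECT };

/* Result the modelled queuecommand() reports; 0 means "issued to HBA". */
static int model_queuecommand(const struct fc_port *fcport, enum patch p)
{
    if (!fcport)
        return (p == PATCH_IMM_RETRY ? DID_IMM_RETRY : DID_NO_CONNECT) << 16;
    if (fcport->drport)
        return DID_IMM_RETRY << 16;
    return 0;
}

/* Toy dispatcher: requeue while the host byte is DID_IMM_RETRY, stopping
 * after max_dispatches attempts so the model terminates. Returns the number
 * of dispatches and stores the last result in *final_result. */
int dispatch_until_done(const struct fc_port *fcport, enum patch p,
                        int max_dispatches, int *final_result)
{
    int n = 0;
    int result;

    do {
        result = model_queuecommand(fcport, p);
        n++;
    } while (((result >> 16) & 0xff) == DID_IMM_RETRY && n < max_dispatches);

    *final_result = result;
    return n;
}
```

With a NULL fcport, `dispatch_until_done(NULL, PATCH_IMM_RETRY, 1000, &r)` uses up all 1000 dispatches, whereas `PATCH_NO_CONNECT` returns after one with host byte DID_NO_CONNECT.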