Bug 518522
Summary: | crash in qla2x00_abort_fcport_cmds | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Issue Tracker <tao> |
Component: | kernel | Assignee: | Chad Dupuis (Cavium) <cdupuis> |
Status: | CLOSED CANTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 5.3 | CC: | bdonahue, coughlan, cww, jwest, tao |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2010-09-10 14:39:20 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 502912, 600363 |
Description
Issue Tracker
2009-08-20 17:39:53 UTC
Event posted on 08-19-2009 07:29pm EDT by woodard

We are seeing a crash that exactly matches https://bugzilla.redhat.com/show_bug.cgi?id=465945, but we aren't yet sure what triggered it, and this is on a -128.1.x kernel, which already has the patch referenced in the bz. So it seems there is another race condition, harder to hit, that causes the same thing.

This event sent from IssueTracker by kbaxley [LLNL (HPC)]
issue 332579

Event posted on 08-19-2009 08:21pm EDT by woodard

Herb,

Do you have any idea what triggered the problem? There is a RH BZ that has the exact same results but is a much easier-to-hit race condition. The description and the patch are in linux-2.6-scsi-fix-oops-after-trying-to-removing-rport-twice.patch:

<snip contents of patch and description>

That particular race condition has evidently been closed, but there appears to be another, more subtle race condition that you have hit.

------- Comment #8 From Herb Wartens 2009-08-19 17:21:08 -------

Ben,

It looks to me like we are already running with that patch in place, and are still able to hit this race. I don't know the particulars about how we hit this bug, except to say that we were probably just running our user-space application (hpss tape mover) as usual. This application reads/writes from/to a block device that is exposed by the qla2xxx driver (on DDN disk).

So far I have only seen a single instance of this crash, so it may be a pretty tight race.

------- Comment #10 From Ben Woodard 2009-08-19 17:30:28 -------

Herb,

So you asked me what I think of the idea of just avoiding the null pointer reference. I don't like that idea because it seems more likely that this will just move the bug to a new place. The problem is that we are trying to abort a command in two places.

I think that an appropriate response would be to try to figure out what triggers this event and how frequently it is occurring.
If it is common under high-throughput utilization, that gives us clues as to where to look. If it seldom happens, that is also a clue. What is causing the abort of the fc?

-ben

------- Comment #11 From Ben Woodard 2009-08-19 17:34:28 -------

> So far I have only seen a single instance of this crash, so may be a pretty
> tight race.

Over what amount of time?

It might also be triggered by some event on the fc. You don't just ABORT a scsi command for no reason. Was there a hiccup on the fc? Were people playing with cables? Was it at a particularly busy time on the node?

This event sent from IssueTracker by kbaxley [LLNL (HPC)]
issue 332579

Event posted on 08-19-2009 08:46pm EDT by woodard

------- Comment #12 From Herb Wartens 2009-08-19 17:56:24 -------

(In reply to comment #10)
> Herb,
>
> So you asked me what I think of the idea of just avoiding the null pointer
> reference. I don't like that idea because it seems more likely that this will
> just move the bug to a new place. The problem is that we are trying to abort a
> command in two places.
>

Ben,

I totally agree that it would not be a fix. I was thinking it could be a workaround until we find the cause of the issue and a better solution. I was thinking this would be safe for the time being (i.e. would not manifest itself as a different bug as you suggest) since qla2x00_dev_loss_tmo_callbk() sets:

    *((fc_port_t **)rport->dd_data) = NULL;

which is what fcport was set to:

    fc_port_t *fcport = *(fc_port_t **)rport->dd_data;

So once the function returns we would expect that rport->dd_data would be set to NULL. I was not suggesting this would be the proper fix. What do you think?

> I think that an appropriate response would be to try to figure out what
> triggers this event and how frequently it is occurring. If it is common in a
> high throughput utilization that gives us clues as to where to look. If it
> seldom happens that is also a clue. What is causing the abort of the fc?
>
> -ben
>

------- Comment #13 From Herb Wartens 2009-08-19 17:58:58 -------

(In reply to comment #11)
> > So far I have only seen a single instance of this crash, so may be a pretty
> > tight race.
>
> Over what amount of time?
>
> It might also be triggered by some event on the fc. You don't just ABORT scsi
> command for no reason. Was there a hiccup on the fc? Were people playing with
> cables? Was it at a particularly busy time on the node?
>

I don't really have much data about what was happening on the node at the time. There were no errors in the console log about scsi commands getting aborted. The day before I did see some strange errors, but I doubt they have any relation:

2009-07-08 16:47:15 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:47:15 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:47:16 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:47:16 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:47:17 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:47:17 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:47:18 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:47:18 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
2009-07-08 16:48:53 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:48:53 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:48:53 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:48:53 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:48:54 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:48:54 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:48:55 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:48:55 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
2009-07-08 16:49:27 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:49:27 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:49:27 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:49:27 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:49:28 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:49:28 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:49:29 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:49:29 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
2009-07-08 16:49:49 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:49:49 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:49:49 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:49:49 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:49:50 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:49:50 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:49:51 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:49:51 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
2009-07-08 16:53:41 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:53:41 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:53:42 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:53:42 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:53:43 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:53:43 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:53:43 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:53:43 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
<ConMan> Console [toochase51] log at 2009-07-08 16:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-08 18:00:00 PDT.
<ConMan> Console [toochase51] log at 2009-07-08 18:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-08 19:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-08 20:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-08 22:00:00 PDT.
<ConMan> Console [toochase51] log at 2009-07-08 22:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-08 23:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-09 00:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-09 01:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-09 03:00:00 PDT.
<ConMan> Console [toochase51] log at 2009-07-09 03:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-09 04:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-09 05:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-09 06:59:59 PDT.
<ConMan> Console [toochase51] log at 2009-07-09 07:59:59 PDT.
2009-07-09 08:30:50 Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:

------- Comment #14 From Herb Wartens 2009-08-19 18:00:20 -------

I have only seen this crash on a single node over the course of about a month.

This event sent from IssueTracker by kbaxley [LLNL (HPC)]
issue 332579

Event posted on 08-19-2009 08:55pm EDT by woodard

------- Comment #15 From Ben Woodard 2009-08-19 18:16:56 -------

Herb,

I agree that those messages are not related. They seem to be related to accessing sysfs files and don't seem to be related to events on the fabric.

This event sent from IssueTracker by kbaxley [LLNL (HPC)]
issue 332579

Has this issue been reproduced in either RHEL 5.5 or the RHEL 5.6 kernel? This is from a few updates back and it would be good to know if it's been reproduced in the last year or so.
Another data point is that the function qla2x00_abort_fcport_cmds() is not present in the RHEL 5.6 beta kernels. Instead, qla2x00_dev_loss_tmo_callbk() simply makes the driver drop its references to the rport and lets the transport and mid-layer deal with the outstanding commands.

    static void
    qla2x00_dev_loss_tmo_callbk(struct fc_rport *rport)
    {
        scsi_qla_host_t *ha = NULL;
        struct Scsi_Host *host = rport_to_shost(rport);
        fc_port_t *fcport = *(fc_port_t **)rport->dd_data;

        if (!fcport)
            return;

        ha = to_qla_parent(fcport->ha);

        if (test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags)) {
            return;
        }

        if (unlikely(pci_channel_offline(fcport->ha->pdev))) {
            qla2x00_abort_all_cmds(fcport->ha, DID_NO_CONNECT << 16);
            return;
        }

        /*
         * At this point all fcport's software-states are cleared. Perform any
         * final cleanup of firmware resources (PCBs and XCBs).
         */
        if (fcport->loop_id != FC_NO_LOOP_ID &&
            !test_bit(UNLOADING, &fcport->ha->dpc_flags)) {
            fcport->ha->isp_ops->fabric_logout(fcport->ha,
                fcport->loop_id, fcport->d_id.b.domain,
                fcport->d_id.b.area, fcport->d_id.b.al_pa);
            fcport->loop_id = FC_NO_LOOP_ID;
        }

        /*
         * Transport has effectively 'deleted' the rport, clear
         * all local references.
         */
        spin_lock_irq(host->host_lock);
        fcport->rport = NULL;
        *((fc_port_t **)rport->dd_data) = NULL;
        spin_unlock_irq(host->host_lock);
    }

It's possible that no fix needs to be made to RHEL 5.6, as the semantics of the dev_loss_tmo processing are different.

(In reply to comment #7)
> Has this issue been reproduced in either RHEL 5.5 or the RHEL 5.6 kernel? This
> is from a few updates back and it would be good to know if it's been reproduced
> in the last year or so.

Please re-test and report the results here.

The customer reports that they are no longer using the QLogic driver that ships with RHEL. They have not seen the problem recently, and they could not reproduce the issue at will. Closing this BZ.
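
For reference, the NULL guard proposed as a workaround in comment #12, which is also the first check in the RHEL 5.6 qla2x00_dev_loss_tmo_callbk() quoted above, can be modelled in user space. This is a minimal sketch, not driver code: model_fcport, model_rport, model_dev_loss_callbk() and MODEL_NO_LOOP_ID are hypothetical stand-ins for fc_port_t, struct fc_rport, the callback and FC_NO_LOOP_ID, and everything except the pointer handling is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NO_LOOP_ID (-1)   /* hypothetical stand-in for FC_NO_LOOP_ID */

struct model_rport;             /* forward declaration */

/* Hypothetical, stripped-down stand-in for the driver's fc_port_t. */
struct model_fcport {
    int loop_id;
    struct model_rport *rport;
};

/* Hypothetical stand-in for struct fc_rport: dd_data points at a slot
 * that stores the fcport pointer, mirroring the double indirection
 * *(fc_port_t **)rport->dd_data quoted in the thread. */
struct model_rport {
    void *dd_data;
};

/* Returns 1 if this call performed the cleanup, 0 if the references had
 * already been cleared by an earlier call. */
int model_dev_loss_callbk(struct model_rport *rport)
{
    struct model_fcport **slot;
    struct model_fcport *fcport;

    assert(rport != NULL && rport->dd_data != NULL);
    slot = (struct model_fcport **)rport->dd_data;
    fcport = *slot;

    if (!fcport)            /* second caller: nothing left to clean up */
        return 0;

    fcport->loop_id = MODEL_NO_LOOP_ID;
    fcport->rport = NULL;   /* clear the port's reference to the rport */
    *slot = NULL;           /* clear the rport's reference to the port */
    return 1;
}
```

Called twice on the same rport, the first call does the cleanup and returns 1; the second finds the slot already cleared and returns 0 instead of dereferencing NULL. The model is single-threaded, so it says nothing about two paths that both pass the check before either one clears the slot. That window is what the objection in comment #10 is about, and the callback quoted above takes host_lock only around the clearing step.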