Bug 518522

Summary:	crash in qla2x00_abort_fcport_cmds
Product:	Red Hat Enterprise Linux 5	Reporter:	Issue Tracker <tao>
Component:	kernel	Assignee:	Chad Dupuis (Cavium) <cdupuis>
Status:	CLOSED CANTFIX	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	5.3	CC:	bdonahue, coughlan, cww, jwest, tao
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-09-10 14:39:20 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	502912, 600363

Description Issue Tracker 2009-08-20 17:39:53 UTC

Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2009-08-20 17:39:55 UTC

Event posted on 08-19-2009 07:29pm EDT by woodard

We are seeing a crash that exactly matches https://bugzilla.redhat.com/show_bug.cgi?id=465945, but we aren't yet sure what triggered it, and this is on a -128.1.x kernel, which already has the patch referenced in that BZ. So it seems there is another, harder-to-hit race condition that causes the same crash.


This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 332579

Comment 2 Issue Tracker 2009-08-20 17:39:56 UTC

Event posted on 08-19-2009 08:21pm EDT by woodard

Herb,

Do you have any idea what triggered the problem? There is a RH BZ that has
the exact same results but is a much easier-to-hit race condition. The
description and the patch are in
linux-2.6-scsi-fix-oops-after-trying-to-removing-rport-twice.patch:

<snip contents of patch and description>

However, that particular race condition has evidently been closed, and there
appears to be another, more subtle race condition that you have hit.

------- Comment #8 From Herb Wartens 2009-08-19 17:21:08 [reply] -------

Ben,
It looks to me like we are already running with that patch in place, and are
still able to hit this race.  I don't know the particulars about how we hit
this bug, except to say that we were probably just running our user-space
application (hpss tape mover) as usual.  This application reads/writes from/to
a block device that is exposed by the qla2xxx driver (on DDN disk).

So far I have only seen a single instance of this crash, so it may be a pretty
tight race.

------- Comment #10 From Ben Woodard 2009-08-19 17:30:28 [reply] -------

Herb, 

So you asked me what I think of the idea of just avoiding the null pointer
reference. I don't like that idea because it seems more likely that this will
just move the bug to a new place. The problem is that we are trying to abort a
command in two places.

I think that an appropriate response would be to try to figure out what
triggers this event and how frequently it is occurring. If it is common under
high-throughput utilization, that gives us clues as to where to look. If it
seldom happens, that is also a clue. What is causing the abort of the fc?

-ben

------- Comment #11 From Ben Woodard 2009-08-19 17:34:28 [reply] -------

> So far I have only seen a single instance of this crash, so may be a pretty
> tight race.

Over what amount of time?

It might also be triggered by some event on the fc. You don't just ABORT a scsi
command for no reason. Was there a hiccup on the fc? Were people playing with
cables? Was it at a particularly busy time on the node?





Comment 3 Issue Tracker 2009-08-20 17:39:58 UTC

Event posted on 08-19-2009 08:46pm EDT by woodard

------- Comment #12 From Herb Wartens 2009-08-19 17:56:24 [reply] -------

(In reply to comment #10)
> Herb, 
> 
> So you asked me what I think of the idea of just avoiding the null pointer
> reference. I don't like that idea because it seems more likely that this will
> just move the bug to a new place. The problem is that we are trying to abort a
> command in two places.
> 

Ben,
I totally agree that it would not be a fix.  I was thinking it could be a
workaround until we find the cause of the issue and a better solution.  I was
thinking this would be safe for the time being (i.e. would not manifest itself
as a different bug as you suggest) since qla2x00_dev_loss_tmo_callbk() sets:
*((fc_port_t **)rport->dd_data) = NULL;
which is the location fcport is read from:
fc_port_t *fcport = *(fc_port_t **)rport->dd_data;
So once the function returns we would expect the fc_port_t pointer stored
behind rport->dd_data to be NULL.  I was not suggesting this would be the
proper fix.  What do you think?
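
A minimal user-space sketch of the guard described above, for illustration
only. The names model_fcport, model_rport and model_teardown are hypothetical
stand-ins and not the qla2xxx definitions; the only detail taken from the
driver is that a pointer is stored behind rport->dd_data and set to NULL once
teardown has run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real fc_port_t and struct fc_rport are far larger. */
struct model_fcport {
    int teardowns;              /* counts how often teardown work ran */
};

struct model_rport {
    void *dd_data;              /* points at a slot holding a model_fcport pointer */
};

/*
 * Models a teardown path with the proposed guard: read the pointer stored
 * behind dd_data, bail out if an earlier call already cleared it, otherwise
 * do the work and clear the slot.
 * Returns 1 if work was done, 0 if the call was skipped.
 */
int model_teardown(struct model_rport *rport)
{
    struct model_fcport **slot;
    struct model_fcport *fcport;

    assert(rport != NULL && rport->dd_data != NULL);
    slot = (struct model_fcport **)rport->dd_data;
    fcport = *slot;

    if (!fcport)
        return 0;               /* reference already dropped: nothing to dereference */

    fcport->teardowns++;        /* stands in for aborting outstanding commands */
    *slot = NULL;               /* same idea as *((fc_port_t **)rport->dd_data) = NULL */
    return 1;
}
```

Calling model_teardown() twice on the same rport does the work once and
returns 0 the second time. The model is single-threaded, so it does not cover
two paths that both read the pointer before either one clears it.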

> I think that an appropriate response would be to try to figure out what
> triggers this event and how frequently it is occurring. If it is common in a
> high throughput utilization that gives us clues as to where to look. If it
> seldom happens that is also a clue. What is causing the abort of the fc?
> 
> -ben
> 

------- Comment #13 From Herb Wartens 2009-08-19 17:58:58 [reply] -------

(In reply to comment #11)
> > So far I have only seen a single instance of this crash, so may be a pretty
> > tight race.
> 
> Over what amount of time?
> 
> It might also be triggered by some event on the fc. You don't just ABORT a scsi
> command for no reason. Was there a hiccup on the fc? Were people playing with
> cables? Was it at a particularly busy time on the node?
> 

I don't really have much data about what really was happening on the node at
the time.  There were really no errors in the console log about scsi commands
getting aborted.  The day before I did see some strange errors, but I doubt
they have any relation:

2009-07-08 16:47:15 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:47:15 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:47:16 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:47:16 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:47:17 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:47:17 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:47:18 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:47:18 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
2009-07-08 16:48:53 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:48:53 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:48:53 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:48:53 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:48:54 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:48:54 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:48:55 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:48:55 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
2009-07-08 16:49:27 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:49:27 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:49:27 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:49:27 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:49:28 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:49:28 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:49:29 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:49:29 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
2009-07-08 16:49:49 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:49:49 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:49:49 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:49:49 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:49:50 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:49:50 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:49:51 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:49:51 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.
2009-07-08 16:53:41 qla2xxx 0000:09:00.0: Passthru CT response is not available.
2009-07-08 16:53:41 qla2xxx 0000:09:00.0: Passthru ELS response is not available.
2009-07-08 16:53:42 qla2xxx 0000:09:00.1: Passthru CT response is not available.
2009-07-08 16:53:42 qla2xxx 0000:09:00.1: Passthru ELS response is not available.
2009-07-08 16:53:43 qla2xxx 0000:0a:00.0: Passthru CT response is not available.
2009-07-08 16:53:43 qla2xxx 0000:0a:00.0: Passthru ELS response is not available.
2009-07-08 16:53:43 qla2xxx 0000:0a:00.1: Passthru CT response is not available.
2009-07-08 16:53:43 qla2xxx 0000:0a:00.1: Passthru ELS response is not available.

<ConMan> Console [toochase51] log at 2009-07-08 16:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-08 18:00:00 PDT.

<ConMan> Console [toochase51] log at 2009-07-08 18:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-08 19:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-08 20:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-08 22:00:00 PDT.

<ConMan> Console [toochase51] log at 2009-07-08 22:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-08 23:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-09 00:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-09 01:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-09 03:00:00 PDT.

<ConMan> Console [toochase51] log at 2009-07-09 03:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-09 04:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-09 05:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-09 06:59:59 PDT.

<ConMan> Console [toochase51] log at 2009-07-09 07:59:59 PDT.
2009-07-09 08:30:50 Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:

------- Comment #14 From Herb Wartens 2009-08-19 18:00:20 [reply] -------

I have only seen this crash on a single node over the course of about a month.





Comment 4 Issue Tracker 2009-08-20 17:40:00 UTC

Event posted on 08-19-2009 08:55pm EDT by woodard

------- Comment #15 From Ben Woodard 2009-08-19 18:16:56 [reply] -------

Herb, I agree that those messages are not related. They seem to be related to
accessing sysfs files and don't seem to be related to events on the fabric.





Comment 7 Chad Dupuis (Cavium) 2010-08-24 17:37:19 UTC

Has this issue been reproduced in either RHEL 5.5 or the RHEL 5.6 kernel?  This is from a few updates back and it would be good to know if it's been reproduced in the last year or so.

Comment 8 Chad Dupuis (Cavium) 2010-09-01 15:45:57 UTC

Another data point is that the function qla2x00_abort_fcport_cmds() is not present in the RHEL 5.6 beta kernels.  Instead, qla2x00_dev_loss_tmo_callbk() simply drops the driver's references to the rport and lets the transport and mid-layer deal with the outstanding commands:

static void
qla2x00_dev_loss_tmo_callbk(struct fc_rport *rport)
{
        scsi_qla_host_t *ha = NULL;
        struct Scsi_Host *host = rport_to_shost(rport);
        fc_port_t *fcport = *(fc_port_t **)rport->dd_data;

        if (!fcport)
                return;

        ha = to_qla_parent(fcport->ha);
        if (test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags)) {
                return;
        }

        if (unlikely(pci_channel_offline(fcport->ha->pdev))) {
                qla2x00_abort_all_cmds(fcport->ha, DID_NO_CONNECT << 16);
                return;
        }

        /*
         * At this point all fcport's software-states are cleared.  Perform any
         * final cleanup of firmware resources (PCBs and XCBs).
         */
        if (fcport->loop_id != FC_NO_LOOP_ID &&
            !test_bit(UNLOADING, &fcport->ha->dpc_flags)) {
                fcport->ha->isp_ops->fabric_logout(fcport->ha,
                        fcport->loop_id, fcport->d_id.b.domain,
                        fcport->d_id.b.area, fcport->d_id.b.al_pa);
                fcport->loop_id = FC_NO_LOOP_ID;
        }

        /*
         * Transport has effectively 'deleted' the rport, clear
         * all local references.
         */
        spin_lock_irq(host->host_lock);
        fcport->rport = NULL;
        *((fc_port_t **)rport->dd_data) = NULL;
        spin_unlock_irq(host->host_lock);
}

It's possible that no fix is needed in RHEL 5.6, as the semantics of the dev_loss_tmo processing are different.
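
One way to picture "forgetting" a reference so that at most one caller ever acts on it is to read and clear the pointer under a single lock. The sketch below is a user-space model only: model_fcport, model_rport, model_rport_init, model_detach and model_callback are hypothetical names, and a pthread mutex stands in for host->host_lock. It is not the driver's code; the callback quoted above reads fcport before taking the lock and only clears the references under it.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures. */
struct model_fcport {
    int cleanups;                   /* counts how often cleanup work ran */
};

struct model_rport {
    pthread_mutex_t lock;           /* stands in for host->host_lock */
    struct model_fcport *fcport;    /* stands in for the pointer behind rport->dd_data */
};

void model_rport_init(struct model_rport *rport, struct model_fcport *fcport)
{
    pthread_mutex_init(&rport->lock, NULL);
    rport->fcport = fcport;
}

/*
 * Take ownership of the reference: return the current pointer and leave
 * NULL behind, all under one lock, so only one caller can get non-NULL.
 */
struct model_fcport *model_detach(struct model_rport *rport)
{
    struct model_fcport *fcport;

    pthread_mutex_lock(&rport->lock);
    fcport = rport->fcport;
    rport->fcport = NULL;
    pthread_mutex_unlock(&rport->lock);
    return fcport;
}

/* Signature matches a pthread start routine, so callers may race on it. */
void *model_callback(void *arg)
{
    struct model_fcport *fcport = model_detach((struct model_rport *)arg);

    if (fcport)
        fcport->cleanups++;         /* reached only by the caller that won the detach */
    return NULL;
}
```

Invoking model_callback() twice on the same model_rport, whether sequentially or from two threads, runs the cleanup once; the second caller gets NULL from model_detach() and returns without touching the port.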

Comment 9 Tom Coughlan 2010-09-02 17:39:58 UTC

(In reply to comment #7)
> Has this issue been reproduced in either RHEL 5.5 or the RHEL 5.6 kernel?  This
> is from a few updates back and it would be good to know if it's been reproduced
> in the last year or so.

Please re-test and report the results here.

Comment 12 Tom Coughlan 2010-09-10 14:39:20 UTC

The customer reports that they are no longer using the QLogic driver that ships with RHEL. They have not seen the problem recently, and they could not reproduce the issue at will. Closing this BZ.