Bug 127066 - Panic is occurring in the I/O completion interrupt handling for the character interface driver (sg).
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Tom Coughlan
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 132991
 
Reported: 2004-07-01 13:14 UTC by Heather Conway
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-18 13:27:41 UTC
Target Upstream Version:
Embargoed:


Attachments
patch for RHEL 3.0 U2 sg.cb (1.50 KB, text/plain)
2004-07-01 13:19 UTC, Heather Conway


Links
System: Red Hat Product Errata
ID: RHSA-2005:294
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 5
Last Updated: 2005-05-18 04:00:00 UTC

Description Heather Conway 2004-07-01 13:14:56 UTC
Description of problem:
A panic occurs in the I/O completion interrupt handling for the
character interface driver (sg).  The cause is a race condition.

The sg I/O completion bottom-half handler, sg_cmd_done_bh(), wakes up
a process via wake_up_interruptible() before calling kill_fasync() to
clean up any async helper structure.  Under rare conditions on a
multi-processor host, the process awoken via wake_up_interruptible()
may be able to both call sg_read() to retrieve the completed sg I/O
and call sg_release() to terminate the established sg session before
the sg I/O completion handling code gets to call kill_fasync().

In this case, the sg_release() code de-allocates the memory used for
the sg session data structure via sg_fasync() calling fasync_helper(),
possibly even returning the physical page and unmapping the virtual
address to the page.  Any attempt to dereference through this address
in kill_fasync() will panic the host.
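
The ordering described above can be illustrated with a small user-space
model.  This is only a sketch, not the driver code: struct model_session
and the model_* functions are hypothetical names invented here, and the
flag waiter_runs_immediately stands in for the rare multi-processor
interleaving in which the woken process gets through sg_read() and
sg_release() before the handler reaches kill_fasync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-open sg session state.
 * This is NOT the kernel's structure; it only tracks validity. */
struct model_session {
    bool async_allocated; /* async helper state is still valid */
    bool open;            /* session not yet torn down by release */
};

/* Models the woken process: sg_read() collects the result, then
 * sg_release() tears the session down and frees the async state. */
static void model_read_then_release(struct model_session *s)
{
    assert(s != NULL);
    s->async_allocated = false;
    s->open = false;
}

/* Models the completion handler in the order the report describes:
 * wake the waiter first, notify async listeners second.
 * Returns true when the kill_fasync() step would touch freed state. */
static bool model_completion_wake_first(struct model_session *s,
                                        bool waiter_runs_immediately)
{
    assert(s != NULL);

    /* Step 1: wake_up_interruptible() makes the waiter runnable. */
    if (waiter_runs_immediately)
        model_read_then_release(s);

    /* Step 2: kill_fasync() needs the async state to still exist. */
    return !s->async_allocated;
}
```

In this model, model_completion_wake_first() returns false in the common
case and true only when the waiter wins the race, which corresponds to
the dereference of freed memory reported here.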



Comment 1 Heather Conway 2004-07-01 13:19:23 UTC
Created attachment 101559
patch for RHEL 3.0 U2 sg.cb

Comment 3 Tom Coughlan 2004-07-28 18:52:31 UTC
The patch looks okay to me.  I would like to have it reviewed on the
linux-scsi list, and incorporated upstream if possible.  Would the
author of this patch be willing to post this to linux-scsi?  (Use a
unified diff when posting to the Linux lists.)

If not, I will post it. If it passes review, the patch will be in U4.

Tom

Comment 4 Heather Conway 2004-09-24 14:30:55 UTC
Would you please take the lead on this issue and post the patch for 
me Tom?
Thanks

Comment 5 Tom Coughlan 2004-09-27 20:23:51 UTC
Stephen Tweedie reviewed the patch, and does not think it is a
sufficient fix:

It's an interruptible task, so there's nothing at all to stop
user-space from signalling or timing out or otherwise continuing on
its own independently of the wake_up_interruptible().  In that case,
there's nothing to stop the race happening *before* we take the copy
of sfp->async_qp. 

All we're doing here is fixing the most likely cause of the
wakeup/read()/sg_release().  The patch seems to be saying that there
are plenty of other ways of reaching the same race which are not
addressed by the patch.  Don't we really need to move the
kill_fasync() up to within the locking, before we let go of the
command completely?
---

I am looking at whether we can find a suitable solution before U4 freezes.

Tom
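
The review point in Comment 5 can be sketched with a user-space model.
It is not the attached patch or the upstream fix; every name below is
hypothetical.  The model assumes that session teardown is deferred while
the completion path still holds the command, which is what signalling
"within the locking" relies on.  RELEASE_EARLY stands for a process that
continues because of a signal or timeout rather than the wakeup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* When the user process tears the session down, relative to the handler. */
enum release_point { RELEASE_NEVER, RELEASE_EARLY, RELEASE_AFTER_WAKE };

/* Hypothetical stand-in for the sg session; not the kernel's structure. */
struct model_session {
    bool command_held;    /* completion path has not let go of the command */
    bool async_allocated; /* async helper state is still valid */
};

/* Model assumption: teardown is deferred while the command is held. */
static bool model_try_release(struct model_session *s)
{
    assert(s != NULL);
    if (s->command_held)
        return false;
    s->async_allocated = false;
    return true;
}

/* Snapshot-then-wake: closes only the window after the wakeup.
 * Returns true if async state was already gone when it was read. */
static bool model_snapshot_then_wake(struct model_session *s,
                                     enum release_point when)
{
    bool stale;

    assert(s != NULL);
    s->command_held = false;         /* command let go, lock dropped */
    if (when == RELEASE_EARLY)
        model_try_release(s);        /* signal or timeout, no wakeup needed */
    stale = !s->async_allocated;     /* taking the copy reads session state */
    if (when == RELEASE_AFTER_WAKE)
        model_try_release(s);        /* woken reader releases the session */
    return stale;                    /* signalling uses the earlier copy */
}

/* Signal-under-lock: async notification happens while the command is held. */
static bool model_signal_under_lock(struct model_session *s,
                                    enum release_point when)
{
    bool stale;

    assert(s != NULL);
    if (when == RELEASE_EARLY)
        model_try_release(s);        /* deferred: command still held */
    stale = !s->async_allocated;     /* kill_fasync() step, under the lock */
    s->command_held = false;         /* only now let go of the command */
    if (when != RELEASE_NEVER)
        model_try_release(s);        /* teardown proceeds after the signal */
    return stale;
}
```

Under these assumptions, the snapshot variant still reads stale state for
RELEASE_EARLY, while the signal-under-lock variant does not for any of
the three release points.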

Comment 6 Tom Coughlan 2004-11-11 12:36:41 UTC
A patch has been proposed upstream, and looks like it will be accepted.

http://marc.theaimsgroup.com/?l=linux-scsi&m=109936088901128&w=2

Please test to confirm this fixes your problem.  This is a candidate
for RHEL 3 U5.

Comment 8 Heather Conway 2005-01-28 17:03:07 UTC
This bug was found when PowerPath contained a volume manager, and the
problem isn't being replicated.  The PowerPath team is reviewing the
code and will provide an update.

Comment 9 Heather Conway 2005-01-28 17:04:29 UTC
To clarify, there is no longer a volume manager included in the 
PowerPath package and the problem isn't being replicated.  

Comment 10 Ernie Petrides 2005-02-15 08:42:41 UTC
A fix for this problem has just been committed to the RHEL3 U5
patch pool this evening (in kernel version 2.4.21-27.13.EL).


Comment 11 Tim Powers 2005-05-18 13:27:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html


