Bug 141262 - (IT_36994) The scsi error recovery thread may not get waken up properly
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: All Linux
Priority: medium  Severity: medium
Assigned To: Brian Maly
QA Contact: Brian Brock
Reported: 2004-11-29 16:39 EST by Wendy Cheng
Modified: 2010-10-21 22:43 EDT
Doc Type: Bug Fix
Last Closed: 2005-05-16 11:03:49 EDT
Description Wendy Cheng 2004-11-29 16:39:01 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1)
Gecko/20020830

Description of problem:

IT#36994

The SCSI error recovery thread (scsi_error_handler) is woken up in
scsi_bottom_half_handler() based on the following logic
(drivers/scsi/scsi.c):


    if (atomic_read(&SCpnt->host->host_busy) ==
                                 SCpnt->host->host_failed) {
    ....
        up(SCpnt->host->eh_wait);
    }

However, a race condition, depicted in the following flow, can cause the
wakeup of scsi_error_handler (via the eh_wait semaphore) to be missed:

write I/O ----------->
                    | host->host_busy++
                    +----------------->
                                      +------------------>
                                      -------------------+ success
                    ------------------+ success
                    |
         <----------+

write I/O ----------->
                    | host->host_busy++
                    |

                                      <------------------ Interrupt
                    <------------------ with error
                    | host->host_failed++
                    | host->in_recovery
                    | if(host->host_busy == host->host_failed)
                    |    then wakeup error recovery thread;
                    |  o The busy & failed not equal.
                    ------------------>
                                      ------------------->

                    | o detect the error
                    | host->host_busy--;
                    |
         <----------+

                    x don't start error recovery thread



Version-Release number of selected component (if applicable):
2.4.9-e.51

How reproducible:
Sometimes

Steps to Reproduce:
1. A SCSI command is in the scsi_dispatch_cmd function.
2. An error occurs.
3. The SCSI command is requeued by scsi_mlqueue_insert.

    

Actual Results:  The error recovery thread is not started.

Expected Results:  The error recovery thread is started.

Additional info:
Comment 1 Wendy Cheng 2004-11-29 16:44:41 EST
Prototype patch from Fujitsu:

diff -Nur linux.org/drivers/scsi/scsi.c linux/drivers/scsi/scsi.c
--- linux.org/drivers/scsi/scsi.c       Mon Feb 23 17:20:29 2004
+++ linux/drivers/scsi/scsi.c   Mon Feb 23 18:00:28 2004
@@ -692,6 +693,14 @@
               return 1;
       }
       */
+
+       if (host->in_recovery) {
+               scsi_delete_timer(SCpnt);
+               scsi_mlqueue_insert(SCpnt, SCSI_MLQUEUE_HOST_BUSY);
+               SCSI_LOG_MLQUEUE(3, printk("scsi_dispatch_cmd : request rejected\n"));
+               return 1;
+       }
+
       if (host->can_queue) {
               SCSI_LOG_MLQUEUE(3, printk("queuecommand : routine at %p\n",
                                          host->hostt->queuecommand));
diff -Nur linux.org/drivers/scsi/scsi_queue.c linux/drivers/scsi/scsi_queue.c
--- linux.org/drivers/scsi/scsi_queue.c Mon Feb 23 17:20:29 2004
+++ linux/drivers/scsi/scsi_queue.c     Mon Feb 23 17:22:03 2004
@@ -140,6 +140,12 @@
       atomic_dec(&cmd->host->host_busy);
       atomic_dec(&cmd->device->device_busy);

+       if (cmd->host->in_recovery && atomic_read(&cmd->host->host_busy) == cmd->host->host_failed) {
+               SCSI_LOG_ERROR_RECOVERY(5, printk("scsi_mlqueue_insert : Waking error handler thread (%d)\n",
+                                                atomic_read(&cmd->host->eh_wait->count)));
+               up(cmd->host->eh_wait);
+       }
+
       /*
        * Insert this command at the head of the queue for it's device.
        * It will go before all other commands that are already in the queue.
Comment 2 Wendy Cheng 2004-11-29 16:45:53 EST
Ditto for RHEL 3. Let me know if another bugzilla is required.
Comment 3 Ernie Petrides 2004-12-02 21:59:37 EST
Wendy, if the fix needs to go in RHEL3, then yes we need an additional
bugzilla report (because our Errata need to reference different BZs,
as does our management).  When you open the RHEL3 one, please also
reference this BZ in a comment in the new BZ.  Thanks in advance.
Comment 5 Jim Paradis 2005-05-05 19:11:57 EDT
Adding dledford to cc list for sanity checking.  Doug, does the patch in Comment
#1 look good to you?
