Bug 106399 - SCSI I/O stall problem
Status: CLOSED DUPLICATE of bug 106146
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: s390 Linux
Priority: medium
Severity: medium
Assigned To: Doug Ledford
QA Contact: Brian Brock
Reported: 2003-10-06 15:00 EDT by Ingolf Salm
Modified: 2007-11-30 17:06 EST
Doc Type: Bug Fix
Last Closed: 2006-02-21 13:58:59 EST


Attachments
patch as proposed by LTC bugzilla 3876 (460 bytes, patch)
2003-10-07 05:44 EDT, Martin Peschke

Description Ingolf Salm 2003-10-06 15:00:01 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:
This problem is described in LTC bugzilla 3867. To resolve it, a common code
patch for a severe SCSI I/O stall problem was sent to RH for review and
integration.
Please contact Pete Zaitcev for more information.

How reproducible:
Sometimes

Comment 1 Pete Zaitcev 2003-10-06 15:12:27 EDT
I do not know anything about LTC 3867.

Ingolf, please provide something tangible to identify the issue.
Comment 2 Martin Peschke 2003-10-07 05:08:43 EDT
This is an experience from the pSeries-land (copied from LTC bugzilla 3876):

When issuing the WRITE_BUFFER command, there are 5 commands (ibmsis device 
driver queue depth) that are queued up behind the WRITE_BUFFER.  After the 
WRITE_BUFFER completes the commands all fail since the device is not yet ready 
for commands.  This causes the device driver to go into recovery.  As part of 
this recovery, the device driver blocks new commands from being processed.  
When the scsi mid-layer issues a command to the device driver which is then 
blocked, I found that the command ends up on the MLQUEUE.  It appears that the 
error handler routines regard the commands on the MLQUEUE as active commands 
and will not proceed in error processing when commands are on the MLQUEUE.  
Unfortunately, commands are not issued from the MLQUEUE because the device 
state is device_blocked.  This appears to have caused the scsi mid-layer 
lockup I experienced.

As a quick test I changed scsi_unjam_host to also ignore SCSI_STATE_MLQUEUE as
an active command.  With that change I was able to issue the WRITE_BUFFER and
resume device operations afterwards.
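
A minimal sketch of the check described above, not the actual kernel code: the enum values and both function names are hypothetical stand-ins, loosely modelled on the 2.4 mid-layer command states. It only illustrates why counting MLQUEUE commands as active keeps recovery from starting, and why ignoring them (the quick test) lets it proceed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command states for this sketch (not the kernel's definitions). */
enum cmd_state {
    STATE_ACTIVE,    /* issued to the driver, completion still pending */
    STATE_FAILED,    /* failed or timed out, waiting for recovery */
    STATE_MLQUEUE,   /* parked by the mid-layer for a later retry */
    STATE_FINISHED   /* completed */
};

/*
 * Count the commands that keep the recovery thread waiting.
 * ignore_mlqueue == 0 models the behaviour reported in this bug;
 * ignore_mlqueue != 0 models the quick test that ignores MLQUEUE commands.
 */
size_t count_blocking(const enum cmd_state *cmds, size_t n, int ignore_mlqueue)
{
    size_t blocking = 0;

    assert(cmds != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cmds[i] == STATE_ACTIVE)
            blocking++;
        else if (cmds[i] == STATE_MLQUEUE && !ignore_mlqueue)
            blocking++;
    }
    return blocking;
}

/* Recovery may start only when no command is counted as blocking. */
int recovery_can_proceed(const enum cmd_state *cmds, size_t n, int ignore_mlqueue)
{
    return count_blocking(cmds, n, ignore_mlqueue) == 0;
}
```

With one failed command and two commands parked on the MLQUEUE, `recovery_can_proceed(cmds, n, 0)` stays false for as long as the device remains blocked, while `recovery_can_proceed(cmds, n, 1)` is true.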


This is our own zSeries analysis:

The recipe for this one is: deliver some QUEUE_FULL statuses to Linux, along
with something else that fails or times out. The former make the corresponding
commands end up in a retriable, waiting state, the latter causes the recovery to
be started after all pending commands have been finished or timed out. The
mishap occurs when the recovery thread gets running for the timed out or failed
commands and tries to make sure that I/O has been quiesced (it prefers not to be
disturbed during the fulfillment of its arduous duty). It mistakenly interprets
those retriable, unfinished commands (the QUEUE_FULL ones) as active commands
which hinder recovery. (Why doesn't it cry out loud then at least, argh?!?!) In
consequence of this, it falls asleep again, assuming it will be woken up another
time when these active commands have been finished. But these unfinished commands
(still the QUEUE_FULL ones) just doze till I/O is resumed (usually done after
the recovery is finished), while the recovery is taking a veeeerry long nap.

Comment 3 Martin Peschke 2003-10-07 05:44:47 EDT
Created attachment 94974
patch as proposed by LTC bugzilla 3876
Comment 4 Pete Zaitcev 2003-10-08 15:34:28 EDT
OK, so LTC 3867 was a typo, it's LTC 3876. Duping with bz#105146.
Comment 5 Pete Zaitcev 2003-10-08 15:35:01 EDT

*** This bug has been marked as a duplicate of 104146 ***
Comment 6 Pete Zaitcev 2003-10-08 15:36:41 EDT

*** This bug has been marked as a duplicate of 106146 ***
Comment 7 Martin Peschke 2003-10-09 05:30:14 EDT
Some thoughts exchanged via mail that I think belong here:

------------------------------------------------------------------------

from Doug replying to Pete

> Here's an example from bz#106146:
>   After the WRITE_BUFFER completes the commands all fail since the
>   device is not yet ready for commands.  This causes the device driver
>   to go into recovery.  As part of this recovery, the device driver
>   blocks new commands from being processed. When the scsi mid-layer
>   issues a command to the device driver which is then blocked,
>   I found that the command ends up on the MLQUEUE.  [...]
> 
> OK, it's a good attempt - almost what I need, BUT it assumes
> a very good familiarity with the code. May be enough for Doug.
> How exactly does the command end up in the mlqueue? Device driver
> blocks commands ... how? What calls what and how does the
> command status change, at what point does host_busy change?
> 
> In case of host_busy, I can see where you are going with
> the patch... maybe. Doug moved the increment of host_busy up the
> scsi_request_fn a little. So, if the command ends up on the
> mlqueue at or between calls to blkdev_entry_next_request
> and scsi_init_cmd_errh, the old check in scsi_mlqueue_insert
> may be wrong. Now, is that so? What precisely is the
> erroneous path of the command?
> 
> I would appreciate a timely response. It would be very helpful.

I think I can answer this one.  This is from memory, so don't shoot me
if I have details wrong.

Due to the device being busy after the write buffer command, following
commands get delayed.  This delay causes them to get put on the MLQUEUE
(Mid-Layer QUEUE).  Now, there is a known bug in the MLQUEUE
implementation.  If a command gets put on the MLQUEUE and there is no
outstanding command that will complete eventually, the command sits on
the MLQUEUE forever.  This is because there is no timer to kick the
MLQUEUE.  Right now, the MLQUEUE is *only* kicked on command
completion.  (This can be seen very easily by having a device return
QUEUE_FULL when there are 0 outstanding commands)  Now, in the patch in
the bugzilla, they solved the problem of commands on the MLQUEUE not
being treated as inactive by the error handler, which I think is OK. 
But, I don't think they addressed the real source of the problem in the
first place, which is commands placed on the MLQUEUE need to have their
timer reset to a very short time in the future (maybe HZ/5) and then on
timeout, if the command is on the MLQUEUE, remove it from the queue and
call scsi_dispatch_command to resend the command.  This will result in
the command being retried 5 times a second until it completes (as long
as the error on failure is just a delay type error).  So, Pete, I'm
assigned to this one and I'll get the full change in (I've just been too
busy on other stuff to do it so far).  It's OK as far as it goes, but
needs a bit more to be complete.
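
The timer idea in the paragraph above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the patch itself: `struct parked_cmd`, `park_cmd`, `timer_tick` and `run_until_done` are hypothetical names, `HZ` is assumed to be 100, and the device is modelled as simply becoming ready at a given tick. The point it shows is that a parked command gets retried by its own timer, without depending on some other command completing.

```c
#include <assert.h>

#define HZ 100                    /* assumed tick rate for this sketch */
#define RETRY_DELAY (HZ / 5)      /* the "maybe HZ/5" delay mentioned above */

/* Hypothetical stand-in for a command parked on the MLQUEUE. */
struct parked_cmd {
    unsigned long retry_at;   /* tick at which the retry timer fires */
    int done;                 /* set once the device accepts the command */
    unsigned int attempts;    /* number of resend attempts so far */
};

/* Park a command: arm its timer a short time in the future. */
void park_cmd(struct parked_cmd *cmd, unsigned long now)
{
    assert(cmd != 0);
    cmd->retry_at = now + RETRY_DELAY;
}

/*
 * One timer tick. device_ready_at is the tick from which the simulated
 * device stops answering with a delay-type status.
 */
void timer_tick(struct parked_cmd *cmd, unsigned long now,
                unsigned long device_ready_at)
{
    if (cmd->done || now < cmd->retry_at)
        return;                   /* timer has not fired yet */
    cmd->attempts++;              /* timer fired: resend the command */
    if (now >= device_ready_at)
        cmd->done = 1;            /* device accepted it */
    else
        park_cmd(cmd, now);       /* still busy: park it again */
}

/*
 * Park the command at tick 0 and run the clock until it completes or
 * max_ticks elapse. Returns the last tick that was processed.
 */
unsigned long run_until_done(struct parked_cmd *cmd,
                             unsigned long device_ready_at,
                             unsigned long max_ticks)
{
    unsigned long now;

    park_cmd(cmd, 0);
    for (now = 1; now <= max_ticks && !cmd->done; now++)
        timer_tick(cmd, now, device_ready_at);
    return now - 1;
}
```

In this model a device that becomes ready at tick 50 sees retries at ticks 20, 40 and 60, and the command completes on the third attempt even though no other command was outstanding.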

-------------------------------------------------------------------------

from Martin to Doug:

Doug, Pete

Mike Anderson is on cc now since he provided the fix we are
discussing here.
(Mike, it's about LTC bugzilla #3876. Do you have access to the
corresponding RedHat bugzilla #106399?)

> Now, there is a known bug in the MLQUEUE
> implementation.  If a command gets put on the MLQUEUE and there is no
> outstanding command that will complete eventually, the command sits on
> the MLQUEUE forever.  This is because there is no timer to kick the
> MLQUEUE.

That's certainly true.

> Right now, the MLQUEUE is *only* kicked on command
> completion.  (This can be seen very easily by having a device return
> QUEUE_FULL when there are 0 outstanding commands)  Now, in the patch in
> the bugzilla, they solved the problem of commands on the MLQUEUE not
> being treated as inactive by the error handler, which I think is OK. 
> But, I don't think they addressed the real source of the problem in the
> first place, which is commands placed on the MLQUEUE need to have their
> timer reset to a very short time in the future (maybe HZ/5) and then on
> timeout, if the command is on the MLQUEUE, remove it from the queue and
> call scsi_dispatch_command to resend the command.  This will result in
> the command being retried 5 times a second until it completes (as long
> as the error on failure is just a delay type error).

It seems to boil down to the question whether we want to be done with
mlqueue commands prior to having the error recovery thread doing its
work on other commands, or whether we allow commands to doze in the
mlqueue until the error recovery thread has finished. Your proposal
is about the former, Mike's patch refers to the latter. And without a
patch we either deadlock in the error recovery (as described in #106399)
or some device starves (as described above).

Having read Doug's comment, I am allured by the idea to have all
mlqueue commands finished prior to waking up the recovery thread.
I am wondering whether any command successfully recovered and hence
finished could trigger mlqueue operation to be continued while the
recovery thread is still up. Given the fragility of the 2.4 SCSI code,
it might be bad to have something else going during recovery. But, I am
not sure whether that would be possible at all. Might be a fishy thought.

> So, Pete, I'm
> assigned to this one and I'll get the full change in (I've just been too
> busy on other stuff to do it so far).  It's OK as far as it goes, but
> needs a bit more to be complete.

Would this timer approach also be used for commands which have been
answered with BUSY SCSI status by devices? Currently, these commands
are retried immediately. I think, a delay is also needed there.

-----------------------------------------------------------------------------

Doug's reply to Martin:

> Would this timer approach also be used for commands which have been
> answered with BUSY SCSI status by devices? Currently, these commands
> are retried immediately. I think, a delay is also needed there.

Yes, that was part of my intent with this.  Solve more than one problem
with the patch.  I would have to change the MLQUEUE somewhat.  Currently
it sticks the command back on the request queue itself with all the
other unprocessed commands, and that's not right for what I want.  So:

- Each device will have to have a new delayed_queue added to the device
  struct, and the queue will just be a struct list_head item actually.
- I'll use the scsi timer to actually time out the command, and use
  scsi_times_out to take the command off this delayed list and resend it.
- I'll also make a slight change to scsi_request_fn() so that whenever
  the request queue does get kicked, it checks the delayed_queue on the
  device first.  If there are delayed commands, it returns without
  processing the queue; the delayed commands should go first to preserve
  ordering (not used now but might be in the future).
- Finally, modify all the REDO type stuff in the scsi done routines so
  that on a redo we don't immediately hit the device again, since that
  can actually lock the scsi bus up with infinite retries for those
  commands.

But, all in all, I think that's a highly preferable solution to the
problem.  Now, you also want it to not lock up in the event of a failure
of some sort, so the error recovery change is also needed (with some
slight modifications to how it handles commands on the MLQUEUE,
obviously, to match the other modifications).  And, obviously, all of
this will have to be a coordinated patch that covers both old and new eh
style drivers ;-)
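
A rough sketch of the ordering rule in this proposal, under loud assumptions: `struct sketch_device`, `fifo_pop`, `request_fn` and `delayed_timeout` are hypothetical, and plain arrays stand in for the `struct list_head` queue mentioned above. It only models the rule that a kick of the request queue does nothing while delayed commands exist, so delayed commands are resent first.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CMDS 8

/* Hypothetical per-device state; delayed_queue is named after the proposal. */
struct sketch_device {
    int delayed_queue[MAX_CMDS];  /* ids awaiting a timed retry, oldest first */
    size_t delayed_len;
    int request_queue[MAX_CMDS];  /* ids not yet processed, oldest first */
    size_t request_len;
};

/* Remove and return the oldest id from a FIFO array; *len must be > 0. */
int fifo_pop(int *q, size_t *len)
{
    int id;

    assert(*len > 0);
    id = q[0];
    for (size_t i = 1; i < *len; i++)
        q[i - 1] = q[i];
    (*len)--;
    return id;
}

/*
 * Request-queue kick. Returns the id of the dispatched command, or -1 if
 * nothing was dispatched. Delayed commands go first, so the request queue
 * is left alone while any command is still on the delayed queue.
 */
int request_fn(struct sketch_device *dev)
{
    if (dev->delayed_len > 0)
        return -1;
    if (dev->request_len == 0)
        return -1;
    return fifo_pop(dev->request_queue, &dev->request_len);
}

/*
 * Timer expiry. Takes the oldest delayed command off the delayed queue so
 * it can be resent; returns its id, or -1 if the queue is empty.
 */
int delayed_timeout(struct sketch_device *dev)
{
    if (dev->delayed_len == 0)
        return -1;
    return fifo_pop(dev->delayed_queue, &dev->delayed_len);
}
```

With command 7 delayed and commands 8 and 9 waiting on the request queue, a kick dispatches nothing until the timer has resent 7; after that, 8 and 9 go out in order.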

Comment 8 Martin Peschke 2003-10-09 05:33:04 EDT
I am not able to access #106146 which this bug is supposed to be a duplicate of.
Could someone give me access?
Comment 9 Red Hat Bugzilla 2006-02-21 13:58:59 EST
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
