Bug 106399
| Summary: | SCSI I/O stall problem | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Ingolf Salm <salm> |
| Component: | kernel | Assignee: | Doug Ledford <dledford> |
| Status: | CLOSED DUPLICATE | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.0 | CC: | mpeschke, petrides |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-02-21 18:58:59 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

I do not know anything about LTC 3867. Ingolf, please provide something tangible to identify the issue.

This is an experience from pSeries-land (copied from LTC bugzilla 3876):

When issuing the WRITE_BUFFER command, there are 5 commands (ibmsis device driver queue depth) that are queued up behind the WRITE_BUFFER. After the WRITE_BUFFER completes, the commands all fail since the device is not yet ready for commands. This causes the device driver to go into recovery. As part of this recovery, the device driver blocks new commands from being processed. When the scsi mid-layer issues a command to the device driver which is then blocked, I found that the command ends up on the MLQUEUE. It appears that the error handler routines regard the commands on the MLQUEUE as active commands and will not proceed in error processing when commands are on the MLQUEUE. Unfortunately, commands are not issued from the MLQUEUE because the device state is device_blocked. This appears to have caused the scsi mid-layer lockup I experienced. As a quick test I changed scsi_unjam_host so that commands in SCSI_STATE_MLQUEUE are also not counted as active commands. With that change I was able to issue the WRITE_BUFFER and resume device operations afterwards.

This is our own zSeries analysis:

The recipe for this one is: deliver some QUEUE_FULL statuses to Linux, along with something else that fails or times out. The former make the corresponding commands end up in a retriable, waiting state; the latter causes the recovery to be started after all pending commands have been finished or timed out. The mishap occurs when the recovery thread gets running for the timed out or failed commands and tries to make sure that I/O has been quiesced (it prefers not to be disturbed during the fulfillment of its arduous duty). It mistakenly interprets those retriable, unfinished commands (the QUEUE_FULL ones) as active commands which hinder recovery. (Why doesn't it cry out loud then at least, argh?!?!)
As a consequence, it falls asleep again, expecting to be woken up once these active commands have finished. But these unfinished commands (still the QUEUE_FULL ones) just doze until I/O is resumed (usually done after the recovery is finished), while the recovery is taking a veeeerry long nap.

Created attachment 94974: patch as proposed by LTC bugzilla 3876
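
To illustrate the stall analysed in the comments above, here is a minimal userspace sketch in C. It is not the kernel code and not the attached patch: the `cmd_state` values and the functions `count_active` and `recovery_can_start` are hypothetical names invented for this illustration. It only shows why counting MLQUEUE commands as active keeps the recovery thread waiting forever, and why excluding them lets recovery proceed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command states, loosely modelled on the report's wording. */
enum cmd_state { CMD_ACTIVE, CMD_MLQUEUE, CMD_FAILED, CMD_TIMEOUT, CMD_FINISHED };

/*
 * Count the commands the recovery thread believes it must wait for.
 * When count_mlqueue_as_active is nonzero, commands parked on the
 * mid-layer queue are counted too (the behaviour described as the bug).
 */
static int count_active(const enum cmd_state *cmds, size_t n,
                        int count_mlqueue_as_active)
{
    int active = 0;

    assert(cmds != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cmds[i] == CMD_ACTIVE)
            active++;
        else if (cmds[i] == CMD_MLQUEUE && count_mlqueue_as_active)
            active++;
    }
    return active;
}

/* Recovery starts only when nothing is counted as active. */
static int recovery_can_start(const enum cmd_state *cmds, size_t n,
                              int count_mlqueue_as_active)
{
    return count_active(cmds, n, count_mlqueue_as_active) == 0;
}
```

With two commands parked after QUEUE_FULL and one timed-out command, `recovery_can_start` returns 0 when MLQUEUE commands count as active. Since the parked commands are only reissued after recovery, that is the deadlock. With the flag cleared it returns 1.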
OK, so LTC 3867 was a typo, it's LTC 3876.

Duping with bz#105146.

*** This bug has been marked as a duplicate of 104146 ***

*** This bug has been marked as a duplicate of 106146 ***

Some thoughts exchanged via mail that I think belong here:

------------------------------------------------------------------------
from Doug replying to Pete

> Here's an example from bz#106146:
> After the WRITE_BUFFER completes the commands all fail since the
> device is not yet ready for commands. This causes the device driver
> to go into recovery. As part of this recovery, the device driver
> blocks new commands from being processed. When the scsi mid-layer
> issues a command to the device driver which is then blocked,
> I found that the command ends up on the MLQUEUE. [...]
>
> OK, it's a good attempt - almost what I need, BUT it assumes
> a very good familiarity with the code. May be enough for Doug.
> How exactly does the command end up in the mlqueue? Device driver
> blocks commands ... how? What calls what and how does the
> command status change, at what point does host_busy change?
>
> In case of host_busy, I can see where you are going with
> the patch... maybe. Doug moved the increment of host_busy up the
> scsi_request_fn a little. So, if the command ends on the
> mlqueue at or between calls to blkdev_entry_next_request
> and scsi_init_cmd_errh, the old check in scsi_mlqueue_insert
> may be wrong. Now, is that so? What precisely is the
> erroneous path of the command?
>
> I would appreciate a timely response. It would be very helpful.

I think I can answer this one. This is from memory, so don't shoot me if I have details wrong. Due to the device being busy after the write buffer command, following commands get delayed. This delay causes them to get put on the MLQUEUE (Mid-Layer QUEUE). Now, there is a known bug in the MLQUEUE implementation. If a command gets put on the MLQUEUE and there is no outstanding command that will complete eventually, the command sits on the MLQUEUE forever.
This is because there is no timer to kick the MLQUEUE. Right now, the MLQUEUE is *only* kicked on command completion. (This can be seen very easily by having a device return QUEUE_FULL when there are 0 outstanding commands) Now, in the patch in the bugzilla, they solved the problem of commands on the MLQUEUE not being treated as inactive by the error handler, which I think is OK. But, I don't think they addressed the real source of the problem in the first place, which is that commands placed on the MLQUEUE need to have their timer reset to a very short time in the future (maybe HZ/5) and then on timeout, if the command is on the MLQUEUE, remove it from the queue and call scsi_dispatch_command to resend the command. This will result in the command being retried 5 times a second until it completes (as long as the error on failure is just a delay type error). So, Pete, I'm assigned to this one and I'll get the full change in (I've just been too busy on other stuff to do it so far). It's OK as far as it goes, but needs a bit more to be complete.

-------------------------------------------------------------------------
from Martin to Doug:

Doug, Pete

Mike Anderson is on cc now since he provided the fix we are discussing here. (Mike, it's about LTC bugzilla #3876. Do you have access to the corresponding RedHat bugzilla #106399?)

> Now, there is a known bug in the MLQUEUE
> implementation. If a command gets put on the MLQUEUE and there is no
> outstanding command that will complete eventually, the command sits on
> the MLQUEUE forever. This is because there is no timer to kick the
> MLQUEUE.

That's certainly true.

> Right now, the MLQUEUE is *only* kicked on command
> completion. (This can be seen very easily by having a device return
> QUEUE_FULL when there are 0 outstanding commands) Now, in the patch in
> the bugzilla, they solved the problem of commands on the MLQUEUE not
> being treated as inactive by the error handler, which I think is OK.
> But, I don't think they addressed the real source of the problem in the
> first place, which is that commands placed on the MLQUEUE need to have their
> timer reset to a very short time in the future (maybe HZ/5) and then on
> timeout, if the command is on the MLQUEUE, remove it from the queue and
> call scsi_dispatch_command to resend the command. This will result in
> the command being retried 5 times a second until it completes (as long
> as the error on failure is just a delay type error).

It seems to boil down to the question of whether we want to be done with mlqueue commands prior to having the error recovery thread do its work on other commands, or whether we allow commands to doze in the mlqueue until the error recovery thread has finished. Your proposal is about the former; Mike's patch refers to the latter. And without a patch we either deadlock in the error recovery (as described in #106399) or some device starves (as described above).

Having read Doug's comment, I am attracted by the idea of having all mlqueue commands finished prior to waking up the recovery thread. I am wondering whether any command successfully recovered and hence finished could trigger mlqueue operation to be continued while the recovery thread is still up. Given the fragility of the 2.4 SCSI code, it might be bad to have something else going on during recovery. But, I am not sure whether that would be possible at all. Might be a fishy thought.

> So, Pete, I'm
> assigned to this one and I'll get the full change in (I've just been too
> busy on other stuff to do it so far). It's OK as far as it goes, but
> needs a bit more to be complete.

Would this timer approach also be used for commands which have been answered with BUSY SCSI status by devices? Currently, these commands are retried immediately. I think, a delay is also needed there.
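
The difference between a queue that is only kicked on command completion and one that is retried on a short timer can be shown with a small simulation. This is a sketch under stated assumptions, not kernel code: `SIM_HZ`, `device_ready` and `dispatch_with_retry` are hypothetical names, the tick rate of 100 is assumed for the simulation only, and the device is assumed to become ready at a fixed tick with no other commands outstanding.

```c
#include <assert.h>

#define SIM_HZ 100 /* assumed tick rate, for this simulation only */

/* Returns 1 if the simulated device accepts a command at this tick. */
static int device_ready(int tick, int ready_at)
{
    return tick >= ready_at;
}

/*
 * Dispatch one command at tick 0 with no other commands outstanding.
 * retry_interval == 0 models a queue that is kicked only on command
 * completion; a positive value models a retry timer firing every
 * retry_interval ticks. Returns the tick at which the command is
 * accepted, or -1 if that never happens within max_ticks.
 */
static int dispatch_with_retry(int ready_at, int retry_interval, int max_ticks)
{
    assert(ready_at >= 0 && retry_interval >= 0 && max_ticks >= 0);

    if (device_ready(0, ready_at))
        return 0;
    if (retry_interval == 0)
        return -1; /* nothing will ever complete, so nothing kicks the queue */

    for (int tick = retry_interval; tick <= max_ticks; tick += retry_interval) {
        if (device_ready(tick, ready_at))
            return tick;
    }
    return -1;
}
```

For a device that becomes ready at tick 50, `dispatch_with_retry(50, 0, 1000)` never succeeds, while a retry interval of `SIM_HZ / 5` (20 ticks) gets the command accepted on the third retry.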
-----------------------------------------------------------------------------
Doug's reply to Martin:

> Would this timer approach also be used for commands which have been
> answered with BUSY SCSI status by devices? Currently, these commands
> are retried immediately. I think, a delay is also needed there.

Yes, that was part of my intent with this. Solve more than one problem with the patch. I would have to change the MLQUEUE somewhat. Currently it sticks the command back on the request queue itself with all the other unprocessed commands, and that's not right for what I want. So each device will have to have a new delayed_queue added to the device struct (the queue will just be a struct list_head item, actually). Then I'll use the scsi timer to actually time out the command, and use scsi_timesout to take the command off this delayed list and resend it. I'll also make a slight change to scsi_request_fn() so that whenever the request queue does get kicked, it checks the delayed_queue on the device first. If there are delayed commands, it returns without processing the queue; the delayed commands should go first to preserve ordering (not used now but might be in the future). Finally, I'll modify all the REDO type stuff in the scsi done routines so that on a redo we don't immediately hit the device again, since that can actually lock the scsi bus up with infinite retries for those commands.

But, all in all, I think that's a highly preferable solution to the problem. Now, you also want it to not lock up in the event of a failure of some sort, so the error recovery change is also needed (with some slight modifications to how it handles commands on the MLQUEUE, obviously, to match the other modifications). And, obviously, all of this will have to be a coordinated patch that covers both old and new eh style drivers ;-)

I am not able to access #106146, which this bug is supposed to be a duplicate of. Could someone give me access?
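
The per-device delayed queue that Doug outlines in his reply to Martin can be sketched in userspace C as follows. This is only an illustration of the ordering rule he describes, not his eventual patch: `struct sim_device`, `sim_dispatch`, `sim_request_fn` and `sim_timeout` are hypothetical names, and a small fixed array stands in for the struct list_head he mentions.

```c
#include <assert.h>

#define MAX_DELAYED 8

/* Hypothetical stand-in for a device with a per-device delayed queue. */
struct sim_device {
    int delayed[MAX_DELAYED]; /* FIFO of delayed command ids */
    int n_delayed;
    int device_ready;         /* nonzero when the device accepts commands */
    int last_sent;            /* id of the last command sent, -1 if none */
};

static void sim_init(struct sim_device *d)
{
    d->n_delayed = 0;
    d->device_ready = 0;
    d->last_sent = -1;
}

/* Send a command, or park it on the delayed queue if the device is busy. */
static int sim_dispatch(struct sim_device *d, int id)
{
    if (d->device_ready) {
        d->last_sent = id;
        return 1;
    }
    assert(d->n_delayed < MAX_DELAYED);
    d->delayed[d->n_delayed++] = id;
    return 0;
}

/*
 * Request function: while delayed commands exist, return without
 * processing new work so the delayed commands go first. A held-back
 * command is assumed to stay on the caller's request queue.
 */
static int sim_request_fn(struct sim_device *d, int new_id)
{
    if (d->n_delayed > 0)
        return 0;
    return sim_dispatch(d, new_id);
}

/* Timer expiry: resend the oldest delayed command if the device is ready. */
static int sim_timeout(struct sim_device *d)
{
    if (d->n_delayed == 0 || !d->device_ready)
        return 0; /* nothing to do, or still busy (timer would be re-armed) */

    int id = d->delayed[0];
    for (int i = 1; i < d->n_delayed; i++)
        d->delayed[i - 1] = d->delayed[i];
    d->n_delayed--;
    d->last_sent = id;
    return 1;
}
```

In this sketch a command that hits a busy device is parked, later requests are held back until the timer has resent it, and only then does new work reach the device, which preserves ordering.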
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:
This problem is described in LTC bugzilla 3867. To resolve the problem, a common code patch for a severe SCSI I/O stall problem was sent to RH for review and integration. Please contact Pete Zaitcev for more information.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1.
2.
3.

Additional info: