741273 – Non-responsive scsi target leads to excessive scsi recovery and dm-mp failover time [rhel-5.7.z]

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Phillip Lougher
QA Contact:	Gris Ge
Docs Contact:
URL:
Whiteboard:
Depends On:	694625
Blocks:

Reported:	2011-09-26 13:19 UTC by RHEL Program Management
Modified:	2013-01-11 04:03 UTC
CC List:	12 users
Fixed In Version:	kernel-2.6.18-274.11.1.el5
Doc Type:	Bug Fix
Doc Text:	During error recovery, most SCSI error recovery stages send a TUR (Test Unit Ready) command for every bad command when a driver error handler reports success. When several bad commands pointed to the same device, the device was probed multiple times. If the device was in a state where it did not respond to commands even after a recovery function returned success, the error handler had to wait for each of the test commands to time out, which significantly slowed the recovery process. With this update, the SCSI mid-layer error handling routines send the test command once per device instead of once per bad command, which reduces error recovery time considerably.
Clone Of:
Environment:
Last Closed:	2011-11-29 14:36:25 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:1479	0	normal	SHIPPED_LIVE	Important: kernel security, bug fix, and enhancement update	2011-11-29 19:25:05 UTC

Description RHEL Program Management 2011-09-26 13:19:08 UTC

This bug has been copied from bug #694625 and has been proposed
to be backported to 5.7 z-stream (EUS).

Comment 4 Phillip Lougher 2011-11-08 16:38:25 UTC

Fixed in kernel-2.6.18-274.11.1.el5 by the following patch:

linux-2.6-scsi-reduce-error-recovery-time-by-reducing-use-of-turs.patch

Comment 6 Gris Ge 2011-11-23 10:26:23 UTC

Cannot reproduce this problem on the RHEL 5.7 GA kernel.

scsi_debug downloaded from http://lacrosse.corp.redhat.com/~fge/scsi_debug/
=======
[root@intel-canoepass-02 ~]# modprobe scsi_debug dev_size_mb=100 opts=4 
[root@intel-canoepass-02 ~]# date
Wed Nov 23 05:03:52 EST 2011
[root@intel-canoepass-02 ~]# echo -1 > /sys/module/scsi_debug/parameters/every_nth
[root@intel-canoepass-02 ~]# dd if=/dev/sdb of=/dev/null count=1 iflag=direct

dd: reading `/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 120.01 seconds, 0.0 kB/s
[root@intel-canoepass-02 ~]# date
Wed Nov 23 05:05:52 EST 2011
========

Got the same results on kernel-2.6.18-274.11.1.el5.

Code reviewed; the patch is present in kernel-2.6.18-274.11.1.el5.

Comment 7 errata-xmlrpc 2011-11-29 14:36:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1479.html

Comment 8 Martin Prpič 2011-11-29 17:58:49 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
During error recovery, most SCSI error recovery stages send a TUR (Test Unit Ready) command for every bad command when a driver error handler reports success. When several bad commands pointed to the same device, the device was probed multiple times. If the device was in a state where it did not respond to commands even after a recovery function returned success, the error handler had to wait for each of the test commands to time out, which significantly slowed the recovery process. With this update, the SCSI mid-layer error handling routines send the test command once per device instead of once per bad command, which reduces error recovery time considerably.
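
The effect described in the note can be illustrated with a small model. This is only a sketch and not the kernel code: the names `FailedCmd`, `turs_per_command`, `turs_per_device` and `worst_case_wait` are hypothetical, and the 10-second TUR timeout is an assumed figure chosen for the arithmetic. It counts how many test commands would be issued under each policy and how long recovery would wait if none of them were answered.

```python
from collections import namedtuple

# Hypothetical model of a failed SCSI command: an ID plus the device it targets.
FailedCmd = namedtuple("FailedCmd", ["cmd_id", "device"])


def turs_per_command(failed_cmds):
    """Old policy in this model: one TUR for every failed command."""
    return [cmd.device for cmd in failed_cmds]


def turs_per_device(failed_cmds):
    """Updated policy in this model: one TUR per distinct device, first-seen order."""
    seen = set()
    probes = []
    for cmd in failed_cmds:
        if cmd.device not in seen:
            seen.add(cmd.device)
            probes.append(cmd.device)
    return probes


def worst_case_wait(probes, tur_timeout_s):
    """Seconds spent waiting if every TUR goes unanswered and must time out."""
    return len(probes) * tur_timeout_s


# Eight failed commands on one device and one on another (made-up workload).
failed = [FailedCmd(i, "sdb") for i in range(8)] + [FailedCmd(8, "sdc")]

TUR_TIMEOUT_S = 10  # assumed value for illustration, not taken from the kernel

old = worst_case_wait(turs_per_command(failed), TUR_TIMEOUT_S)
new = worst_case_wait(turs_per_device(failed), TUR_TIMEOUT_S)
print(old, new)  # 90 20
```

In this model the wait scales with the number of distinct devices rather than the number of failed commands, which is the reduction the note describes.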
