Bug 741273

Summary: Non-responsive scsi target leads to excessive scsi recovery and dm-mp failover time [rhel-5.7.z]
Product: Red Hat Enterprise Linux 5 Reporter: RHEL Program Management <pm-rhel>
Component: kernel Assignee: Phillip Lougher <plougher>
Status: CLOSED ERRATA QA Contact: Gris Ge <fge>
Severity: high Docs Contact:
Priority: high    
Version: 5.5 CC: amark, anton, bubrown, ccui, dhoward, djeffery, dwysocha, fge, mchristi, mgoodwin, plyons, pm-eus
Target Milestone: rc Keywords: Reopened, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-2.6.18-274.11.1.el5 Doc Type: Bug Fix
Doc Text:
During error recovery, most SCSI error recovery stages sent a TUR (Test Unit Ready) command for every bad command whenever a driver error handler reported success. When several bad commands pointed to the same device, the device was probed multiple times. If the device was in a state where it did not respond to commands even after a recovery function returned success, the error handler had to wait for each of those commands to time out, which significantly slowed the recovery process. With this update, the SCSI mid-layer error-handling routines send the test command once per device instead of once per bad command, which reduces error recovery time considerably.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-11-29 14:36:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 694625    
Bug Blocks:    

Description RHEL Program Management 2011-09-26 13:19:08 UTC
This bug has been copied from bug #694625 and has been proposed
to be backported to 5.7 z-stream (EUS).

Comment 4 Phillip Lougher 2011-11-08 16:38:25 UTC
in kernel-2.6.18-274.11.1.el5

linux-2.6-scsi-reduce-error-recovery-time-by-reducing-use-of-turs.patch

Comment 6 Gris Ge 2011-11-23 10:26:23 UTC
Cannot reproduce this problem on RHEL 5.7 GA kernel.

scsi_debug downloaded from http://lacrosse.corp.redhat.com/~fge/scsi_debug/
=======
[root@intel-canoepass-02 ~]# modprobe scsi_debug dev_size_mb=100 opts=4 
[root@intel-canoepass-02 ~]# date
Wed Nov 23 05:03:52 EST 2011
[root@intel-canoepass-02 ~]# echo -1 > /sys/module/scsi_debug/parameters/every_nth
[root@intel-canoepass-02 ~]# dd if=/dev/sdb of=/dev/null count=1 iflag=direct

dd: reading `/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 120.01 seconds, 0.0 kB/s
[root@intel-canoepass-02 ~]# date
Wed Nov 23 05:05:52 EST 2011
========

Got same results on kernel-2.6.18-274.11.1.el5.


Code reviewed, patch found in kernel-2.6.18-274.11.1.el5
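
As a cross-check on the transcript above, the two `date` outputs can be subtracted to confirm the wall-clock time of the test. This is only a sketch: the helper name `parse_date_output` is made up for illustration, and the timezone token is dropped rather than interpreted because both stamps carry the same one. Note that the stamps bracket the `echo` as well as the `dd`, so the result is an upper bound on the `dd` run time.

```python
from datetime import datetime

def parse_date_output(line):
    """Parse `date` output such as 'Wed Nov 23 05:03:52 EST 2011'.

    The timezone token is discarded; both stamps share it, so the
    difference between them is unaffected.
    """
    parts = line.split()
    without_tz = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_tz, "%a %b %d %H:%M:%S %Y")

# The two stamps copied from the transcript in this comment.
start = parse_date_output("Wed Nov 23 05:03:52 EST 2011")
end = parse_date_output("Wed Nov 23 05:05:52 EST 2011")

elapsed_s = (end - start).total_seconds()
print(elapsed_s)
```

The difference agrees, to the second, with the 120.01 seconds that `dd` itself reported.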

Comment 7 errata-xmlrpc 2011-11-29 14:36:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1479.html

Comment 8 Martin Prpič 2011-11-29 17:58:49 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
During error recovery, most SCSI error recovery stages sent a TUR (Test Unit Ready) command for every bad command whenever a driver error handler reported success. When several bad commands pointed to the same device, the device was probed multiple times. If the device was in a state where it did not respond to commands even after a recovery function returned success, the error handler had to wait for each of those commands to time out, which significantly slowed the recovery process. With this update, the SCSI mid-layer error-handling routines send the test command once per device instead of once per bad command, which reduces error recovery time considerably.
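
The difference between probing once per bad command and once per device can be illustrated with a small simulation. This is not the kernel patch and does not reproduce the SCSI mid-layer code; `FailedCommand`, `probes_per_command`, `probes_per_device`, `worst_case_wait`, the device names, and the 10-second TUR timeout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailedCommand:
    tag: int      # hypothetical command identifier
    device: str   # hypothetical device name

def probes_per_command(failed):
    """Behaviour before the fix: one TUR for every failed command."""
    return [cmd.device for cmd in failed]

def probes_per_device(failed):
    """Behaviour after the fix: one TUR per distinct device, first-seen order."""
    seen = set()
    probes = []
    for cmd in failed:
        if cmd.device not in seen:
            seen.add(cmd.device)
            probes.append(cmd.device)
    return probes

def worst_case_wait(probes, tur_timeout_s):
    """Time spent waiting if every TUR goes unanswered and times out."""
    return len(probes) * tur_timeout_s

# Three failed commands on one device and one on another (made-up data).
failed = [
    FailedCommand(1, "sdb"),
    FailedCommand(2, "sdb"),
    FailedCommand(3, "sdb"),
    FailedCommand(4, "sdc"),
]

TUR_TIMEOUT_S = 10  # assumed value, for illustration only

before = worst_case_wait(probes_per_command(failed), TUR_TIMEOUT_S)
after = worst_case_wait(probes_per_device(failed), TUR_TIMEOUT_S)
print(before, after)
```

Under these assumptions the wait with an unresponsive target scales with the number of failed commands before the change and with the number of affected devices after it.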