Bug 694625

Summary:	Non-responsive scsi target leads to excessive scsi recovery and dm-mp failover time
Product:	Red Hat Enterprise Linux 5	Reporter:	Dave Wysochanski <dwysocha>
Component:	kernel	Assignee:	Mike Christie <mchristi>
Status:	CLOSED ERRATA	QA Contact:	Gris Ge <fge>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.5	CC:	amark, anton, bubrown, ccui, coughlan, ctatman, dhoward, djeffery, dwysocha, fge, ichute, marting, mchristi, mgoodwin, plyons, rdassen, revers, xdl-redhat-bugzilla
Target Milestone:	rc	Keywords:	Reopened, ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	kernel-2.6.18-287.el5	Doc Type:	Bug Fix
Doc Text:	In error recovery, most SCSI error recovery stages send a TUR (Test Unit Ready) command for every bad command when a driver error handler reports success. When several bad commands pointed to a same device, the device was probed multiple times. When the device was in a state where it did not respond to commands even after a recovery function returned success, the error handler had to wait for the commands to time out. This significantly impeded the recovery process. With this update, SCSI mid-layer error routines to send test commands have been fixed to respond once per device instead of once per bad command, thus reducing error recovery time considerably.	Story Points:	---
Clone Of:	691945	Environment:
Last Closed:	2012-02-21 03:34:04 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	691945
Bug Blocks:	741273

Description Dave Wysochanski 2011-04-07 19:22:57 UTC

+++ This bug was initially created as a clone of Bug #691945 +++

A non-responsive scsi target that "black hole"s commands leads to full scsi recovery logic, which includes sending TURs at various stages and waiting for timeouts.  This process may take a long time, and it blocks dm-mp failover for multiple minutes, which may easily exceed an important application timeout (such as an Oracle voting disk).

In a typical scenario, the net-net is a customer who goes through the expense of implementing a dual-fabric, fully redundant storage environment sees the environment not failover as expected in the case of such a "black hole" or "slow draining" target, and thus, the expensive redundancy they paid for does not work.

Steps to Reproduce:
Configure a scsi target setup so that commands are silently dropped, but no transport or other error is reported (all commands timeout).  One way to do this is with scsi_debug.
  
Actual results:
scsi recovery logic can take many minutes to complete, which blocks dm-mp from failing over to another path, and renders expensive dual fabric setups ineffective.

Expected results:
scsi error recovery, and thus, dm-mp, should complete in a reasonable amount of time so that application timeouts (such as oracle voting disk) do not get triggered if the "black hole" target type failure occurs.

Additional info:
David Jeffery has started a patchset which will address this issue and posted it to linux-scsi: http://www.spinics.net/lists/linux-scsi/msg51090.html

--- Additional comment from djeffery on 2011-03-30 17:30:25 EDT ---

Created attachment 488905 [details]
Reduce # of turs sent during scsi error recovery

Attached is a RHEL6 version of the patch which as be submitted (but not yet accepted) upstream.

Comment 1 RHEL Program Management 2011-06-20 22:39:19 UTC

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7 and Red Hat does not plan to fix this issue the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 2 RHEL Program Management 2011-08-09 21:59:29 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Jarod Wilson 2011-10-10 21:46:11 UTC

Patch(es) available in kernel-2.6.18-287.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 9 Jarod Wilson 2011-10-10 21:49:29 UTC

Patch(es) available in kernel-2.6.18-287.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 11 Gris Ge 2011-11-23 10:12:28 UTC

Dave,

I got no difference after this patch on kernel -298

scsi_debug downloaded from http://lacrosse.corp.redhat.com/~fge/scsi_debug/
=======
[root@intel-canoepass-02 ~]# modprobe scsi_debug dev_size_mb=100 opts=4 
[root@intel-canoepass-02 ~]# date
Wed Nov 23 05:03:52 EST 2011
[root@intel-canoepass-02 ~]# echo -1 > /sys/module/scsi_debug/parameters/every_nth
[root@intel-canoepass-02 ~]# dd if=/dev/sdb of=/dev/null count=1 iflag=direct

dd: reading `/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 120.01 seconds, 0.0 kB/s
[root@intel-canoepass-02 ~]# date
Wed Nov 23 05:05:52 EST 2011
========

Comment 12 Mike Christie 2011-11-23 19:48:11 UTC

(In reply to comment #11)
> Dave,
> 
> I got no difference after this patch on kernel -298
> 
> scsi_debug downloaded from http://lacrosse.corp.redhat.com/~fge/scsi_debug/
> =======
> [root@intel-canoepass-02 ~]# modprobe scsi_debug dev_size_mb=100 opts=4 
> [root@intel-canoepass-02 ~]# date
> Wed Nov 23 05:03:52 EST 2011
> [root@intel-canoepass-02 ~]# echo -1 >
> /sys/module/scsi_debug/parameters/every_nth

Every 1 might be too aggressive. Try something like every 4th or 5th or even 2.

Comment 13 Martin Prpič 2011-11-29 17:59:01 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In error recovery, most SCSI error recovery stages send a TUR (Test Unit Ready) command for every bad command when a driver error handler reports success. When several bad commands pointed to a same device, the device was probed multiple times. When the device was in a state where it did not respond to commands even after a recovery function returned success, the error handler had to wait for the commands to time out. This significantly impeded the recovery process. With this update, SCSI mid-layer error routines to send test commands have been fixed to respond once per device instead of once per bad command, thus reducing error recovery time considerably.

Comment 14 Martin George 2012-01-06 09:05:29 UTC

Chris,

I see the ZStream keyword added here. So is this already queued for 5.7.z?

Comment 15 Chris Tatman 2012-01-06 16:54:42 UTC

The errata announcement for the fix is here:

http://rhn.redhat.com/errata/RHSA-2011-1479.html

Comment 16 Gris Ge 2012-01-29 03:25:44 UTC

Mike Christie,

I still got 6 minutes timeout after change every_nth=2

Both kernel -274 and kernel -305 got 6 minutes timeout.

===
[root@RHEL5-Stable ~]# uname -r
2.6.18-305.el5
[root@RHEL5-Stable ~]# modprobe scsi_debug dev_size_mb=100 opts=4
[root@RHEL5-Stable ~]# sleep 1
[root@RHEL5-Stable ~]# date
Sun Jan 29 11:00:45 CST 2012
[root@RHEL5-Stable ~]# echo 2 > /sys/module/scsi_debug/parameters/every_nth
[root@RHEL5-Stable ~]# dd if=/dev/sdb of=/dev/null count=1 iflag=direct
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000649 seconds, 789 kB/s
[root@RHEL5-Stable ~]# dd if=/dev/sdb of=/dev/null count=1 iflag=direct
dd: reading `/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 359.967 seconds, 0.0 kB/s
[root@RHEL5-Stable ~]# date
Sun Jan 29 11:06:45 CST 2012
===

The dmesg show 6 minutes is by design:
===
sd 1:0:0:0: timing out command, waited 360s
sd 1:0:0:0: Unhandled error code
sd 1:0:0:0: SCSI error: return code = 0x06000000
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
===

Am I use the incorrect way to test this bug?
I assume, multipath is use (mostly) directio to check link status, so I choose dd to verify the timeout.

Comment 17 Mike Christie 2012-01-30 23:11:33 UTC

Ah you know what, I do not think we will be able to use scsi_debug to test this.

What you hit was the initial scsi command time out (/sys/block/sdX/device/timeout).

What we changed is the behavior after that. For that we would need a way to control how the IO sent to fix up that problem is failed.

When I made/tested the patch I had to actually modify the code to simulate what people were seeing.

Comment 18 Gris Ge 2012-01-31 01:58:27 UTC

Thanks for the detail.

All my multipath FC failover tests in RHEL 5.8 show no regression, I will set this bug as Sanity Only.

Comment 19 errata-xmlrpc 2012-02-21 03:34:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html