Bug 691945

Summary:

Non-responsive scsi target leads to excessive scsi recovery and dm-mp failover time

Product:

Red Hat Enterprise Linux 6

Reporter:

Dave Wysochanski <dwysocha>

Component:

kernel

Assignee:

Mike Christie <mchristi>

Status:

CLOSED ERRATA

QA Contact:

Storage QE <storage-qe>

Severity:

high

Docs Contact:

Priority:

medium

Version:

6.0

CC:

akarlsso, amark, bdonahue, bubrown, dhoward, djeffery, dwysocha, fge, kzhang, mgoodwin, plyons, rdassen, soft-linux-drv

Target Milestone:

Keywords:

Reopened, ZStream

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

kernel-2.6.32-171.el6

Doc Type:

Bug Fix

Doc Text:

During error recovery, most SCSI error recovery stages sent a TUR (Test Unit Ready) command for every bad command once a driver error handler reported success. When several bad commands pointed to the same device, the device was probed multiple times. If the device still did not respond to commands even after a recovery function returned success, the error handler had to wait for each of these commands to time out, which significantly slowed the recovery process. With this update, the SCSI mid-layer error handling routines send test commands once per device instead of once per bad command, which considerably reduces error recovery time.

Story Points:

---

Clone Of:

Clones:

694625 (view as bug list)

Environment:

Last Closed:

2011-12-06 12:47:24 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

672437, 694625, 744811, 767187, 833603, 846704, 848463

Attachments:

Description	Flags
Reduce # of turs sent during scsi error recovery	none

Description Dave Wysochanski 2011-03-29 22:05:59 UTC

A non-responsive SCSI target that "black holes" commands triggers the full SCSI error recovery logic, which includes sending TURs (Test Unit Ready commands) at various stages and waiting for them to time out.  This process can take a long time and blocks dm-mp failover for multiple minutes, which can easily exceed an important application timeout (such as that of an Oracle voting disk).

In a typical scenario, the net result is that a customer who has gone to the expense of implementing a dual-fabric, fully redundant storage environment sees it not fail over as expected when a target becomes a "black hole" or is "slow draining", so the redundancy they paid for does not work.

Steps to Reproduce:
Configure a SCSI target so that commands are silently dropped but no transport or other error is reported (all commands time out).  One way to do this is with scsi_debug.
  
Actual results:
SCSI recovery logic can take many minutes to complete, which blocks dm-mp from failing over to another path and renders expensive dual-fabric setups ineffective.
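
To give a rough sense of how the delay adds up, here is a back-of-the-envelope sketch. It assumes the simplest model: every failed command is probed with one TUR after each recovery stage, and every TUR runs to its full timeout because the target never answers. The function name and all numbers (32 outstanding commands, 3 stages, a 10 second TUR timeout) are illustrative assumptions, not measurements from this bug.

```python
def worst_case_probe_wait(bad_commands: int, stages: int, tur_timeout_s: float) -> float:
    """Seconds spent waiting on TUR timeouts in the simplified model.

    Assumes one TUR per failed command per recovery stage, and that every
    TUR times out because the target silently drops it.
    """
    return bad_commands * stages * tur_timeout_s


# Hypothetical example values, chosen only to illustrate the scaling.
wait_s = worst_case_probe_wait(bad_commands=32, stages=3, tur_timeout_s=10.0)
print(f"time spent waiting on TURs: {wait_s:.0f} s ({wait_s / 60:.0f} min)")
```

In this model the wait grows linearly with the number of outstanding commands, which is why a busy path to a "black hole" target can stall failover far longer than an idle one.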

Expected results:
SCSI error recovery, and thus dm-mp failover, should complete in a reasonable amount of time so that application timeouts (such as that of an Oracle voting disk) are not triggered when a "black hole" target failure occurs.

Additional info:
David Jeffery has started a patchset that addresses this issue and has posted it to linux-scsi: http://www.spinics.net/lists/linux-scsi/msg51090.html

Comment 1 David Jeffery 2011-03-30 21:30:25 UTC

Created attachment 488905 [details]
Reduce # of turs sent during scsi error recovery

Attached is a RHEL6 version of the patch, which has been submitted (but not yet accepted) upstream.

Comment 2 Dave Wysochanski 2011-04-07 18:23:22 UTC


*** This bug has been marked as a duplicate of bug 672437 ***

Comment 3 Dave Wysochanski 2011-04-07 19:17:45 UTC

Re-opening to track this specific case separately, as BZ 672437 is more general.

Comment 4 RHEL Program Management 2011-05-13 15:24:04 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 6 Mike Christie 2011-07-14 05:44:48 UTC

Thanks for your work on this, David. The patch was sent to rh-kernel for review and merging.

Comment 8 Kyle McMartin 2011-07-25 13:06:50 UTC

Patch(es) available on kernel-2.6.32-171.el6

Comment 12 Tomas Capek 2011-11-23 15:35:16 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
During error recovery, most SCSI error recovery stages sent a TUR (Test Unit Ready) command for every bad command once a driver error handler reported success. When several bad commands pointed to the same device, the device was probed multiple times. If the device still did not respond to commands even after a recovery function returned success, the error handler had to wait for each of these commands to time out, which significantly slowed the recovery process. With this update, the SCSI mid-layer error handling routines send test commands once per device instead of once per bad command, which considerably reduces error recovery time.
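
The difference between probing once per bad command and once per device can be sketched in a few lines. This is a simplified Python model written for illustration only; it is not the kernel patch, and the names `Command`, `probes_per_command`, and `probes_per_device` are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for a failed SCSI command: a tag plus its target device.
Command = namedtuple("Command", ["tag", "device"])


def probes_per_command(failed):
    """Old behavior in this model: one TUR for every failed command."""
    return [cmd.device for cmd in failed]


def probes_per_device(failed):
    """New behavior in this model: one TUR per distinct device, in first-seen order."""
    seen = set()
    probes = []
    for cmd in failed:
        if cmd.device not in seen:
            seen.add(cmd.device)
            probes.append(cmd.device)
    return probes


failed = [
    Command(1, "sda"),
    Command(2, "sda"),
    Command(3, "sdb"),
    Command(4, "sda"),
]

print(len(probes_per_command(failed)), "probes when sending one TUR per command")
print(len(probes_per_device(failed)), "probes when sending one TUR per device")
```

With four failed commands spread over two devices, the per-device variant issues two probes instead of four. When each probe has to run to its timeout, the saving scales with the number of commands queued to each unresponsive device.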

Comment 13 errata-xmlrpc 2011-12-06 12:47:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html

Comment 14 Rob Evers 2013-03-19 18:21:41 UTC

*** Bug 631765 has been marked as a duplicate of this bug. ***