Bug 505122 - Make Aborted Command (internal target failure) retryable at SCSI layer (sense B 44 00)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.9
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Mike Christie
QA Contact: Evan McNabb
URL:
Whiteboard:
Depends On:
Blocks: 514007
Reported: 2009-06-10 18:04 UTC by Bryn M. Reeves
Modified: 2018-10-20 01:57 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A bug in the SCSI implementation caused "Aborted Command - internal target failure" errors to be sent to Device-Mapper Multipath, without retries, resulting in Device-Mapper Multipath marking the path as failed and making a path group switch. With this update, all errors that return a sense key in the SCSI mid layer (including "Aborted Command - internal target failure") are retried.
Clone Of:
Clones: 505123
Environment:
Last Closed: 2011-02-16 15:21:01 UTC
Target Upstream Version:
Embargoed:


Attachments
don't failfast dev errors (528 bytes, application/octet-stream)
2009-06-10 18:34 UTC, Mike Christie


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2011:0263 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update | 2011-02-16 15:14:55 UTC

Description Bryn M. Reeves 2009-06-10 18:04:57 UTC
Description of problem:
The current RHEL4 SCSI implementation returns I/Os that fail with a sense key of 0xB and ASC/ASCQ of 0x44/0x0 (Aborted Command - internal target failure) to the device-mapper multipath target immediately, without any retries at the SCSI layer, because multipath uses BIO_RW_FAILFAST.

This causes multipath to mark the path as failed and perform a path group switch, retrying the I/O down a different path. The failed path will then usually be scheduled for a check on the next polling interval and reinstated assuming the condition on the target has been cleared (this sense buffer is commonly seen with transient errors on storage arrays).
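For reference, a minimal sketch of how the sense key 0xB and ASC/ASCQ 0x44/0x0 values above can be picked out of a fixed-format sense buffer. This is illustrative userspace C, not kernel code: the struct and function names (sense_triple, decode_fixed_sense, is_internal_target_failure) are made up for this example, and only fixed-format sense data (response code 0x70 or 0x71) is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type for this example; not a kernel structure. */
struct sense_triple {
    uint8_t key;   /* sense key: low nibble of byte 2 */
    uint8_t asc;   /* additional sense code: byte 12 */
    uint8_t ascq;  /* additional sense code qualifier: byte 13 */
};

#define SENSE_KEY_ABORTED_COMMAND 0x0b

/*
 * Decode fixed-format sense data.
 * Returns 0 on success, -1 if the buffer is too short to hold ASC/ASCQ
 * or does not carry a fixed-format response code (0x70 or 0x71).
 */
static int decode_fixed_sense(const uint8_t *buf, size_t len,
                              struct sense_triple *out)
{
    uint8_t response_code;

    assert(buf != NULL && out != NULL);

    if (len < 14)
        return -1;

    response_code = buf[0] & 0x7f;  /* bit 7 is the VALID bit */
    if (response_code != 0x70 && response_code != 0x71)
        return -1;

    out->key = buf[2] & 0x0f;
    out->asc = buf[12];
    out->ascq = buf[13];
    return 0;
}

/* True for "Aborted Command - internal target failure" (B/44/00). */
static int is_internal_target_failure(const struct sense_triple *s)
{
    return s->key == SENSE_KEY_ABORTED_COMMAND &&
           s->asc == 0x44 && s->ascq == 0x00;
}
```

Given a buffer with byte 0 set to 0x70, byte 2 to 0x0b, byte 12 to 0x44 and byte 13 to 0x00, decode_fixed_sense() succeeds and is_internal_target_failure() returns non-zero.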

Version-Release number of selected component (if applicable):
2.6.9-*.EL

How reproducible:
100% given the right storage behaviour but a bit tricky to generate on demand

Steps to Reproduce:
1. Induce a situation on the storage controller that will cause I/Os to be failed with a sense buffer of Aborted Command - internal target failure. E.g. this has been seen frequently with EMC Symmetrix SRDF LUNs where the R1 will sporadically spit these errors out when brief changes in the SRDF link status happen.

2. Observe SCSI errors logged to dmesg

  
Actual results:
SCSI errors logged, multipath marks path as failed

Expected results:
No SCSI error logged unless midlayer retry count / timeouts exceeded. Multipath does not mark path as failed
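As a rough illustration of the difference between the actual and expected results, the retry decision could be modelled as below. This is a simplified sketch and not the midlayer code or any patch for this bug: the types and names (cmd_state, decide_failfast_everything, decide_retry_device_errors) are hypothetical, and the assumption is that a command carrying a fail-fast hint should still be retried up to the normal limit when the target returned sense data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one completion; not the kernel's disposition codes. */
enum disposition {
    DISP_RETRY,               /* reissue the command at the SCSI layer */
    DISP_FAIL_TO_UPPER_LAYER  /* hand the error up (e.g. to multipath) */
};

/* Hypothetical per-command state for this example. */
struct cmd_state {
    int retries;     /* retries already attempted */
    int allowed;     /* retry limit */
    bool failfast;   /* request was submitted with a fail-fast hint */
    bool has_sense;  /* target returned sense data (a device error) */
};

/* Actual results: a fail-fast request is never retried. */
static enum disposition decide_failfast_everything(struct cmd_state *c)
{
    assert(c != NULL);
    if (!c->failfast && c->retries < c->allowed) {
        c->retries++;
        return DISP_RETRY;
    }
    return DISP_FAIL_TO_UPPER_LAYER;
}

/*
 * Expected results (assumed policy): device errors are retried up to the
 * limit even for fail-fast requests; other errors still fail fast.
 */
static enum disposition decide_retry_device_errors(struct cmd_state *c)
{
    bool may_retry;

    assert(c != NULL);
    may_retry = c->has_sense || !c->failfast;
    if (may_retry && c->retries < c->allowed) {
        c->retries++;
        return DISP_RETRY;
    }
    return DISP_FAIL_TO_UPPER_LAYER;
}
```

With allowed set to 3 and both failfast and has_sense set, the first function fails the command on the first error, while the second retries three times before passing the failure up.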

Additional info:
This behaviour causes undesirable path switching similar to the transport/framing error cases we recently converted to DO_IMM_RETRY in qla2xxx (DID_TRANSPORT_DISRUPTED in RHEL5):

    bug 490744 [RHEL4]
    bug 244967 [RHEL5]

Comment 2 Mike Christie 2009-06-10 18:34:27 UTC
Created attachment 347276
don't failfast dev errors

Adding devel ack and patch.

This solves the problem of the SCSI layer fast-failing these errors, so we match upstream and RHEL5 behavior.

Comment 3 Bryn M. Reeves 2009-06-11 13:40:12 UTC
Thanks Mike. I'll get a test build of this done today & out for testing. Don't expect any problems but will make sure that this gets a run in the environment that's currently seeing these problems.

Comment 8 RHEL Program Management 2009-07-21 18:33:48 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 15 Vivek Goyal 2009-08-04 13:37:06 UTC
Committed in 89.8.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 21 Douglas Silas 2011-01-30 23:47:13 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A bug in the SCSI implementation caused "Aborted Command - internal target failure" errors to be sent to Device-Mapper Multipath, without retries, resulting in Device-Mapper Multipath marking the path as failed and making a path group switch. With this update, all errors that return a sense key in the SCSI mid layer (including "Aborted Command - internal target failure") are retried.

Comment 22 errata-xmlrpc 2011-02-16 15:21:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html

