Bug 505123

Summary: Make Aborted Command (internal target failure) retryable at SCSI layer (sense B 44 00)
Product: Red Hat Enterprise Linux 5 Reporter: Bryn M. Reeves <bmr>
Component: kernel Assignee: Mike Christie <mchristi>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.5   
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 505122 Environment:
Last Closed: 2009-06-11 09:37:21 UTC Type: ---

Description Bryn M. Reeves 2009-06-10 18:09:18 UTC
+++ This bug was initially created as a clone of Bug #505122 +++

Description of problem:
The current RHEL5 SCSI implementation returns I/Os that fail with a sense key of 0xB and ASC/ASCQ of 0x44/0x00 (Aborted Command - internal target failure) to the device-mapper multipath target immediately, without any retries at the SCSI layer, because multipath uses BIO_RW_FAILFAST.

This causes multipath to mark the path as failed and perform a path group switch, retrying the I/O down a different path. The failed path will then usually be scheduled for a check on the next polling interval and reinstated assuming the condition on the target has been cleared (this sense buffer is commonly seen with transient errors on storage arrays).
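For readers unfamiliar with the sense values quoted above, the following is a minimal sketch of how the condition could be recognised in a fixed-format sense buffer. It assumes the standard fixed-format layout (response code 0x70 or 0x71 in byte 0, sense key in the low nibble of byte 2, ASC in byte 12, ASCQ in byte 13). The function name and macros are hypothetical and are not taken from the RHEL5 kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the report: sense key 0xB, ASC 0x44, ASCQ 0x00. */
#define SK_ABORTED_COMMAND        0x0b
#define ASC_INTERNAL_TARGET_FAIL  0x44
#define ASCQ_INTERNAL_TARGET_FAIL 0x00

/*
 * Hypothetical helper (not a kernel function): returns true when a
 * fixed-format sense buffer reports Aborted Command, internal target
 * failure. Descriptor-format sense data is not handled in this sketch.
 */
bool is_internal_target_failure(const uint8_t *sense, size_t len)
{
    uint8_t response_code;

    /* ASC/ASCQ live in bytes 12 and 13, so at least 14 bytes are needed. */
    if (sense == NULL || len < 14)
        return false;

    /* Bit 7 of byte 0 is the VALID bit; mask it off. */
    response_code = sense[0] & 0x7f;
    if (response_code != 0x70 && response_code != 0x71)
        return false;

    return (sense[2] & 0x0f) == SK_ABORTED_COMMAND &&
           sense[12] == ASC_INTERNAL_TARGET_FAIL &&
           sense[13] == ASCQ_INTERNAL_TARGET_FAIL;
}
```

A check of this kind only identifies the error; whether the command is retried or failed upwards is decided separately by the SCSI layer.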

Version-Release number of selected component (if applicable):
2.6.18-*.EL

How reproducible:
100% given the right storage behaviour, but tricky to trigger on demand

Steps to Reproduce:
1. Induce a situation on the storage controller that causes I/Os to fail with a sense buffer of Aborted Command - internal target failure. For example, this has been seen frequently with EMC Symmetrix SRDF LUNs, where the R1 sporadically returns these errors during brief changes in the SRDF link status.

2. Observe SCSI errors logged to dmesg

  
Actual results:
SCSI errors logged, multipath marks path as failed

Expected results:
No SCSI error logged unless the midlayer retry count or timeout is exceeded. Multipath does not mark the path as failed

Additional info:
This behaviour causes undesirable path switching similar to the transport/framing error cases we recently converted to DO_IMM_RETRY in qla2xxx (DID_TRANSPORT_DISRUPTED in RHEL5):

    bug 490744 [RHEL4]
    bug 244967 [RHEL5]

Comment 3 Mike Christie 2009-06-10 19:09:44 UTC
This should be fixed in RHEL5.3.

For ABORTED COMMAND (or any sense error, really), the SCSI layer should retry 5 times, or for up to 5 * cmd->timeout (the default timeout is 60 secs for R/W IO), just as it does when multipath is not used. After that, the SCSI layer will fail the IO upwards.
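The retry limits described above can be illustrated with a small sketch. This is not the kernel's implementation: the struct, its field names, and the function are hypothetical, and the sketch assumes the command is given up on as soon as either the retry count or the overall time budget is exhausted.

```c
#include <stdbool.h>

/* Hypothetical per-command state; field names are illustrative only. */
struct cmd_state {
    unsigned int retries;      /* retries already attempted */
    unsigned int allowed;      /* retry budget (5 in the comment above) */
    unsigned int timeout_secs; /* per-attempt timeout (60 for R/W IO) */
    unsigned int elapsed_secs; /* time spent on this command so far */
};

/* Retry until either the retry count or the time budget is used up. */
bool should_retry(const struct cmd_state *c)
{
    /* Overall time budget: allowed * timeout, e.g. 5 * 60 secs. */
    unsigned int budget_secs = c->allowed * c->timeout_secs;

    if (c->retries >= c->allowed)
        return false;
    if (c->elapsed_secs >= budget_secs)
        return false;
    return true;
}
```

With the defaults mentioned in the comment, the time budget works out to 5 * 60 = 300 seconds before the IO is failed upwards.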

At that point dm-multipath can fail the path. For the multipath fix we need the blk error codes from this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=504799.


RHEL4 will likely need a change to the SCSI layer so that it retries sense errors instead of failfasting them. I attached a patch for it in the other bz.

Comment 4 Bryn M. Reeves 2009-06-11 09:37:21 UTC
Duh, thanks Mike - sorry for the noise.

I'll close this one as a duplicate of 447586

*** This bug has been marked as a duplicate of bug 447586 ***