Red Hat Bugzilla – Bug 505123
Make Aborted Command (internal target failure) retryable at SCSI layer (sense B 44 00)
Last modified: 2009-06-11 05:37:44 EDT
+++ This bug was initially created as a clone of Bug #505122 +++
Description of problem:
The current RHEL5 SCSI implementation does not retry I/Os that fail with a sense key of 0xB and ASC/ASCQ of 0x44/0x00 (Aborted Command - internal target failure). Because multipath uses BIO_RW_FAILFAST, the SCSI layer returns these I/Os to the device-mapper multipath target immediately.
This causes multipath to mark the path as failed and perform a path group switch, retrying the I/O down a different path. The failed path will then usually be scheduled for a check on the next polling interval and reinstated assuming the condition on the target has been cleared (this sense buffer is commonly seen with transient errors on storage arrays).
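For reference, the sense buffer in question can be recognised with a check like the one below. This is only a standalone sketch, not code from the RHEL kernel: the function name and macros are made up for illustration, and it assumes the standard SPC sense layouts (fixed format with response code 0x70/0x71, descriptor format with 0x72/0x73).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define KEY_ABORTED_COMMAND   0x0b
#define ASC_INTERNAL_TGT_FAIL 0x44
#define ASCQ_INTERNAL_TGT_FAIL 0x00

/*
 * Return 1 if the sense buffer reports Aborted Command / internal
 * target failure (B 44 00), 0 otherwise (including short or
 * unrecognised buffers).
 */
int is_internal_target_failure(const uint8_t *sense, size_t len)
{
    uint8_t key, asc, ascq;

    if (sense == NULL || len < 1)
        return 0;

    switch (sense[0] & 0x7f) {
    case 0x70:
    case 0x71:
        /* Fixed format: key in byte 2, ASC/ASCQ in bytes 12/13. */
        if (len < 14)
            return 0;
        key = sense[2] & 0x0f;
        asc = sense[12];
        ascq = sense[13];
        break;
    case 0x72:
    case 0x73:
        /* Descriptor format: key in byte 1, ASC/ASCQ in bytes 2/3. */
        if (len < 4)
            return 0;
        key = sense[1] & 0x0f;
        asc = sense[2];
        ascq = sense[3];
        break;
    default:
        return 0;
    }

    return key == KEY_ABORTED_COMMAND &&
           asc == ASC_INTERNAL_TGT_FAIL &&
           ascq == ASCQ_INTERNAL_TGT_FAIL;
}
```

Any other sense key or ASC/ASCQ pair, for example a Unit Attention, is not matched by this check.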
Version-Release number of selected component (if applicable):
How reproducible:
100% given the right storage behaviour, but a bit tricky to generate on demand
Steps to Reproduce:
1. Induce a situation on the storage controller that will cause I/Os to be failed with a sense buffer of Aborted Command - internal target failure. E.g. this has been seen frequently with EMC Symmetrix SRDF LUNs where the R1 will sporadically spit these errors out when brief changes in the SRDF link status happen.
2. Observe SCSI errors logged to dmesg
Actual results:
SCSI errors logged, multipath marks path as failed
Expected results:
No SCSI error logged unless midlayer retry count / timeouts exceeded. Multipath does not mark path as failed
Additional info:
This behaviour causes the same kind of undesirable path switching as the transport/framing error cases we recently converted to DO_IMM_RETRY in qla2xxx (DID_TRANSPORT_DISRUPTED in RHEL5):
bug 490744 [RHEL4]
bug 244967 [RHEL5]
This should be fixed in RHEL5.3.
For ABORTED COMMAND (or any sense error really), the SCSI layer should retry 5 times, or keep retrying for up to 5 * cmd->timeout (default timeout is 60 secs for R/W IO), as is done when multipath is not used. After that the SCSI layer will fail the IO upwards.
At that time dm-multipath can fail the path. For the multipath fix we need the blk error codes in here
RHEL4 should need a change to the scsi layer so it retries sense errors instead of failfasting them. I attached a patch for it in the other bz.
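The retry budget described above for ABORTED COMMAND can be sketched roughly as follows. This is a simplified standalone model, not the midlayer code: the struct, enum, and function names are hypothetical, and combining the two limits (retry count and 5 * timeout) into one check is an assumption about how they interact. The honor_failfast flag models the reported behaviour (failfast I/O returned immediately) against the desired one (sense errors retried regardless of failfast).

```c
#include <stdbool.h>

/* Hypothetical model types, for illustration only. */
enum io_action { IO_RETRY, IO_FAIL_UP };

struct cmd_state {
    bool failfast;          /* I/O was submitted with failfast set */
    unsigned retries_done;  /* retries already attempted */
    unsigned allowed;       /* retry limit, e.g. 5 */
    unsigned elapsed_secs;  /* time spent on this command so far */
    unsigned timeout_secs;  /* per-command timeout, e.g. 60 */
};

/*
 * Decide what to do after a command fails with a sense error.
 * With honor_failfast set, failfast I/O is failed upwards at once,
 * which is what leads multipath to fail the path. With it clear,
 * the command is retried until either limit is exhausted.
 */
enum io_action on_sense_error(const struct cmd_state *c, bool honor_failfast)
{
    if (honor_failfast && c->failfast)
        return IO_FAIL_UP;

    /* Retry count exhausted. */
    if (c->retries_done >= c->allowed)
        return IO_FAIL_UP;

    /* Time budget of allowed * timeout exhausted. */
    if (c->elapsed_secs >= c->allowed * c->timeout_secs)
        return IO_FAIL_UP;

    return IO_RETRY;
}
```

With allowed = 5 and timeout_secs = 60, a command in this model is failed upwards after five retries or 300 seconds, whichever comes first, and only then would dm-multipath see the error.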
Duh, thanks Mike - sorry for the noise.
I'll close this one as a duplicate of 447586
*** This bug has been marked as a duplicate of bug 447586 ***