Description of problem:
The current RHEL4 scsi implementation will return I/Os that fail with a sense key of 0xB and ASC/ASCQ of 0x44/0x0 (Aborted Command - internal target failure) to the device-mapper multipath target immediately without any retries at the SCSI layer due to multipath's use of BIO_RW_FAILFAST. This causes multipath to mark the path as failed and perform a path group switch, retrying the I/O down a different path. The failed path will then usually be scheduled for a check on the next polling interval and reinstated assuming the condition on the target has been cleared (this sense buffer is commonly seen with transient errors on storage arrays).
Version-Release number of selected component (if applicable):
2.6.9-*.EL
How reproducible:
100% given the right storage behaviour but a bit tricky to generate on demand
Steps to Reproduce:
1. Induce a situation on the storage controller that will cause I/Os to be failed with a sense buffer of Aborted Command - internal target failure. E.g. this has been seen frequently with EMC Symmetrix SRDF LUNs where the R1 will sporadically spit these errors out when brief changes in the SRDF link status happen.
2. Observe SCSI errors logged to dmesg
Actual results:
SCSI errors logged, multipath marks path as failed
Expected results:
No SCSI error logged unless midlayer retry count / timeouts exceeded. Multipath does not mark path as failed
Additional info:
This behaviour creates similar undesirable path switching as the transport/framing error cases we recently converted to DO_IMM_RETRY in qla2xxx (DID_TRANSPORT_DISRUPTED in RHEL5):
bug 490744 [RHEL4]
bug 244967 [RHEL5]
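For anyone trying to spot this condition in a captured sense buffer, here is a minimal sketch of decoding the sense key and ASC/ASCQ described above. It assumes fixed-format sense data (response code 0x70 or 0x71), where the sense key is the low nibble of byte 2 and ASC/ASCQ are bytes 12 and 13. The struct and function names (`sense_triple`, `decode_fixed_sense`, `is_internal_target_failure`) are made up for illustration and are not kernel identifiers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the report: Aborted Command - internal target failure. */
#define SENSE_KEY_ABORTED_COMMAND    0x0B
#define ASC_INTERNAL_TARGET_FAILURE  0x44
#define ASCQ_INTERNAL_TARGET_FAILURE 0x00

/* Hypothetical helper type, not a kernel structure. */
struct sense_triple {
    uint8_t key;
    uint8_t asc;
    uint8_t ascq;
};

/*
 * Decode fixed-format sense data. Returns false if the buffer is too
 * short to hold ASC/ASCQ or is not in fixed format.
 */
static bool decode_fixed_sense(const uint8_t *buf, size_t len,
                               struct sense_triple *out)
{
    if (buf == NULL || out == NULL || len < 14)
        return false;

    uint8_t response_code = buf[0] & 0x7F;   /* strip the VALID bit */
    if (response_code != 0x70 && response_code != 0x71)
        return false;

    out->key  = buf[2] & 0x0F;   /* sense key: low nibble of byte 2 */
    out->asc  = buf[12];         /* additional sense code */
    out->ascq = buf[13];         /* additional sense code qualifier */
    return true;
}

/* True for the 0xB / 0x44 / 0x0 combination named in this report. */
static bool is_internal_target_failure(const struct sense_triple *s)
{
    return s->key == SENSE_KEY_ABORTED_COMMAND &&
           s->asc == ASC_INTERNAL_TARGET_FAILURE &&
           s->ascq == ASCQ_INTERNAL_TARGET_FAILURE;
}
```

Descriptor-format sense data (response codes 0x72/0x73) lays these fields out differently and is deliberately rejected by this sketch.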
Created attachment 347276 [details] don't failfast dev errors
Adding devel ack and patch. This fixes the problem of the SCSI layer fast-failing these errors, so we match upstream and RHEL5 behavior.
Thanks Mike. I'll get a test build of this done today & out for testing. Don't expect any problems but will make sure that this gets a run in the environment that's currently seeing these problems.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.8.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: A bug in the SCSI implementation caused "Aborted Command - internal target failure" errors to be sent to Device-Mapper Multipath, without retries, resulting in Device-Mapper Multipath marking the path as failed and making a path group switch. With this update, all errors that return a sense key in the SCSI mid layer (including "Aborted Command - internal target failure") are retried.
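To make the behaviour change in the technical note concrete, here is a rough model of the retry decision before and after the update. This is only a sketch of the behaviour as described in this bug, not the attached patch or actual mid-layer code. Every name in it (`io_error`, `decide`, `DISP_RETRY`, `DISP_FAIL_TO_MULTIPATH`) is hypothetical, and it assumes that failfast I/O failing without a sense key is still returned immediately after the update.

```c
#include <stdbool.h>

/* Hypothetical outcomes for a failed command; not kernel constants. */
enum disposition {
    DISP_RETRY,              /* retry in the SCSI mid layer */
    DISP_FAIL_TO_MULTIPATH   /* return the error to the upper layer */
};

/* Hypothetical summary of one failed I/O. */
struct io_error {
    bool failfast;        /* I/O was submitted with a failfast flag */
    bool has_sense_key;   /* device returned sense data with a sense key */
    int  retries_done;    /* retries already attempted */
    int  retries_allowed; /* mid-layer retry limit */
};

/*
 * Model of the decision. with_update == false mimics the reported
 * behaviour (failfast I/O is never retried); with_update == true mimics
 * the behaviour described in the technical note (errors that return a
 * sense key are retried even for failfast I/O).
 */
static enum disposition decide(const struct io_error *e, bool with_update)
{
    /* Once the retry budget is spent, the error always goes up. */
    if (e->retries_done >= e->retries_allowed)
        return DISP_FAIL_TO_MULTIPATH;

    if (e->failfast) {
        /* Assumption: only sense-key errors gain retries with the update. */
        if (with_update && e->has_sense_key)
            return DISP_RETRY;
        return DISP_FAIL_TO_MULTIPATH;
    }

    /* Non-failfast I/O is retried in both models. */
    return DISP_RETRY;
}
```

In this model, multipath only sees the error (and marks the path as failed) once the retry limit is exceeded, which matches the expected results given in the original description.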
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0263.html