505123 – Make Aborted Command (internal target failure) retryable at SCSI layer (sense B 44 00)

Keywords:
Status:	CLOSED DUPLICATE of bug 447586
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Mike Christie
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-06-10 18:09 UTC by Bryn M. Reeves
Modified:	2009-06-11 09:37 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	505122
Environment:
Last Closed:	2009-06-11 09:37:21 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Bryn M. Reeves 2009-06-10 18:09:18 UTC

+++ This bug was initially created as a clone of Bug #505122 +++

Description of problem:
The current RHEL5 SCSI implementation returns I/Os that fail with a sense key of 0xB and ASC/ASCQ of 0x44/0x0 (Aborted Command - internal target failure) to the device-mapper multipath target immediately, without any retries at the SCSI layer, because multipath uses BIO_RW_FAILFAST.

This causes multipath to mark the path as failed and perform a path group switch, retrying the I/O down a different path. The failed path is then usually scheduled for a check at the next polling interval and reinstated, assuming the condition on the target has cleared (this sense buffer is commonly seen with transient errors on storage arrays).
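
For reference, the sense data named in this report (sense key 0xB, ASC 0x44, ASCQ 0x00) can be recognised in a raw sense buffer roughly as follows. This is a minimal userspace sketch, not kernel code: the names decode_sense, is_internal_target_failure and struct sense_triple are hypothetical and chosen only for illustration. It assumes the standard SPC layouts, where fixed-format sense data (response codes 0x70/0x71) carries the key in byte 2 and ASC/ASCQ in bytes 12/13, and descriptor-format sense data (0x72/0x73) carries them in bytes 1, 2 and 3.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the bug title: sense B 44 00. */
#define SENSE_KEY_ABORTED_COMMAND    0x0B
#define ASC_INTERNAL_TARGET_FAILURE  0x44
#define ASCQ_INTERNAL_TARGET_FAILURE 0x00

/* Hypothetical container for the three values of interest. */
struct sense_triple {
    uint8_t key;
    uint8_t asc;
    uint8_t ascq;
};

/*
 * Hypothetical helper: pull sense key, ASC and ASCQ out of a raw sense
 * buffer. Returns false if the buffer is too short or the response code
 * is not one of the four standard sense formats.
 */
static bool decode_sense(const uint8_t *buf, size_t len,
                         struct sense_triple *out)
{
    uint8_t response_code;

    if (buf == NULL || out == NULL || len < 1)
        return false;

    response_code = buf[0] & 0x7F;   /* bit 7 is the VALID bit */

    switch (response_code) {
    case 0x70:                       /* fixed format, current error */
    case 0x71:                       /* fixed format, deferred error */
        if (len < 14)
            return false;
        out->key  = buf[2] & 0x0F;
        out->asc  = buf[12];
        out->ascq = buf[13];
        return true;
    case 0x72:                       /* descriptor format, current error */
    case 0x73:                       /* descriptor format, deferred error */
        if (len < 4)
            return false;
        out->key  = buf[1] & 0x0F;
        out->asc  = buf[2];
        out->ascq = buf[3];
        return true;
    default:
        return false;
    }
}

/* True only for Aborted Command - internal target failure (B/44/00). */
static bool is_internal_target_failure(const uint8_t *buf, size_t len)
{
    struct sense_triple s;

    if (!decode_sense(buf, len, &s))
        return false;

    return s.key  == SENSE_KEY_ABORTED_COMMAND &&
           s.asc  == ASC_INTERNAL_TARGET_FAILURE &&
           s.ascq == ASCQ_INTERNAL_TARGET_FAILURE;
}
```

A check like this only identifies the condition. Whether the I/O is then retried or failed upwards is a separate policy decision, which is what this bug is about.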

Version-Release number of selected component (if applicable):
2.6.18-*.EL

How reproducible:
100% given the right storage behaviour, but a bit tricky to generate on demand.

Steps to Reproduce:
1. Induce a situation on the storage controller that causes I/Os to be failed with a sense buffer of Aborted Command - internal target failure. For example, this has been seen frequently with EMC Symmetrix SRDF LUNs, where the R1 sporadically returns these errors when the SRDF link status changes briefly.

2. Observe SCSI errors logged to dmesg

  
Actual results:
SCSI errors logged, multipath marks path as failed

Expected results:
No SCSI error logged unless midlayer retry count / timeouts exceeded. Multipath does not mark path as failed

Additional info:
This behaviour causes the same kind of undesirable path switching as the transport/framing error cases we recently converted to DO_IMM_RETRY in qla2xxx (DID_TRANSPORT_DISRUPTED in RHEL5):

    bug 490744 [RHEL4]
    bug 244967 [RHEL5]

Comment 3 Mike Christie 2009-06-10 19:09:44 UTC

This should be fixed in RHEL5.3.

For ABORTED COMMAND (or any sense error, really), the SCSI layer should retry 5 times, or for up to 5 * cmd->timeout (the default timeout is 60 secs for R/W I/O), as it does when multipath is not used. After that, the SCSI layer will fail the I/O upwards.

At that point dm-multipath can fail the path. For the multipath fix we need the blk error codes in here:
https://bugzilla.redhat.com/show_bug.cgi?id=504799.
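
The retry budget described in this comment (5 retries, or up to 5 * cmd->timeout) can be illustrated with a toy model. This is only a sketch of the policy as worded here, not the actual midlayer code: struct toy_cmd, its fields, and toy_sense_error_disposition are hypothetical names. The failfast flag models the behaviour reported in the description, where the I/O is handed back without any retry.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a SCSI command's retry state. */
struct toy_cmd {
    unsigned int retries;       /* retries already performed */
    unsigned int allowed;       /* retry limit, 5 in the comment above */
    unsigned int timeout_secs;  /* per-command timeout, 60 for R/W I/O */
    unsigned int elapsed_secs;  /* total time spent on this command so far */
    bool failfast;              /* models the failfast case from the report */
};

enum toy_disposition {
    TOY_RETRY,      /* requeue the command at the SCSI layer */
    TOY_FAIL_UP     /* fail the I/O upwards, e.g. to dm-multipath */
};

/*
 * Decide what to do with a command that completed with a sense error.
 * Retries continue until either the retry count or the overall time
 * budget (allowed * timeout_secs) is used up.
 */
static enum toy_disposition
toy_sense_error_disposition(const struct toy_cmd *cmd)
{
    /* Reported behaviour: failfast I/O is returned without any retry. */
    if (cmd->failfast)
        return TOY_FAIL_UP;

    /* Retry count exhausted. */
    if (cmd->retries >= cmd->allowed)
        return TOY_FAIL_UP;

    /* Overall time budget exhausted. */
    if (cmd->elapsed_secs >= cmd->allowed * cmd->timeout_secs)
        return TOY_FAIL_UP;

    return TOY_RETRY;
}
```

With allowed = 5 and timeout_secs = 60, this model keeps retrying until five retries have been made or 300 seconds have elapsed, whichever comes first, and only then fails the I/O upwards.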


RHEL4 would need a change to the SCSI layer so that it retries sense errors instead of failfasting them. I attached a patch for it in the other bz.

Comment 4 Bryn M. Reeves 2009-06-11 09:37:21 UTC

Duh, thanks Mike - sorry for the noise.

I'll close this one as a duplicate of 447586

*** This bug has been marked as a duplicate of bug 447586 ***
