Bug 619375

Summary: [NetApp 6.1 bug] SCSI ALUA handler fails to handle ALUA transitioning properly
Product: Red Hat Enterprise Linux 6
Reporter: Martin George <marting>
Component: kernel
Assignee: Mike Snitzer <msnitzer>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 6.0
CC: andriusb, bmarzins, coughlan, dhoward, msnitzer, xdl-redhat-bugzilla
Target Milestone: rc
Keywords: OtherQA
Target Release: 6.1
Hardware: All
OS: Linux
Doc Type: Bug Fix
Clone Of: 619361
Last Closed: 2010-10-13 13:20:11 UTC
Bug Depends On: 619361

Description Martin George 2010-07-29 12:24:20 UTC
+++ This bug was initially created as a clone of Bug #619361 +++

Description of problem:
The SCSI ALUA handler does not handle the ALUA transitioning state properly. For example, for an ALUA-enabled NetApp controller which supports implicit ALUA only (reporting the following valid states: port group 00 state A supports ToUsNA), the following code snippet is seen in alua_rtpg:

if (h->tpgs & TPGS_MODE_EXPLICIT) {
        switch (h->state) {
        case TPGS_STATE_TRANSITIONING:
                /* State transition, retry */
                goto retry;
                break;
        case TPGS_STATE_OFFLINE:
                /* Path is offline, fail */
                err = SCSI_DH_DEV_OFFLINED;
                break;
        default:
                break;
        }
} else {
        /* Only Implicit ALUA support */
        if (h->state == TPGS_STATE_OPTIMIZED ||
            h->state == TPGS_STATE_NONOPTIMIZED ||
            h->state == TPGS_STATE_STANDBY)
                /* Useable path if active */
                err = SCSI_DH_OK;
        else
                /* Path unuseable for unavailable/offline */
                err = SCSI_DH_DEV_OFFLINED;
}

During NetApp controller faults, the LUN is in the 'transitioning' state. But from the above code, it seems this state is handled for explicit ALUA only. For implicit ALUA it falls into the else branch and is reported as SCSI_DH_DEV_OFFLINED. Ideally, it should be handled for both.
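
As a rough illustration of what handling the state in both modes could look like, here is a standalone user-space sketch. It is not the kernel code and not a proposed patch. The enums are local stand-ins whose numeric values are assumptions, alua_state_to_err() is a hypothetical helper, SCSI_DH_RETRY is an assumed "try again" result standing in for the "goto retry" above, and the explicit-mode default is assumed to be SCSI_DH_OK because the snippet does not show the initial value of err.

```c
/*
 * Standalone sketch, not kernel code. The enums are local stand-ins so
 * the example compiles in user space; their numeric values are
 * assumptions made for illustration only.
 */
enum {
        TPGS_STATE_OPTIMIZED,
        TPGS_STATE_NONOPTIMIZED,
        TPGS_STATE_STANDBY,
        TPGS_STATE_UNAVAILABLE,
        TPGS_STATE_OFFLINE,
        TPGS_STATE_TRANSITIONING
};

enum { TPGS_MODE_IMPLICIT = 0x1, TPGS_MODE_EXPLICIT = 0x2 };

/* SCSI_DH_RETRY is an assumed result meaning "state is temporary, retry". */
enum { SCSI_DH_OK, SCSI_DH_RETRY, SCSI_DH_DEV_OFFLINED };

/* Hypothetical helper: map (TPGS mode, port group state) to a result. */
int alua_state_to_err(int tpgs, int state)
{
        /* Transitioning is temporary in both modes, so ask for a retry. */
        if (state == TPGS_STATE_TRANSITIONING)
                return SCSI_DH_RETRY;

        if (tpgs & TPGS_MODE_EXPLICIT) {
                /* Assumption: every state except offline maps to OK here. */
                return state == TPGS_STATE_OFFLINE ?
                        SCSI_DH_DEV_OFFLINED : SCSI_DH_OK;
        }

        /* Only implicit ALUA support */
        switch (state) {
        case TPGS_STATE_OPTIMIZED:
        case TPGS_STATE_NONOPTIMIZED:
        case TPGS_STATE_STANDBY:
                /* Usable path if active */
                return SCSI_DH_OK;
        default:
                /* Path unusable for unavailable/offline */
                return SCSI_DH_DEV_OFFLINED;
        }
}
```

With this shape, alua_state_to_err(TPGS_MODE_IMPLICIT, TPGS_STATE_TRANSITIONING) yields SCSI_DH_RETRY instead of SCSI_DH_DEV_OFFLINED, while the other implicit-mode results stay as in the snippet above.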

Secondly, in alua_prep_fn:

if (h->state != TPGS_STATE_OPTIMIZED && h->state != TPGS_STATE_NONOPTIMIZED) {
        ret = BLKPREP_KILL;
        req->flags |= REQ_QUIET;
}

Why is TPGS_STATE_TRANSITIONING not handled above? For this state, I suppose the prep_fn should be returning BLKPREP_DEFER.
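
To make the suggestion concrete, a standalone sketch of the decision the prep function might make is shown here. Again, this is user-space illustration only, not the kernel function: the enums are local stand-ins with assumed values, alua_prep_decision() is a hypothetical helper, and the REQ_QUIET flag handling is left out.

```c
/*
 * Standalone sketch, not kernel code. Enum values are local stand-ins
 * chosen for illustration and are assumptions, not kernel definitions.
 */
enum {
        TPGS_STATE_OPTIMIZED,
        TPGS_STATE_NONOPTIMIZED,
        TPGS_STATE_STANDBY,
        TPGS_STATE_UNAVAILABLE,
        TPGS_STATE_OFFLINE,
        TPGS_STATE_TRANSITIONING
};

enum { BLKPREP_OK, BLKPREP_KILL, BLKPREP_DEFER };

/* Hypothetical helper: choose a prep result from the port group state. */
int alua_prep_decision(int state)
{
        switch (state) {
        case TPGS_STATE_OPTIMIZED:
        case TPGS_STATE_NONOPTIMIZED:
                /* Active path: let the request through. */
                return BLKPREP_OK;
        case TPGS_STATE_TRANSITIONING:
                /* Temporary state: defer the request rather than fail it. */
                return BLKPREP_DEFER;
        default:
                /* Standby, unavailable, offline: fail as the snippet does. */
                return BLKPREP_KILL;
        }
}
```

The only behavioural difference from the snippet above is the transitioning case, which is deferred instead of killed.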

Because of these issues with the ALUA handler, we seem to have hit delayed dm-multipath I/O (on SCSI devices using the ALUA handler) as described in bug 606259.

Version-Release number of selected component (if applicable):
2.6.32-30.el6 (RHEL 6.0 Snap6)

Comment 2 RHEL Program Management 2010-07-29 12:47:41 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 Andrius Benokraitis 2010-07-29 21:02:34 UTC
This issue was cloned from RHEL 5.6 work, which is currently blocking NetApp's certification of RHEL 5.5. If this issue is also in RHEL 6.0, it will block NetApp's certification of 6.0 as well.

Comment 5 Andrius Benokraitis 2010-08-12 02:13:20 UTC
Tentatively deferring this to RHEL 6.1 based on where we are in the 6.0 dev cycle. Risk sounds high. This will still be a candidate for an early 6.0.z if a fix is made available.

Comment 8 Andrius Benokraitis 2010-09-13 14:28:06 UTC
Proposed for RHEL 6.0.z per NetApp's request and justification.

Comment 10 Andrius Benokraitis 2010-10-13 13:20:11 UTC

*** This bug has been marked as a duplicate of bug 636994 ***