Bug 619361 - [NetApp 5.6 bug] SCSI ALUA handler fails to handle ALUA transitioning properly
Summary: [NetApp 5.6 bug] SCSI ALUA handler fails to handle ALUA transitioning properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 5.6
Assignee: Mike Snitzer
QA Contact: Storage QE
URL:
Whiteboard:
Duplicates: 606259
Depends On:
Blocks: 557597 619375 657028
 
Reported: 2010-07-29 11:49 UTC by Martin George
Modified: 2018-11-14 15:58 UTC (History)
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 619375 636994
Environment:
Last Closed: 2011-01-13 21:46:04 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Martin George 2010-07-29 11:49:45 UTC
Description of problem:
The SCSI ALUA handler does not handle ALUA transitioning states properly. For example, on an ALUA-enabled NetApp controller that supports implicit ALUA alone (RTPG reports the following valid states: "port group 00 state A supports ToUsNA"), the following code is seen in alua_rtpg():

if (h->tpgs & TPGS_MODE_EXPLICIT) {
        switch (h->state) {
        case TPGS_STATE_TRANSITIONING:
                /* State transition, retry */
                goto retry;
                break;
        case TPGS_STATE_OFFLINE:
                /* Path is offline, fail */
                err = SCSI_DH_DEV_OFFLINED;
                break;
        default:
                break;
        }
} else {
        /* Only Implicit ALUA support */
        if (h->state == TPGS_STATE_OPTIMIZED ||
            h->state == TPGS_STATE_NONOPTIMIZED ||
            h->state == TPGS_STATE_STANDBY)
                /* Useable path if active */
                err = SCSI_DH_OK;
        else
                /* Path unuseable for unavailable/offline */
                err = SCSI_DH_DEV_OFFLINED;
}

During NetApp controller faults, the LUN is in the 'transitioning' state. But from the above code, this state is handled for explicit ALUA alone, not for implicit ALUA. Ideally it should be handled for both.

Secondly, in the alua_prep_fn:

if (h->state != TPGS_STATE_OPTIMIZED && h->state != TPGS_STATE_NONOPTIMIZED) {
        ret = BLKPREP_KILL;
        req->flags |= REQ_QUIET;
}

Why is TPGS_STATE_TRANSITIONING not handled above? For this state, I suppose the prep_fn should be returning BLKPREP_DEFER.

Because of these issues with the ALUA handler, we seem to have hit delayed dm-multipath IO (on SCSI devices using the ALUA handler) as described in bug 606259. 

Version-Release number of selected component (if applicable):
kernel-2.6.18-194.el5 (RHEL 5.5)

Comment 1 Mike Snitzer 2010-07-29 21:29:19 UTC
The "log messages" attached to bug#606259 only ever show alua_rtpg() logging of the form: port group 00 state A supports ToUsNA

These "ToUsNA" flags map to the following supported states:
TPGS_SUPPORT_TRANSITION
TPGS_SUPPORT_UNAVAILABLE
TPGS_SUPPORT_NONOPTIMIZED
TPGS_SUPPORT_OPTIMIZED

comment#0 shows the block of code that handles these states for explicit and implicit alua.

I agree that alua_rtpg() clearly lacks implicit alua support for TPGS_STATE_TRANSITIONING (which this NetApp LUN clearly needs given TPGS_SUPPORT_TRANSITION).

But it strikes me as odd that we don't see something like the following in the messages file (from bug#606259) when all the controller faults occur:
port group 00 state T supports ToUsNA

So does alua_rtpg() ever actually get h->state == TPGS_STATE_TRANSITIONING for this implicit alua LUN?

Anyway, ignoring my concern about alua_rtpg() possibly never seeing TPGS_STATE_TRANSITIONING for a moment, something like the following may suffice (this will need Mike Christie's feedback):

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index a78aaa6..9e116de 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -610,6 +610,9 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_dh_data *h)
                    h->state == TPGS_STATE_STANDBY)
                        /* Useable path if active */
                        err = SCSI_DH_OK;
+               else if (h->state == TPGS_STATE_TRANSITIONING)
+                       /* State transition, retry */
+                       goto retry;
                else
                        /* Path unuseable for unavailable/offline */
                        err = SCSI_DH_DEV_OFFLINED;
@@ -686,8 +689,10 @@ static int alua_prep_fn(struct scsi_device *sdev, struct request *req)
        struct alua_dh_data *h = get_alua_data(sdev);
        int ret = BLKPREP_OK;
 
-       if (h->state != TPGS_STATE_OPTIMIZED &&
-           h->state != TPGS_STATE_NONOPTIMIZED) {
+       if (h->state == TPGS_STATE_TRANSITIONING)
+               ret = BLKPREP_DEFER;
+       else if (h->state != TPGS_STATE_OPTIMIZED &&
+                h->state != TPGS_STATE_NONOPTIMIZED) {
                ret = BLKPREP_KILL;
                req->flags |= REQ_QUIET;
        }

Comment 2 Mike Snitzer 2010-07-29 21:30:47 UTC
Mike, could you please review comment#1, thanks.

Comment 3 Mike Christie 2010-07-30 00:50:47 UTC
Patch looks good to me. I do not know why transitioning was not handled in the prep_fn function before, but BLKPREP_DEFER makes sense to me.

Comment 4 Mike Snitzer 2010-07-30 15:48:47 UTC
Hi Martin,

Here is a summary of outstanding questions that we have:

1) Do you ever see the scsi_dh_alua handler process TPGS_STATE_TRANSITIONING?
   - something like the following in the kernel log:
     port group 00 state T supports ToUsNA

2) What is the maximum time that a NetApp LUN can be in the transitioning state when ALUA is used?
   - is this highly dependent on the amount of IO in the controller cache?
   - seems the kernel is doing the right thing of continuing to retry:
     https://bugzilla.redhat.com/show_bug.cgi?id=559586#c9
   - but that the NetApp LUN stays in the transitioning state beyond 360 seconds:
     https://bugzilla.redhat.com/show_bug.cgi?id=606259#c57

3) Is there an alternative, NetApp supported, configuration if ALUA is disabled (in both the NetApp controller and linux/device-mapper-multipath)?
   - would this alternative resolve the delayed IO behavior seen in the host (Linux) or would the IO delays persist?

Comment 5 Martin George 2010-08-02 15:02:19 UTC
(In reply to comment #4)
> Hi Martin,
> 
> Here is a summary of outstanding questions that we have:
> 
> 1) Do you ever see the scsi_dh_alua handler process TPGS_STATE_TRANSITIONING?
>    - something like the following in the kernel log:
>      port group 00 state T supports ToUsNA

Yes. We see this message when the ALUA handler is in use for NetApp LUNs. And NetApp supports implicit ALUA alone with the following valid states - TRANSITION, UNAVAILABLE, NONOPTIMIZED & OPTIMIZED.

> 
> 2) What is the maximum time that a NetApp LUN can be in the transitioning state
> when ALUA is used?

This should not exceed 120 seconds.

>    - is this highly dependent on the amount of IO in the controller cache?

This is actually dependent on the controller config. If you have several aggregates, volumes, snapshots, etc., on the controllers, the NetApp LUN 'TRANSITIONING' time would be higher during cf takeovers/givebacks.

>    - seems the kernel is doing the right thing of continuing to retry:
>      https://bugzilla.redhat.com/show_bug.cgi?id=559586#c9

Yes. We want the kernel to retry until the 'TRANSITION' is complete, which is why we chose the ALUA handler.

>    - but that the NetApp LUN stays in the transitioning state beyond 360
> seconds:
>      https://bugzilla.redhat.com/show_bug.cgi?id=606259#c57

Hmm, let me look into that.

> 
> 3) Is there an alternative, NetApp supported, configuration if ALUA is disabled
> (in both the NetApp controller and linux/device-mapper-multipath)?

Yes, you can use non-ALUA configs as well. For this, disable ALUA on the corresponding igroup on the NetApp controller and then use mpath_prio_ontap instead of mpath_prio_alua in the host multipath.conf.

>    - would this alternative resolve the delayed IO behavior seen in the host
> (Linux) or would the IO delays persist?    

Yes, it would resolve the delayed IO behavior since the delayed IO is seen on ALUA setups alone.
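For context, a minimal sketch of the non-ALUA device stanza Martin describes for RHEL 5 multipath.conf is below. This is illustrative only (consult the NetApp host utilities documentation for supported settings); the prio_callout path is the stock RHEL 5 location, and the surrounding options are typical rather than mandated:

```
device {
        vendor                  "NETAPP"
        product                 "LUN"
        # non-ALUA: ONTAP-proprietary priorities instead of mpath_prio_alua
        prio_callout            "/sbin/mpath_prio_ontap /dev/%n"
        path_grouping_policy    group_by_prio
        failback                immediate
}
```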

Comment 6 Mike Snitzer 2010-08-06 02:07:05 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Hi Martin,
> > 
> > Here is a summary of outstanding questions that we have:
> > 
> > 1) Do you ever see the scsi_dh_alua handler process TPGS_STATE_TRANSITIONING?
> >    - something like the following in the kernel log:
> >      port group 00 state T supports ToUsNA
> 
> Yes. We see this message when the ALUA handler is in use for NetApp LUNs. And
> NetApp supports implicit ALUA alone with the following valid states -
> TRANSITION, UNAVAILABLE, NONOPTIMIZED & OPTIMIZED.

I can confirm that I have seen instances of the following too:
scsi 1:0:3:0: alua: port group 01 state T supports ToUsNA

This means the alua_rtpg() hunk from the patch in comment#1 is beneficial.

But I have yet to see proof (from my debug kernel's scsi_dh_alua tracing) that the 2nd hunk, which changes alua_prep_fn, from the patch in comment#1 helps.  My debugging would print "alua_prep_fn: TPGS_STATE_TRANSITIONING" if that path was taken.

(and I've been doing a lot of takeover/giveback testing under dt load).

> > 2) What is the maximum time that a NetApp LUN can be in the transitioning state
> > when ALUA is used?
> 
> This should not exceed 120 seconds.
> 
> >    - is this highly dependent on the amount of IO in the controller cache?
> 
> This is actually dependent on the controller config. If you have several
> aggregates, volumes, snapshots, etc., on the controllers, the NetApp LUN
> 'TRANSITIONING' time would be higher during cf takeovers/givebacks.

Is there anything of note about your backend LUN config that we should look to replicate in our config related to the above?  Meaning: do you have many snapshots, several aggregates, etc?

Comment 7 Martin George 2010-08-06 12:48:42 UTC
(In reply to comment #6)
> 
> Is there anything of note about your backend LUN config that we should look to
> replicate in our config related to the above?  Meaning: do you have many
> snapshots, several aggregates, etc?

No. You can ignore snapshots, aggregates, etc. Just stick to the config mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=606259#c52

Comment 8 Martin George 2010-08-09 19:45:55 UTC
(In reply to comment #6)
> 
> But I have yet to see proof (from my debug kernel's scsi_dh_alua tracing) that
> the 2nd hunk, which changes alua_prep_fn, from the patch in comment#1 helps. 
> My debugging would print "alua_prep_fn: TPGS_STATE_TRANSITIONING" if that path
> was taken.
> 

I'm seeing similar behavior as well on our setup here. Only "alua_rtpg: trying submit_rtpg" messages are visible, but not the "alua_prep_fn: TPGS_STATE_TRANSITIONING" messages.

Comment 9 Martin George 2010-08-11 07:55:26 UTC
And now I have hit something worse. To avoid hitting bug 599487 on Emulex hosts, I set the Emulex heartbeat parameter 'lpfc_enable_hba_heartbeat' to 0, as recommended in that bug.

And the host panicked during controller takeovers/givebacks - this is with the alua debug kernel containing lpfc driver v8.2.0.63.3p:

Kernel BUG at drivers/scsi/lpfc/lpfc_scsi.c:2206
invalid opcode: 0000 [1] SMP 
last sysfs file: /block/dm-14/dev
CPU 3 
Modules linked in: nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg floppy tg3 pcspkr i2c_i801 i2c_core e752x_edac edac_mc ide_cd serio_raw cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp lpfc scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 438, comm: scsi_eh_0 Not tainted 2.6.18-194.11.1.el5.alua_dbg #1
RIP: 0010:[<ffffffff880ff793>]  [<ffffffff880ff793>] :lpfc:lpfc_abort_handler+0x58/0x33d
RSP: 0018:ffff81007e705dd0  EFLAGS: 00010246
RAX: ffff81003531a680 RBX: ffff81003531a680 RCX: ffff81007e705e90
RDX: ffff81007e705e90 RSI: ffff81003531a698 RDI: ffff81007e6eb050
RBP: ffff81007e624000 R08: ffff81007e704000 R09: 000000000000003c
R10: ffff810002390a90 R11: ffffffff880ff73b R12: 0000000000000000
R13: 0000000000000282 R14: ffff81007e401b58 R15: ffffffff800a07c0
FS:  0000000000000000(0000) GS:ffff8100026ca6c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b081fcca1d0 CR3: 0000000013601000 CR4: 00000000000006e0
Process scsi_eh_0 (pid: 438, threadinfo ffff81007e704000, task ffff81007f949100)
Stack:  ffff81003531a680 ffff81007e6eb000 ffff81007e6eb4f8 000020023b9aca00
 ffff810000000000 0000000300000001 ffff81007e705e00 ffff81007e705e00
 0000958c9102002a 0000000000000018 ffff810000000001 ffff81007e705e28
Call Trace:
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff880791a4>] :scsi_mod:scsi_error_handler+0x290/0x4ac
 [<ffffffff88078f14>] :scsi_mod:scsi_error_handler+0x0/0x4ac
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07c0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003277d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 0f 0b 68 dd e0 11 88 c2 9e 08 4d 8b 7c 24 10 4c 3b 3c 24 0f 
RIP  [<ffffffff880ff793>] :lpfc:lpfc_abort_handler+0x58/0x33d
 RSP <ffff81007e705dd0>
 <0>Kernel panic - not syncing: Fatal exception

Comment 10 Martin George 2010-08-11 08:21:30 UTC
I'll now try with the lpfc patch mentioned in the same bug 599487 - hopefully that should resolve the panic. 

Meanwhile tests on the QLogic host are running fine so far - I have not hit any delayed IO on it yet. And from the /var/log/messages, I see both hunks of the patch being executed:

# cat /var/log/messages|grep ToUsNA
Aug 10 16:41:03 IBMx336-200-134 kernel: scsi 0:0:0:0: alua: port group 01 state N supports ToUsNA
Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:11: alua: port group 01 state N supports ToUsNA
Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:12: alua: port group 01 state N supports ToUsNA
Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:13: alua: port group 01 state N supports ToUsNA
....

# cat /var/log/messages|grep alua_rtpg
Aug 10 16:41:03 IBMx336-200-134 kernel: scsi 0:0:0:0: alua_rtpg: trying submit_rtpg
Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:11: alua_rtpg: trying submit_rtpg
Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:12: alua_rtpg: trying submit_rtpg
Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:14: alua_rtpg: trying submit_rtpg
....

# cat /var/log/messages|grep alua_prep
Aug 10 22:02:45 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn: TPGS_STATE_TRANSITIONING
Aug 10 22:02:45 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn: TPGS_STATE_TRANSITIONING
Aug 10 22:02:46 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn: TPGS_STATE_TRANSITIONING
Aug 10 22:02:46 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn: TPGS_STATE_TRANSITIONING
....

But I noticed from the logs that the alua_prep_fn messages shown above appear for sd 0:0:0:1 alone.

Comment 11 Mike Snitzer 2010-08-11 14:14:16 UTC
(In reply to comment #10)
> I'll now try with the lpfc patch mentioned in the same bug 599487 - hopefully
> that should resolve the panic. 
> 
> Meanwhile tests on the QLogic host are running fine so far - I have not hit any
> delayed IO on it yet. And from the /var/log/messages, I see both hunks of the
> patch being executed:

OK, I'll be posting the patch upstream as well as prep'ing a patch for 5.6.

> # cat /var/log/messages|grep ToUsNA
> Aug 10 16:41:03 IBMx336-200-134 kernel: scsi 0:0:0:0: alua: port group 01 state
> N supports ToUsNA
> Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:11: alua: port group 01
> state N supports ToUsNA
> Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:12: alua: port group 01
> state N supports ToUsNA
> Aug 10 16:41:04 IBMx336-200-134 kernel: scsi 0:0:0:13: alua: port group 01
> state N supports ToUsNA
> ....

OK, but to be clear, the "N" variety was always possible.  The new code I added introduces messages with "T" like:
scsi 1:0:3:0: alua: port group 01 state T supports ToUsNA

> # cat /var/log/messages|grep alua_prep
> Aug 10 22:02:45 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn:
> TPGS_STATE_TRANSITIONING
> Aug 10 22:02:45 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn:
> TPGS_STATE_TRANSITIONING
> Aug 10 22:02:46 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn:
> TPGS_STATE_TRANSITIONING
> Aug 10 22:02:46 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn:
> TPGS_STATE_TRANSITIONING
> ....
> 
> But I noticed that the alua_prep_fn messages shown above is called for sd
> 0:0:0:1 alone from the logs.    

OK, it's not clear to me why TPGS_STATE_TRANSITIONING would be confined to that one device.

Comment 12 Mike Snitzer 2010-08-11 14:23:50 UTC
(In reply to comment #9)
> And now I have hit something worse. To avoid hitting bug 599487 on Emulex
> hosts, I turned off the Emulex heartbeat parameter 'lpfc_enable_hba_heartbeat'
> to 0 as recommended in this bug.
> 
> And the host paniced during controller takeover/givebacks - this is with the
> alua debug kernel containing lpfc driver v8.2.0.63.3p:

I'm running that same kernel (2.6.18-194.11.1.el5.alua_dbg) without problems during the cf takeover/giveback test on a host with lpfc (0:8.2.0.63.3p).  I guess I just haven't been lucky enough to hit bug 599487 -- that said I haven't disabled 'lpfc_enable_hba_heartbeat' either.

Comment 13 Martin George 2010-08-11 17:47:42 UTC
(In reply to comment #11)
> 
> OK, but to be clear, the "N" variety was always possible.  The new code I added
> introduces messages with "T" like:
> scsi 1:0:3:0: alua: port group 01 state T supports ToUsNA
> 

Yes, I see that as well:

# cat /var/log/messages|grep "state T"
Aug 10 19:59:27 IBMx336-200-134 kernel: sd 0:0:0:10: alua: port group 01 state T supports ToUsNA
Aug 10 22:02:45 IBMx336-200-134 kernel: sd 0:0:0:1: alua: port group 01 state T supports ToUsNA
Aug 10 22:22:38 IBMx336-200-134 kernel: sd 1:0:1:25: alua: port group 03 state T supports ToUsNA
....

Comment 14 RHEL Program Management 2010-08-23 20:49:51 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 Mike Snitzer 2010-09-17 20:06:02 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > Meanwhile tests on the QLogic host are running fine so far - I have not hit any
> > delayed IO on it yet. And from the /var/log/messages, I see both hunks of the
> > patch being executed:
> 
> OK, I'll be posting the patch upstream as well as prep'ing a patch for 5.6.

In response to having posted the patch upstream (to linux-scsi), Hannes Reinecke had the following insight:
http://www.spinics.net/lists/linux-scsi/msg46193.html

Hannes' first critique is actually what was intended by the patch:

"The path is retried indefinitely. Arrays are _supposed_ to be in 'transitioning'
only temporary; however, if the array is stuck due to a fw error we're stuck in 'defer', too."

> > # cat /var/log/messages|grep alua_prep
> > Aug 10 22:02:45 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn:
> > TPGS_STATE_TRANSITIONING
> > Aug 10 22:02:45 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn:
> > TPGS_STATE_TRANSITIONING
> > Aug 10 22:02:46 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn:
> > TPGS_STATE_TRANSITIONING
> > Aug 10 22:02:46 IBMx336-200-134 kernel: sd 0:0:0:1: alua_prep_fn:
> > TPGS_STATE_TRANSITIONING
> > ....
> > 
> > But I noticed that the alua_prep_fn messages shown above is called for sd
> > 0:0:0:1 alone from the logs.    
> 
> OK, its not clear to me why TPGS_STATE_TRANSITIONING would be confined to that
> one device.

But Hannes' second point of critique may help explain the behaviour Martin saw:
"Secondly this path fails with 'directio' multipath checker. Remember that 'directio'
is using 'fs' requests, not block-pc ones. Hence for all I/O the prep_fn() callback
is evaluated, which will return 'DEFER' here once the path is in transitioning.
And the state is never updated as RTPG is never called."

So I think the 2nd hunk of the patch (which modifies alua_prep_fn) needs to be dropped.

I'll ping Mike Christie to see what he thinks.

Comment 17 Mike Snitzer 2010-09-23 21:48:54 UTC
Just posted v3, which is the result of upstream review on linux-scsi:
http://www.spinics.net/lists/linux-scsi/msg46988.html

Comment 19 Jarod Wilson 2010-09-27 19:12:01 UTC
in kernel-2.6.18-225.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 23 Martin George 2010-10-11 14:25:43 UTC
(In reply to comment #16)
> But Hannes' second point of critique may help explain the behaviour Martin saw:
> "Secondly this path fails with 'directio' multipath checker. Remember that
> 'directio'
> is using 'fs' requests, not block-pc ones. Hence for all I/O the prep_fn()
> callback
> is evaluated, which will return 'DEFER' here once the path is in transitioning.
> And the state is never updated as RTPG is never called."
> 
> So I think the 2nd hunk of the patch (which modifies alua_prep_fn) needs to be
> dropped.
> 
> I'll ping Mike Christie to see what he thinks.

So what's the final take on this? Is there a problem using the directio checker with the ALUA handler? I.e., should we switch to some other path checker, like tur or readsector0, when using the ALUA handler?

Comment 24 Mike Snitzer 2010-10-11 18:55:41 UTC
(In reply to comment #23)
> (In reply to comment #16)
> > But Hannes' second point of critique may help explain the behaviour Martin saw:
> > "Secondly this path fails with 'directio' multipath checker. Remember that
> > 'directio'
> > is using 'fs' requests, not block-pc ones. Hence for all I/O the prep_fn()
> > callback
> > is evaluated, which will return 'DEFER' here once the path is in transitioning.
> > And the state is never updated as RTPG is never called."
> > 
> > So I think the 2nd hunk of the patch (which modifies alua_prep_fn) needs to be
> > dropped.
> > 
> > I'll ping Mike Christie to see what he thinks.
> 
> So what's the final take on this? Is there a problem using the directio checker
> with the ALUA handler? i.e. should we switch to some other path checkers like
> tur or readsector0 when using the ALUA handler?

The above concern with directio path checker was specific to the patch that was being discussed upstream.  This concern with directio has been resolved with the 5.6 fix (which is the equivalent of the upstream fix).  So I'm not aware of any reason why directio should be avoided for ALUA.

Though Mike Christie did have some concern that directio could cause unnecessary transitions here:
https://bugzilla.redhat.com/show_bug.cgi?id=606259#c69

Setting needinfo to get Mike Christie's (or Ben Marzinski's) thoughts on tur vs directio w/ ALUA.

Comment 25 Mike Snitzer 2010-10-11 19:09:52 UTC
(In reply to comment #24)
> Setting needinfo to get Mike Christie's (or Ben Marzinski's) thoughts on tur vs
> directio w/ ALUA.

Ben actually responded to Mike Christie's question with this:
https://bugzilla.redhat.com/show_bug.cgi?id=606259#c75

So Martin, RHEL5 kernel >= 2.6.18-225.el5 has the fix, and Jarod provided a link to download this kernel in comment#19

Have you tried this kernel with directio?  Do you have a specific concern or was your question just a continuation of previous concern raised in bz#606259 ?

Comment 26 Mike Christie 2010-10-11 22:33:17 UTC
I think directio would only be a problem if multipathd was testing all paths. If it is only testing paths that are down we should be ok.

Comment 27 Martin George 2010-10-12 11:40:58 UTC
(In reply to comment #25)
> (In reply to comment #24)
> 
> So Martin, RHEL5 kernel >= 2.6.18-225.el5 has the fix, and Jarod provided a
> link to download this kernel in comment#19
> 
> Have you tried this kernel with directio?  

No, I have not yet tried this.

> Do you have a specific concern or was your question just a continuation of previous concern raised in bz#606259 ?

My query was in the context of both - the directio concerns raised during the upstream discussion and the concerns raised by Ben and Mike Christie in bug 606259. Seeing these discussions, one does get the impression that you may run into problems when using directio - something that may be avoided with other checkers like tur. I am just looking for confirmation of this from Red Hat.

Comment 28 Mike Snitzer 2010-10-12 13:22:00 UTC
(In reply to comment #27)
> (In reply to comment #25)
> > Do you have a specific concern or was your question just a continuation of previous concern raised in bz#606259 ?
> 
> My query was in context of both - the directio concerns raised during the
> upstream discussion

That discussion is independent of any code that has ever shipped in RHEL or upstream.  It was a problem with a specific patch that was proposed.  That patch was never used.

> & the concerns raised by Ben & Mike Christie in bug 606259.
> Seeing these discussions, one does get the impression that you may run into
> problems if using directio - something that may be avoided with other checkers
> like tur. I am just looking for a confirmation on this from Red Hat.

But I'll follow-up with Ben on the use of directio given his reply here:
https://bugzilla.redhat.com/show_bug.cgi?id=606259#c75

In comment#26, Mike Christie speculated that testing all paths with directio could be a problem.  So we'll work on getting you confirmation.  Thanks.

Comment 29 Ben Marzinski 2010-10-12 15:32:45 UTC
Without setting up a NetApp box to test this, I can't say for certain, but we've had the checker set to directio for a while now, and when I've used one in the past, I've never noticed any ping-ponging. Multipath does check both the active and the failed paths, but ping-ponging on an ALUA setup seems unlikely to me.

Do you know if reading a single sector's worth of IO from the non-optimal path will cause it to transition if you have an implicit ALUA setup? I assume not. On most arrays, the non-optimal path needs to receive significantly more IO than the optimal path for the array to switch which controller manages the LUN. Otherwise, what you have is an active/passive array that can automatically transfer the active path, which is not what ALUA is.

With multipathd, both paths get checked just as often, so the amount of checker IO should be the same. However, the optimal path gets all the IO coming to the multipath device, so there will never be a time when the non-optimal path is getting more IO than the optimal path.

Comment 31 Martin George 2010-11-15 15:25:54 UTC
Could we please have this ALUA transitioning fix backported to 5.5.z? And that means backporting the jiffies related fix in bug 556476 to 5.5.z as well.

Comment 35 Mike Snitzer 2010-11-30 19:26:06 UTC
*** Bug 606259 has been marked as a duplicate of this bug. ***

Comment 36 Chris Ward 2010-12-02 15:32:14 UTC
Reminder! There should be a fix present for this BZ in snapshot 3 -- unless otherwise noted in a previous comment.

Please test and update this BZ with test results as soon as possible.

Comment 37 Eryu Guan 2010-12-21 11:45:23 UTC
Any test results available here?

Comment 39 Andrius Benokraitis 2010-12-21 15:00:26 UTC
Action on NetApp to test this ASAP. Any results, Martin?

Comment 41 Mike Snitzer 2010-12-21 22:37:26 UTC
(In reply to comment #39)
> Action on NetApp to test this ASAP. Any results, Martin???

Martin,

The 5.6 kernel may be downloaded from here:
http://people.redhat.com/jwilson/el5/238.el5/

Comment 42 Rajashekhar M A 2010-12-22 13:45:12 UTC
Test results look good. Updated 'PartnerVerified' accordingly.

Comment 44 errata-xmlrpc 2011-01-13 21:46:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

