Bug 482737 - Add explicit ALUA support to kernel
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: high  Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: Mike Christie
QA Contact: Red Hat Kernel QE team
Keywords: FutureFeature, OtherQA
Depends On:
Blocks: 459808 RHEL5u4_relnotes 483701 483784 485920
Reported: 2009-01-27 14:38 EST by Mike Christie
Modified: 2015-04-27 16:27 EDT
CC: 13 users

Doc Type: Enhancement
Doc Text:
Asymmetric Logical Unit Access (ALUA) support in device-mapper-multipath has been updated, adding explicit ALUA support for Clariion storage. Earlier versions of Red Hat Enterprise Linux 5 added support for implicit ALUA (i.e. the operating system is not aware of which storage device paths have optimized performance and which have non-optimized performance). If the operating system consistently sends I/O on a non-optimized path, then the storage device may transparently make that path optimized, improving performance and causing idle paths to become non-optimized. Red Hat Enterprise Linux 5.4 introduces explicit ALUA support for Clariion storage (i.e. the operating system exchanges information with the storage device and is able to select the paths that have optimized performance).
Last Closed: 2009-09-02 04:30:56 EDT


Attachments
attach to emc clarrion automatically (892 bytes, patch)
2009-03-23 17:33 EDT, Mike Christie

Description Mike Christie 2009-01-27 14:38:04 EST
Description of problem:

This is a request to add the scsi_dh_alua module from upstream to RHEL 5.4.

The scsi_dh framework is already in 5.3. We just need to bring in the scsi_dh_alua module and modify it for changes in our kernel; for example, the request setup is different:
https://bugzilla.redhat.com/show_bug.cgi?id=471920

We also need to add the vendor IDs to the default device table.
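
Once the module is in the kernel build, a quick sanity check is possible with something like the following (a rough sketch; it only assumes the module keeps the upstream name scsi_dh_alua):

# modinfo scsi_dh_alua
# modprobe scsi_dh_alua && lsmod | grep scsi_dh_alua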

Comment 1 RHEL Product and Program Management 2009-01-27 15:44:18 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 2 RHEL Product and Program Management 2009-02-16 10:17:18 EST
Updating PM score.
Comment 3 Mike Christie 2009-03-12 17:29:50 EDT
Here is the current ALUA patch:
http://people.redhat.com/mchristi/scsi/alua/5.4/0001-RHEL-5.4-Add-ALUA-v2.patch

It was made over this kernel:
http://people.redhat.com/dzickus/el5/134.el5/

Please test.
Comment 4 Wayne Berthiaume 2009-03-13 08:45:36 EDT
Hi Mike.

  Does this mean the 134.el5 kernel contains the patch and we need to test this kernel, or do we need to patch that kernel for our testing?

Regards,
Wayne.
Comment 5 Mike Christie 2009-03-13 16:10:51 EDT
(In reply to comment #4)
> Hi Mike.
> 
>   Does this mean the 134.el5 kernel contains the patch and we need to
> test this kernel, or do we need to patch that kernel for our testing?
> 

Patch that kernel.

If you do not have any kernel guys that know how to build kernels, let me know. I can make an rpm for you. Also let me know the arch (i686, x86_64, ppc, etc.) that you are using so I can just make an rpm for the arch you want to test.

You probably want someone that can make kernels though. I suspect it needs some love that will require someone to look at the code.
Comment 6 Don 2009-03-20 12:29:50 EDT
Mike, could you make me an rpm for x86_64?

Thanks
Don blood
Comment 7 Mike Christie 2009-03-23 17:33:07 EDT
Created attachment 336390 [details]
attach to emc clarrion automatically

In the patch I referenced in a previous comment, http://people.redhat.com/mchristi/scsi/alua/5.4/0001-RHEL-5.4-Add-ALUA-v2.patch,
I did not add the EMC CLARiiON strings that would make it attach automatically.


This patch has us attach to devices with {vendor,model} as

+       {"DGC", "RAID"},
+       {"DGC", "DISK"},
Comment 8 Mike Christie 2009-03-23 17:33:48 EDT
The rpm with ALUA support, which should attach to EMC CLARiiON devices automatically, is here:

http://people.redhat.com/mchristi/scsi/alua/tmp/kernel-2.6.18-136.el5dz_test.alua2.x86_64.rpm
Comment 9 Don 2009-03-30 08:27:37 EDT
I have installed this kernel and I have the same issue as I logged in bugzilla 471920: multipath -ll returns nothing.
Comment 10 Mike Christie 2009-03-30 21:35:18 EDT
(In reply to comment #9)
> I have installed this kernel and I have the same issue as I logged in
> bugzilla 471920: multipath -ll returns nothing.



For the kernel you tested and before you even run multipath -ll:

1. Is scsi_dh_alua loaded?

2. If you do "lsmod | grep scsi_dh_alua" what is the "Used by" count?

3. Do the /var/log/messages entries for the logical units using ALUA indicate that scsi_dh_alua detected them and set them up correctly, or are there errors?

Now if that is all ok:

4. When you run

#/sbin/multipath

to setup the multipath devices (no -ll)

do you see any errors about scsi_dh_alua not getting loaded or not being present?
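
A quick way to gather all of the above in one pass (a rough sketch; it assumes the default syslog location):

# lsmod | grep scsi_dh_alua
# grep alua /var/log/messages | tail -n 50
# /sbin/multipath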
Comment 11 Don 2009-03-31 10:04:06 EDT
1. Is scsi_dh_alua loaded?

No, so I loaded it with a modprobe, and the /var/log/messages file had a lot of these entries.

Mar 31 09:42:40 lynx kernel: sector 32282767, nr/cnr 0/0
Mar 31 09:42:40 lynx kernel: bio ffff81011dd0a400, biotail ffff81011dd0a400, buffer 0000000000000000, data 0000000000000000, len 36
Mar 31 09:42:40 lynx kernel: sd 5:0:0:0: alua: not supported
Mar 31 09:42:40 lynx kernel: sd 5:0:0:0: alua: not attached
Mar 31 09:42:40 lynx kernel: SCSI bad req: dev ?: flags = REQ_SOFTBARRIER REQ_NOMERGE REQ_STARTED 
Mar 31 09:42:40 lynx kernel: sector 32282767, nr/cnr 0/0
Mar 31 09:42:40 lynx kernel: bio ffff81011dd0a400, biotail ffff81011dd0a400, buffer 0000000000000000, data 0000000000000000, len 36
Mar 31 09:42:40 lynx kernel: sd 5:0:0:1: alua: not supported
Mar 31 09:42:40 lynx kernel: sd 5:0:0:1: alua: not attached
Mar 31 09:42:40 lynx kernel: SCSI bad req: dev ?: flags = REQ_SOFTBARRIER REQ_NOMERGE REQ_STARTED 
Mar 31 09:42:40 lynx kernel: sector 32282767, nr/cnr 0/0

I then tried 
 /sbin/multipath
error calling out mpath_prio_alua /dev/sdc
error calling out mpath_prio_alua /dev/sdd
error calling out mpath_prio_alua /dev/sde
error calling out mpath_prio_alua /dev/sdf
error calling out mpath_prio_alua /dev/sdg
error calling out mpath_prio_alua /dev/sdh
error calling out mpath_prio_alua /dev/sdi
error calling out mpath_prio_alua /dev/sdj
error calling out mpath_prio_alua /dev/sdk
error calling out mpath_prio_alua /dev/sdl
error calling out mpath_prio_alua /dev/sdm
error calling out mpath_prio_alua /dev/sdn
error calling out mpath_prio_alua /dev/sdo
error calling out mpath_prio_alua /dev/sdp
error calling out mpath_prio_alua /dev/sdq
error calling out mpath_prio_alua /dev/sdr
error calling out mpath_prio_alua /dev/sdc
error calling out mpath_prio_alua /dev/sdg
error calling out mpath_prio_alua /dev/sdk
error calling out mpath_prio_alua /dev/sdo
DM message failed [queue_if_no_path
]
error calling out mpath_prio_alua /dev/sdd
error calling out mpath_prio_alua /dev/sdh
error calling out mpath_prio_alua /dev/sdl
error calling out mpath_prio_alua /dev/sdp
DM message failed [queue_if_no_path
]
error calling out mpath_prio_alua /dev/sde
error calling out mpath_prio_alua /dev/sdi
error calling out mpath_prio_alua /dev/sdm
error calling out mpath_prio_alua /dev/sdq
DM message failed [queue_if_no_path
]
error calling out mpath_prio_alua /dev/sdf
error calling out mpath_prio_alua /dev/sdj
error calling out mpath_prio_alua /dev/sdn
error calling out mpath_prio_alua /dev/sdr
DM message failed [queue_if_no_path

tail /var/log/messages showed:

Mar 31 09:57:03 lynx kernel: device-mapper: table: 253:2: multipath: error getting device
Mar 31 09:57:03 lynx kernel: device-mapper: ioctl: error adding target to table
Mar 31 09:57:03 lynx kernel: device-mapper: table: 253:2: multipath: error getting device
Mar 31 09:57:03 lynx kernel: device-mapper: ioctl: error adding target to table
Mar 31 09:57:03 lynx multipathd: dm-2: remove map (uevent) 
Mar 31 09:57:03 lynx kernel: device-mapper: table: 253:2: multipath: error getting device
Mar 31 09:57:03 lynx kernel: device-mapper: ioctl: error adding target to table
Mar 31 09:57:03 lynx multipathd: dm-2: remove map (uevent) 
Mar 31 09:57:03 lynx multipathd: dm-2: remove map (uevent) 
Mar 31 09:57:03 lynx kernel: device-mapper: table: 253:2: multipath: error getting device
Mar 31 09:57:03 lynx kernel: device-mapper: ioctl: error adding target to table
Mar 31 09:57:03 lynx multipathd: dm-2: remove map (uevent)
Comment 12 Mike Christie 2009-03-31 12:27:00 EDT
(In reply to comment #11)
> 1. Is scsi_dh_alua loaded?
> 
> No, so I loaded it with a modprobe, and the /var/log/messages file had a lot
> of these entries.
> 
> Mar 31 09:42:40 lynx kernel: sector 32282767, nr/cnr 0/0
> Mar 31 09:42:40 lynx kernel: bio ffff81011dd0a400, biotail ffff81011dd0a400,
> buffer 0000000000000000, data 0000000000000000, len 36
> Mar 31 09:42:40 lynx kernel: sd 5:0:0:0: alua: not supported
> Mar 31 09:42:40 lynx kernel: sd 5:0:0:0: alua: not attached
> Mar 31 09:42:40 lynx kernel: SCSI bad req: dev ?: flags = REQ_SOFTBARRIER
> REQ_NOMERGE REQ_STARTED 

Thanks for the info. There is some bug in there with the cmd flags still. It is strange that this is working ok with other targets though. It looks like something is clearing some of the bits we had set. I will get the machine we have here and test it out.
Comment 13 Mike Christie 2009-03-31 12:48:14 EDT
Ok, I found the problem. The scsi_dh_alua module itself was clearing the flags right after setting them; the final rq->flags = 0; line below wipes out the bits that were just set.

+
+	memset(rq->cmd, 0, BLK_MAX_CDB);
+	rq->flags |= REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT |
+			REQ_FAILFAST_DRIVER | REQ_NOMERGE | REQ_BLOCK_PC;
+	rq->retries = ALUA_FAILOVER_RETRIES;
+	rq->timeout = ALUA_FAILOVER_TIMEOUT;
+	rq->flags = 0;

I will make a new kernel.
Comment 14 Mike Christie 2009-03-31 19:52:35 EDT
Here is an updated patch:
http://people.redhat.com/mchristi/scsi/alua/5.4/0001-RHEL-5.4-Add-ALUA-v3.patch

And an updated x86_64 kernel:
http://people.redhat.com/mchristi/scsi/alua/tmp/kernel-2.6.18-136.el5dz_test.alua3.x86_64.rpm


You should not have to manually load scsi_dh_alua. When you run the /sbin/multipath command it should load it for you. But for a first test, to make sure that the ALUA detection is correct, could you do the modprobe scsi_dh_alua first and then run /sbin/multipath?
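
In other words, the first-pass sequence to try is simply:

# modprobe scsi_dh_alua
# /sbin/multipath
# multipath -ll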
Comment 15 Don 2009-04-01 11:58:45 EDT
When I did that I received this in the messages file:

Apr  1 11:48:54 lynx multipathd: cannot open mpath_prio_alua : No such file or directory  
Apr  1 11:48:54 lynx multipathd: cannot open /sbin/dasd_id : No such file or directory  
Apr  1 11:48:54 lynx multipathd: cannot open /sbin/gnbd_import : No such file or directory  
Apr  1 11:48:54 lynx multipathd: [copy.c] cannot open mpath_prio_alua 
Apr  1 11:48:54 lynx multipathd: cannot copy mpath_prio_alua in ramfs : No such file or directory 
Apr  1 11:48:54 lynx multipathd: [copy.c] cannot open /sbin/dasd_id 
Apr  1 11:48:54 lynx multipathd: cannot copy /sbin/dasd_id in ramfs : No such file or directory 
Apr  1 11:48:54 lynx multipathd: [copy.c] cannot open /sbin/gnbd_import 
Apr  1 11:48:54 lynx multipathd: cannot copy /sbin/gnbd_import in ramfs : No such file or directory 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdc 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdd 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sde 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdf 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdg 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdh 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdi 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdj 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdk 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdl 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdm 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdn 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdo 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdp 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdq 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdr 
Apr  1 11:48:54 lynx kernel: device-mapper: table: 253:2: multipath: error getting device
Apr  1 11:48:54 lynx kernel: device-mapper: ioctl: error adding target to table
Apr  1 11:48:54 lynx kernel: device-mapper: table: 253:2: multipath: error getting device
Apr  1 11:48:54 lynx kernel: device-mapper: ioctl: error adding target to table
Apr  1 11:48:54 lynx kernel: device-mapper: table: 253:2: multipath: error getting device
Apr  1 11:48:54 lynx kernel: device-mapper: ioctl: error adding target to table
Apr  1 11:48:54 lynx kernel: device-mapper: table: 253:2: multipath: error getting device
Apr  1 11:48:54 lynx kernel: device-mapper: ioctl: error adding target to table
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdc 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdg 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdk 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdo 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdd 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdh 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdl 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdp 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sde 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdi 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdm 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdq 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdf 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdj 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdn 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdr 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdc 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdg 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdk 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdo 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdd 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdh 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdl 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdp 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sde 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdi 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdm 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdq 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdf 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdj 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdn 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdr 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdc 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdg 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdk 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdo 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdd 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdh 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdl 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdp 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sde 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdi 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdm 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdq 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdf 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdj 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdn 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdr 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdc 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdg 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdk 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdo 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdd 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdh 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdl 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdp 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sde 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdi 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdm 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdq 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdf 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdj 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdn 
Apr  1 11:48:54 lynx multipathd: error calling out mpath_prio_alua /dev/sdr 
Apr  1 11:48:54 lynx multipathd: DM message failed [queue_if_no_path ] 
Apr  1 11:48:54 lynx multipathd: path checkers start up
Comment 16 Mike Christie 2009-04-01 14:03:49 EDT
Did you set the multipath conf to use mpath_prio_alua?

I am ccing Benjamin Marzinski, because it looks like a multipathd issue.
Comment 17 Don 2009-04-01 14:36:34 EDT
yes

# Version  : 1.0
# 
defaults {
user_friendly_names  yes
}
#
# The blacklist is the enumeration of all devices that are to be
# excluded from multipath control
blacklist {
       devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
       devnode "^hd[a-z][[0-9]*]"
       devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

devices {
#      Device attributed for EMC CLARiiON
        device {
                vendor                  "DGC"
                product                 "*"
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
#               prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                prio_callout            "mpath_prio_alua /dev/%n"
#               path_checker            emc_clariion 
                path_checker            alua  
                features                "1 queue_if_no_path"                   
#               hardware_handler        "1 emc"
                hardware_handler        "1 alua"
                failback                immediate
                no_path_retry           12
        }
}
Comment 18 Ben Marzinski 2009-04-02 14:08:44 EDT
In RHEL5 multipathd loads all the callout programs into a private ramfs, in case it loses access to the filesystem where they are stored.  It needs the full path name of the callouts to do this.  In your multipath.conf, you have

#               prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                prio_callout            "mpath_prio_alua /dev/%n


You need to specify the full path name for the prio callout.
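
That is, the prio_callout line should carry the absolute path, matching the commented-out emc example above it:

                prio_callout            "/sbin/mpath_prio_alua /dev/%n"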
Comment 19 Don 2009-04-07 14:20:34 EDT
I have this configured and I am seeing what I would expect from a multipath -ll. I will continue to test this and post my findings.
Comment 20 Don Zickus 2009-05-06 13:16:10 EDT
in kernel-2.6.18-144.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 22 Chris Ward 2009-06-14 19:20:16 EDT
~~ Attention Partners RHEL 5.4 Partner Alpha Released! ~~

RHEL 5.4 Partner Alpha has been released on partners.redhat.com. There should
be a fix present that addresses this particular request. Please test and report back your results here, at your earliest convenience. Our Public Beta release is just around the corner!

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have verified the request functions as expected, please set your Partner ID in the Partner field above to indicate successful test results. Do not flip the bug status to VERIFIED. Further questions can be directed to your Red Hat Partner Manager. Thanks!
Comment 23 Don 2009-06-18 14:22:58 EDT
I have the latest kernel from http://people.redhat.com/dzickus/el5 (kernel-2.6.18-153.el5) and I am pleased to say testing has gone very well. Next up will be when I get a 5.4 Beta.
Comment 24 Chris Ward 2009-07-03 14:22:17 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 25 Don 2009-07-10 08:42:17 EDT
I have begun testing 5.4 Beta and there seems to be a problem right up front:
scsi_dh_alua does not load automatically, so I have loaded it with modprobe.
It seems to load OK; there are a few of these entries in /var/log/messages:

Jul  9 15:13:46 cranmore kernel: sd 5:0:0:0: alua: supports implicit and explicit TPGS
Jul  9 15:13:46 cranmore kernel: sd 5:0:0:0: alua: port group 02 rel port 0f
Jul  9 15:13:46 cranmore kernel: sd 5:0:0:0: alua: port group 02 state N supports toUsNA
Jul  9 15:13:46 cranmore kernel: sd 5:0:0:1: alua: supports implicit and explicit TPGS
Jul  9 15:13:46 cranmore kernel: sd 5:0:0:1: alua: port group 02 rel port 0f
Jul  9 15:13:46 cranmore kernel: sd 5:0:0:1: alua: port group 02 state A supports toUsNA

Then I did a
/sbin/multipath
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdb
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdc
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdd
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sde
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdf
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdg
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdh
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdi
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdb
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdd
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdf
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdh
DM message failed [queue_if_no_path]
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdc
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sde
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdg
not enough space for response from mpath_prio_alua
mpath_prio_alua exitted with 255
error calling out mpath_prio_alua /dev/sdi
DM message failed [queue_if_no_path

And in /var/log/messages was
Jul  9 15:29:09 cranmore kernel: device-mapper: table: 253:2: multipath: error getting device
Jul  9 15:29:09 cranmore kernel: device-mapper: ioctl: error adding target to table
Jul  9 15:29:09 cranmore multipathd: dm-2: remove map (uevent) 
Jul  9 15:29:09 cranmore kernel: device-mapper: table: 253:2: multipath: error getting device
Jul  9 15:29:09 cranmore kernel: device-mapper: ioctl: error adding target to table
Jul  9 15:29:09 cranmore multipathd: dm-2: remove map (uevent) 

And I still cannot see any of the devices.
My /etc/multipath.conf file is the same one I used with 5.3 and the 153 kernel that worked. 
This is the 155 kernel that is part of the Beta release.
Comment 26 Andrius Benokraitis 2009-07-14 08:59:22 EDT
So Don - I noticed you tested both -153 and -155, -153 succeeded but -155 failed, is this correct?
Comment 28 Don 2009-07-16 00:24:37 EDT
that is correct
Comment 29 Mike Christie 2009-07-16 14:35:09 EDT
Don,

For the autoload comment, did you mean that scsi_dh_alua was not autoloaded when you ran the multipath command? I think it should only get autoloaded after you run the multipath command.

Ben, nothing changed in scsi_dh_alua between 153 and 155. What are the multipath tools messages? Does Don need to update them?
Comment 30 Ben Marzinski 2009-07-16 18:01:01 EDT
What version of multipath were you using when you tested these two kernels?  I can't think of any changes to the multipath tools that would have affected this, but I'll take a look and see if I can find something.
Comment 31 Don 2009-07-20 13:49:47 EDT
I am getting the same behavior with S1 kernel 156 as I did with Beta1 kernel 155, comment #25.


scsi_dh_alua does not autoload at all, during the boot or after a command with 155 or 156.

When I was using Kernel 153 it loaded at boot time.
Comment 32 Mike Christie 2009-07-20 15:11:11 EDT
(In reply to comment #31)
> I am getting the same behavior with S1 kernel 156 as I did with Beta1 kernel
> 155, comment #25.
> 
> 
> scsi_dh_alua does not autoload at all, during the boot or after a command with
> 155 or 156.
> 
> When I was using Kernel 153 it loaded at boot time.  

I do not think we have supported loading it at boot time automatically. You would have had to modify the initrd by hand (via mkinitrd), or have /sbin/multipath run and set up the kernel parts successfully, which would have loaded it. Maybe the multipath failures in comment #25 are preventing it from getting loaded.


Ben, do you think this bz
https://bugzilla.redhat.com/show_bug.cgi?id=490633
is related to Don's mpath_prio_alua not enough space errors?
Comment 33 Don 2009-07-21 07:45:13 EDT
>I do not think we have supported it getting loaded at boot time automatically.
>You would have had to modify the mkinitrd by hand, or /sbin/multipath getting
>run and setting up the kernel parts successfully would have loaded it.

I did not modify or manually do any of this to get it to load with kernel 153.
Comment 34 Ben Marzinski 2009-07-22 12:12:33 EDT
The "not enough space" message simply means that multipath didn't allocate enough space for the error message that mpath_prio_alua printed when it failed. I should bump that up, trim the error message.

Could you try manually running

# /sbin/mpath_prio_alua -v <device_name>
# echo $?

This will print the debugging messages and the error message that was too big for multipath, which will help in figuring out what is going wrong with the callout.
Comment 35 Don 2009-07-22 14:23:56 EDT
[root@cranmore ~]# /sbin/mpath_prio_alua -v /dev/sdf
Target port groups are implicitly and explicitly supported.
Reported target port group is 1 [active/optimized]
50
[root@cranmore ~]# echo $?
0
Comment 36 Mike Christie 2009-07-23 13:18:32 EDT
I now have a setup. The bad news is that I am not seeing this bug.

(In reply to comment #33)
> >I do not think we have supported it getting loaded at boot time automatically.
> >You would have had to modify the mkinitrd by hand, or /sbin/multipath getting
> >run and setting up the kernel parts successfully would have loaded it.
> 
> I did not modify or manualy do any of this to get it to load with kernel 153.  

Yeah, I was wrong here. There is one other option: when the FC driver (qla2xxx, lpfc, etc.) gets loaded and SCSI devices are added, a userspace helper runs the multipath program, and that ends up loading scsi_dh_alua without any user intervention.

I used the multipath.conf in your previous comment with the fix suggested by Ben to use the full path. I then just loaded qla2xxx and it all worked automagically for me.


I will build a kernel with some more debugging info and see if maybe something is going on when the multipath device is getting set up in dm-mpath.c.
Comment 37 Don 2009-07-24 15:14:59 EDT
I am not sure if something was different in S1 or I had done something different, but S3 seems to be working from my preliminary testing.
Comment 38 Don 2009-07-28 07:51:48 EDT
After more testing it appears to be doing an implicit trespass as opposed to an explicit one.
I have tried different combinations of parameters in the /etc/multipath.conf file but nothing seems to change. I am going to get a trace to prove my suspicions.

My question is: are there any recommended parameters that should be put into the multipath.conf file for a Clariion in ALUA mode to do an explicit trespass?
Comment 39 Don 2009-07-28 12:17:23 EDT
Upon further investigation, including a trace, we are seeing this in the messages file:
Jul 28 09:39:11 cranmore kernel: device-mapper: multipath: Failing path 8:48.
Jul 28 09:39:11 cranmore kernel: sd 5:0:0:0: alua: port group 02 state N supports toUsNA
Jul 28 09:39:11 cranmore kernel: sd 6:0:1:0: alua: port group 02 state N supports toUsNA
Jul 28 09:39:11 cranmore kernel: device-mapper: multipath: Failing path 8:64.
Jul 28 09:44:43 cranmore kernel: sd 5:0:1:0: alua: port group 01 state A supports toUsNA
Jul 28 09:44:43 cranmore kernel: sd 6:0:0:0: alua: rtpg failed with 10000
Jul 28 09:44:43 cranmore kernel: sd 5:0:1:0: alua: port group 01 state A supports toUsNA
Jul 28 09:44:43 cranmore kernel: sd 6:0:0:0: alua: rtpg failed with 10000
Jul 28 09:44:49 cranmore kernel: sd 6:0:0:0: alua: port group 01 state A supports toUsNA

In the trace there are no set commands.

The test runs I/O down the paths to a LUN on SPA, then we pull all cables to SPA. The trace shows that after about 45 seconds the I/O just changes paths.
Eventually, after enough I/O, the array does the trespass.

Let me know if you want the trace made available.
Comment 40 Mike Christie 2009-07-28 12:51:37 EDT
(In reply to comment #39)
> upon further investigation including a trace we are seeing this in the messages
> file 
> Jul 28 09:39:11 cranmore kernel: device-mapper: multipath: Failing path 8:48.
> Jul 28 09:39:11 cranmore kernel: sd 5:0:0:0: alua: port group 02 state N
> supports toUsNA
> Jul 28 09:39:11 cranmore kernel: sd 6:0:1:0: alua: port group 02 state N
> supports toUsNA
> Jul 28 09:39:11 cranmore kernel: device-mapper: multipath: Failing path 8:64.
> Jul 28 09:44:43 cranmore kernel: sd 5:0:1:0: alua: port group 01 state A
> supports toUsNA
> Jul 28 09:44:43 cranmore kernel: sd 6:0:0:0: alua: rtpg failed with 10000

We did not explicitly fail over the device, because the REPORT TARGET PORT GROUPS command failed. It looks like it did not even get sent. 10000 is DID_NO_CONNECT, which we get when the transport connection is gone or if the device is in a bad state (a bad OS state, like it is being removed and we are tearing down its structs, or the SCSI error handler has offlined the device).

Could you just do

cat /sys/block/sdX/device/state

sdX should be the device you are trying to send the failover commands to. So in the above example it would be 6:0:0:0.
Comment 41 Mike Christie 2009-07-28 12:55:34 EDT
(In reply to comment #40)
> Could you just do
> 
> cat /sys/block/sdX/device/state
> 
> sdX should be the device you are trying to send the failover commands to. So in
> the above example it would be 6:0:0:0 (these values are the Host:Channel:Target_id:LUN).

Could you also print out the port state for the rport the device is on?

This is in /sys/class/fc_remote_ports/rport-H:C-R/port_state

The rport values are Host:Channel-Rport_id

So for the example above you want the port states for

rport-6:0-*

(* there should be multiple ports).
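
For example, a quick loop over all the rports on host 6 (hypothetical host number; adjust it to your setup):

# for p in /sys/class/fc_remote_ports/rport-6:0-*/port_state; do echo "$p: $(cat $p)"; done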
Comment 42 Don 2009-07-28 14:37:16 EDT
I could not find the state via cat /sys/block/sdX/device/state where sdX was 6:0:0:0.
I did find it in these places

cat /sys/block/sdg/device/scsi_device:6:0::0:1/device/state
running

cat /sys/bus/scsi/devices/6:0:0:0/state
running

cat /sys/bus/scsi/drivers/sd/6:0:0:0/state
running

I did find these

cat /sys/class/fc_remote_ports/rport-6:0-0/port_state
Online

cat /sys/class/fc_remote_ports/rport-6:0-1/port_state
Online
Comment 43 Mike Christie 2009-07-28 14:56:01 EDT
Could you attach /var/log/messages? I just want to see when devices are getting added and multipath is getting set up, so I can see the initial messages like this:

Jul  9 15:13:46 cranmore kernel: sd 5:0:0:1: alua: port group 02 rel port 0f
Jul  9 15:13:46 cranmore kernel: sd 5:0:0:1: alua: port group 02 state A
supports toUsNA

And I want to see the log for when you try to do the failover.

And could you run

multipath -ll

and send that output.
Comment 44 Mike Christie 2009-07-28 15:19:35 EDT
 (In reply to comment #40)
> (In reply to comment #39)
> > upon further investigation including a trace we are seeing this in the messages
> > file 
> > Jul 28 09:39:11 cranmore kernel: device-mapper: multipath: Failing path 8:48.
> > Jul 28 09:39:11 cranmore kernel: sd 5:0:0:0: alua: port group 02 state N
> > supports toUsNA
> > Jul 28 09:39:11 cranmore kernel: sd 6:0:1:0: alua: port group 02 state N
> > supports toUsNA
> > Jul 28 09:39:11 cranmore kernel: device-mapper: multipath: Failing path 8:64.
> > Jul 28 09:44:43 cranmore kernel: sd 5:0:1:0: alua: port group 01 state A
> > supports toUsNA

This device looks like it is already in the optimized state, so we would not send an STPG for it.

Do you have multiple LUNs here? Does it work with just one? Could we start small to make this easier?
Comment 45 Don 2009-07-28 15:28:23 EDT
It doesn't get much smaller. I have 2 LUNs, one on SPA and one on SPB, with 4 paths. I send I/O only to the LUN on SPA and pull the cables to that SP.
Comment 46 Mike Christie 2009-07-28 15:40:16 EDT
Ok, just send me the logs and multipath -ll output.
Comment 47 Mike Christie 2009-07-29 00:11:04 EDT
I think I found the problem. I sent a patch upstream here:
http://marc.info/?l=linux-scsi&m=124884050116882&w=2

I am making some test kernels. They should be done in a couple of hours, but I will be fast asleep :) I will upload and send a link when I wake up.
Comment 48 Mike Christie 2009-07-29 02:57:38 EDT
(In reply to comment #47)
> I am making some test kernels. They should be done in a couple of hours, but
> I will be fast asleep :)

Sleep is for wimps :) Here is a test kernel:
http://people.redhat.com/mchristi/scsi/alua/tmp/
Comment 49 Don 2009-07-29 08:46:57 EDT
Well, I loaded the kernel, and as for the trespass, as soon as the failed path shows up in the log it happens almost immediately, which is a good thing. However, after I put the cables back in I no longer get failback, and that was working fine on the last kernel. Here is the scenario and the log.

Remove cables

Jul 29 08:05:30 cranmore kernel:  rport-5:0-0: blocked FC remote port time out: saving binding
Jul 29 08:05:30 cranmore kernel: sd 5:0:0:1: SCSI error: return code = 0x00010000
Jul 29 08:05:30 cranmore kernel: end_request: I/O error, dev sdc, sector 1886144
Jul 29 08:05:30 cranmore kernel: device-mapper: multipath: Failing path 8:32.
Jul 29 08:05:30 cranmore multipathd: dm-3: add map (uevent) 
Jul 29 08:05:30 cranmore multipathd: dm-3: devmap already registered 
Jul 29 08:05:30 cranmore multipathd: 8:32: mark as failed 
Jul 29 08:05:30 cranmore multipathd: mpath1: remaining active paths: 3 
Jul 29 08:05:31 cranmore multipathd: sdb: readsector0 checker reports path is down 
Jul 29 08:05:31 cranmore multipathd: checker failed path 8:16 in map mpath2 
Jul 29 08:05:31 cranmore multipathd: mpath2: remaining active paths: 3 
Jul 29 08:05:31 cranmore kernel: device-mapper: multipath: Failing path 8:16.
Jul 29 08:05:31 cranmore multipathd: sdc: readsector0 checker reports path is down 
Jul 29 08:05:31 cranmore multipathd: dm-2: add map (uevent) 
Jul 29 08:05:31 cranmore multipathd: dm-2: devmap already registered 
Jul 29 08:05:33 cranmore kernel:  rport-6:0-1: blocked FC remote port time out: saving binding
Jul 29 08:05:33 cranmore kernel: sd 6:0:1:1: SCSI error: return code = 0x00010000
Jul 29 08:05:33 cranmore kernel: end_request: I/O error, dev sdi, sector 1886144
Jul 29 08:05:33 cranmore kernel: device-mapper: multipath: Failing path 8:128.
Jul 29 08:05:33 cranmore multipathd: dm-3: add map (uevent) 
Jul 29 08:05:33 cranmore multipathd: dm-3: devmap already registered 
Jul 29 08:05:33 cranmore multipathd: 8:128: mark as failed 
Jul 29 08:05:33 cranmore multipathd: mpath1: remaining active paths: 2 
Jul 29 08:05:33 cranmore kernel: activate
Jul 29 08:05:33 cranmore kernel: alua_activeate
Jul 29 08:05:33 cranmore kernel: sd 5:0:1:1: alua: port group 02 state N supports toUsNA
Jul 29 08:05:33 cranmore kernel: alua_activeate stpg
Jul 29 08:05:33 cranmore kernel: sd 5:0:1:1: alua: port group 02 switched to state A
Jul 29 08:05:33 cranmore kernel: stpg err 0
Jul 29 08:05:33 cranmore kernel: activate
Jul 29 08:05:33 cranmore kernel: alua_activeate
Jul 29 08:05:33 cranmore kernel: sd 6:0:0:1: alua: port group 02 state A supports toUsNA
Jul 29 08:05:33 cranmore kernel: stpg err 0
Jul 29 08:05:34 cranmore multipathd: sdh: readsector0 checker reports path is down 
Jul 29 08:05:34 cranmore kernel: device-mapper: multipath: Failing path 8:112.
Jul 29 08:05:34 cranmore multipathd: checker failed path 8:112 in map mpath2 
Jul 29 08:05:34 cranmore multipathd: mpath2: remaining active paths: 2 
Jul 29 08:05:34 cranmore multipathd: sdi: readsector0 checker reports path is down 
Jul 29 08:05:34 cranmore multipathd: dm-2: add map (uevent) 
Jul 29 08:05:34 cranmore multipathd: dm-2: devmap already registered 
Jul 29 08:05:36 cranmore multipathd: sdb: readsector0 checker reports path is down 
Jul 29 08:05:36 cranmore multipathd: sdc: readsector0 checker reports path is do

Insert Cables

Jul 29 08:08:16 cranmore multipathd: 8:16: reinstated 
Jul 29 08:08:16 cranmore multipathd: mpath2: remaining active paths: 3 
Jul 29 08:08:16 cranmore multipathd: sdc: readsector0 checker reports path is up 
Jul 29 08:08:16 cranmore multipathd: 8:32: reinstated 
Jul 29 08:08:16 cranmore multipathd: mpath1: remaining active paths: 3 
Jul 29 08:08:16 cranmore multipathd: dm-2: add map (uevent) 
Jul 29 08:08:16 cranmore multipathd: dm-2: devmap already registered 
Jul 29 08:08:16 cranmore multipathd: dm-3: add map (uevent) 
Jul 29 08:08:16 cranmore multipathd: dm-3: devmap already registered 
Jul 29 08:08:19 cranmore multipathd: sdh: readsector0 checker reports path is down 
Jul 29 08:08:19 cranmore multipathd: sdi: readsector0 checker reports path is down 
Jul 29 08:08:24 cranmore multipathd: sdh: readsector0 checker reports path is up 
Jul 29 08:08:24 cranmore multipathd: 8:112: reinstated 
Jul 29 08:08:24 cranmore multipathd: mpath2: remaining active paths: 4 
Jul 29 08:08:24 cranmore multipathd: sdi: readsector0 checker reports path is up 
Jul 29 08:08:24 cranmore multipathd: 8:128: reinstated 
Jul 29 08:08:24 cranmore multipathd: mpath1: remaining active paths: 4 
Jul 29 08:08:24 cranmore multipathd: dm-2: add map (uevent) 
Jul 29 08:08:24 cranmore multipathd: dm-2: devmap already registered 
Jul 29 08:08:24 cranmore multipathd: dm-3: add map (uevent) 
Jul 29 08:08:24 cranmore multipathd: dm-3: devmap already registered 


Wait forever (1 hour): no failback.
Comment 50 Mike Christie 2009-07-29 09:30:41 EDT
When it was doing implicit failover and finally trespassed, does the ALUA state printed by mpath_prio_alua -v /dev/XYZ say active/optimized, or does it stay active/non-optimized?

If I use the old kernel and kill paths and re-enable them before the device does the implicit trespass (the state of the paths being used is still active/non-optimized), then failback works.

If I use the old kernel and kill paths and re-enable them after the device does the implicit trespass (the state of the paths being used is now active/optimized), then failback fails, like with the new kernel.


Then, if the path does get transitioned/trespassed, and I do

multipath -F

then do

multipath

The initial paths being used are the ones that were in active/optimized when I ran multipath -F (so it was the ones that we ended up transitioning to when I pulled the cables).


I think multipathd is doing this on purpose, because of what mpath_prio_alua is returning.

Ben, I have a box with this setup and running right now at
pnate.lab.bos.redhat.com

(username root, password is the normal boston lab one).

Can you confirm this?
Comment 53 Mike Christie 2009-07-29 10:20:38 EDT
If I use

prio_callout            "/sbin/mpath_prio_emc /dev/%n"


for the prio callout (but still use

hardware_handler        "1 alua"

for the hwhandler) then failback and device setup works as expected.
Comment 54 Don 2009-07-29 11:11:02 EDT

>When it was doing implicit failover, and it finally trespasses does the alua
>state spit out by mpath_prio_alua -v /dev/XYZ say active/optimized or does it
>stay in active/non-optimized?

Before the cable pull

[root@cranmore ~]# mpath_prio_alua -v /dev/sdc
Target port groups are implicitly and explicitly supported.
Reported target port group is 1 [active/optimized]
50
[root@cranmore ~]# mpath_prio_alua -v /dev/sde
Target port groups are implicitly and explicitly supported.
Reported target port group is 2 [active/non-optimized]
10
[root@cranmore ~]# mpath_prio_alua -v /dev/sdg
Target port groups are implicitly and explicitly supported.
Reported target port group is 2 [active/non-optimized]
10
[root@cranmore ~]# mpath_prio_alua -v /dev/sdi
Target port groups are implicitly and explicitly supported.
Reported target port group is 1 [active/optimized]
50

After the cable pull and trespass

[root@cranmore ~]# mpath_prio_alua -v /dev/sdc
 
[root@cranmore ~]# mpath_prio_alua -v /dev/sde
Target port groups are implicitly and explicitly supported.
Reported target port group is 2 [active/optimized]
50
[root@cranmore ~]# mpath_prio_alua -v /dev/sdg
Target port groups are implicitly and explicitly supported.
Reported target port group is 2 [active/optimized]
50
[root@cranmore ~]# mpath_prio_alua -v /dev/sdi

After cable plug-in and failback occurred.
 
[root@cranmore ~]# mpath_prio_alua -v /dev/sdc
Target port groups are implicitly and explicitly supported.
Reported target port group is 1 [active/non-optimized]
10
[root@cranmore ~]# mpath_prio_alua -v /dev/sde
Target port groups are implicitly and explicitly supported.
Reported target port group is 2 [active/optimized]
50
[root@cranmore ~]# mpath_prio_alua -v /dev/sdg
Target port groups are implicitly and explicitly supported.
Reported target port group is 2 [active/optimized]
50
[root@cranmore ~]# 
[root@cranmore ~]# mpath_prio_alua -v /dev/sdi
Target port groups are implicitly and explicitly supported.
Reported target port group is 1 [active/non-optimized]
10
Comment 55 Don 2009-07-29 11:16:09 EDT
I have been using

prio_callout            "/sbin/mpath_prio_alua /dev/%n"

and

hardware_handler        "1 alua"

I will try and use

prio_callout            "/sbin/mpath_prio_emc /dev/%n"



Hence my question in comment #38
Comment 56 Don 2009-07-29 12:12:35 EDT
I tried prio_callout   "/sbin/mpath_prio_emc /dev/%n"
and it did not work unless I had 
failback    immediate

Is there anything else in this device attributes list you would suggest? This is the one I currently have working for ALUA.

devices {
#      Device attributed for EMC CLARiiON
        device {
#                vendor                  "DGC"
#                product                 "*"
#                path_grouping_policy    group_by_prio
#                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                 prio_callout            "/sbin/mpath_prio_emc /dev/%n"
#                prio_callout            "/sbin/mpath_prio_alua /dev/%n"
#                path_checker            emc_clariion 
#               path_checker            alua  
#               features                "1 queue_if_no_path"                   
#               hardware_handler        "1 emc"
                hardware_handler        "1 alua"
                failback                immediate
#               no_path_retry           12
        }
}
Comment 57 Ben Marzinski 2009-07-29 12:54:24 EDT
I'm confused about comments #49 and #50.  If you've already trespassed, why would you want to fail back (at the risk of showing my Clariion ignorance)?
Comment 58 Don 2009-07-29 13:05:52 EDT
Part of the product's features (and this is a basic explanation) is that when you lose all paths to a device on one SP, it trespasses from that SP to the other SP that still has good connections. Once the connections to the first SP are repaired, the devices are supposed to fail back to their original SP. That is part of our testing. There are various ways this is done, but I won't get into them.
Comment 59 Ben Marzinski 2009-07-29 14:11:49 EDT
/sbin/mpath_prio_alua checks the asymmetric access state of the target port group. It doesn't care about the preference bit. So, it will not failback from the optimized paths.

/sbin/mpath_prio_emc bases its priority on whether the path is the default owner, so this should work the way you want it to. Of course, as you saw, the failback will only happen immediately after the paths are restored if failback is set to immediate, since the failback option defaults to manual.

Also, I just want to double check that most of the device fields (including the essential vendor and product fields) were commented out to highlight parts of your configuration.  I can't imagine that this would work correctly if you actually had your vendor and product lines commented out.  multipath wouldn't know what devices to use that configuration on.
Comment 60 Zhang Kexin 2009-07-30 01:37:15 EDT
Patch is in kernel-2.6.18-159; adding SanityOnly.
Comment 61 Wayne Berthiaume 2009-07-30 08:30:09 EDT
In reference to comment #59, Hannes pushed a patch to use the pref bit in the alua handler recently so it will return the LUN to its original owner after a trespass has occurred and the fault has been resolved.
Comment 62 Don 2009-07-30 08:39:10 EDT
Do you know what kernel that will be or what snapshot? I am testing 2.6.18-160 and it is not working there.
Comment 63 Andrius Benokraitis 2009-07-30 09:57:54 EDT
Since it is in POST, the latest patch hasn't been committed.
Comment 67 Mike Christie 2009-07-30 12:56:04 EDT
(In reply to comment #61)
> In reference to comment #59, Hannes pushed a patch to use the pref bit in the
> alua handler recently so it will return the LUN to its original owner after a
> trespass has occurred and the fault has been resolved.  

Hey Wayne,

Did he push that to SLES or upstream? If upstream, was this a patch for multipath-tools, and did the patch go to the dm-devel list?

I did not see any kernel stuff on linux-scsi.

I did a quick search on dm-devel (only checked the email subjects), but did not see anything there either.
Comment 68 Tom Coughlan 2009-07-30 17:18:08 EDT
Hi Don, Wayne,

Can I ask you to draft a release note for 5.4, letting customers know what to expect when they upgrade from 5.3, and what new functionality they get in 5.4? Post a draft here and maybe we can refine it together. Just a brief user-level view.

Thanks,

Tom
Comment 69 Ben Marzinski 2009-07-30 19:10:09 EDT
The ALUA priority checker works the same in upstream and RHEL.
Comment 70 Don 2009-07-31 08:04:35 EDT
I would put in my findings, but I would first need to know if the patch in comment #61 is going to make GA, and I would have to test it before I could say what the customer is going to see.
Comment 71 Don 2009-07-31 08:21:15 EDT
An S4 came out today; does anyone know if the patch from comment #61 made it into it?
Comment 72 Tom Coughlan 2009-07-31 09:35:34 EDT
(In reply to comment #71)
> An S4 came out today does anyone know if the patch from comment #61 made it
> into it?  

No, it is not in snap 4 (or 5). The patch from comment 61 is making its way into the rc kernel right now.
Comment 73 Mike Christie 2009-07-31 11:34:27 EDT
(In reply to comment #72)
> (In reply to comment #71)
> > An S4 came out today does anyone know if the patch from comment #61 made it
> > into it?  
> 
> No, it is not in snap 4 (or 5). The patch from comment 61 is making its way
> into the rc kernel right now.  

It is actually a different patch.

The patch in comment #47 is making its way into the rc kernel right now.

The patch in comment #61 is somewhere in SUSE. The guy Wayne is referring to, Hannes, works at SUSE, and when he says pushed he means pushed to SUSE or upstream. We have no idea if it is a kernel or userspace patch. I looked in the kernel and the kernel mailing lists and did not see it (comment #67), and it looks like Ben checked the upstream multipath-tools code and did not find it (comment #69).
Comment 74 Don 2009-07-31 11:52:02 EDT
So is this different patch going to use the pref bit in /sbin/mpath_prio_alua
so that it will fail back after a path is restored?
Comment 75 Ben Marzinski 2009-07-31 12:49:31 EDT
I looked in Hannes' git tree. The ALUA patch is there, not upstream.  We can certainly do the same thing.  However, it's pretty late to be changing a prioritizer for 5.4. Is there some reason why mpath_prio_emc doesn't get you what you want?  There are a lot of different devices that use mpath_prio_alua.  I'm not totally sure that this is the right thing to do for all of them, and the Clariion already defaults to using mpath_prio_emc and immediate failback.
Comment 76 Don 2009-07-31 13:27:26 EDT
That's fine, we can release-note it. My testing went fine in ALUA mode using these settings in my multipath.conf:
devices {
#      Device attributed for EMC CLARiiON
        device {
                vendor                  "DGC"
                product                 "*"
                path_grouping_policy    group_by_prio
                prio_callout     "/sbin/mpath_prio_emc /dev/n"                 
                hardware_handler        "1 alua"
                failback                immediate
        }
}  
I know you don't need them all but it made the output of multipath -ll look closer to PNR mode.

With PNR I did not modify the default multipath.conf at all.

I don't want to miss 5.4 with Alua support and I hope that at some point we can sync up with SLES with some kind of standard.
Comment 77 Ben Marzinski 2009-07-31 14:15:47 EDT
Have you tried just removing this section and using the compiled-in defaults? They are:

       device {
               vendor                  "DGC"
               product                 ".*"
               product_blacklist       "LUNZ"
               getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
               prio_callout            "/sbin/mpath_prio_emc /dev/%n"
               features                "1 queue_if_no_path"
               hardware_handler        "1 emc"
               path_grouping_policy    group_by_prio
               failback                immediate
               rr_weight               uniform
               no_path_retry           60
               rr_min_io               1000
               path_checker            emc_clariion
       }

This is exactly the same as they are upstream. If these aren't optimal, I'd like to know, so we can change them.  Specifically, don't we want to use the emc_clariion path checker?
Comment 78 Don 2009-07-31 14:22:21 EDT
In ALUA mode I run into different issues without at least these three; I think they have all been discussed in the bug:
                prio_callout     "/sbin/mpath_prio_emc /dev/%n"
                hardware_handler        "1 alua"
                failback                immediate
Comment 79 Wayne Berthiaume 2009-07-31 15:14:00 EDT
For comment #67. 

Hi Mike, the patch was in NC BZ #501729. You're in the CC. =;^)

Regards,
Wayne.
Comment 80 Don Zickus 2009-08-05 10:08:39 EDT
in kernel-2.6.18-162.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 83 Don 2009-08-07 11:14:31 EDT
Here is a short synopsis of my findings with S4 as well as kernel 162.


MPIO status with explicit ALUA

In ALUA mode these additions must be made in /etc/multipath.conf:

hardware_handler        "1 alua"  (this is necessary to detect ALUA mode and
                                   do an explicit trespass)

prio_callout     "/sbin/mpath_prio_emc /dev/%n"   (this is necessary so LUNs do
                                                   not trespass during a reboot)

failback                immediate  (this is necessary so LUNs will fail back to
                                     their default SP after a failed path
                                     has been restored)
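
Putting those three settings into one device stanza (essentially the configuration from comment 76, with the callout path written out in full), the result looks roughly like this; treat it as a working sketch rather than a vetted default:

devices {
        device {
                vendor                  "DGC"
                product                 "*"
                path_grouping_policy    group_by_prio
                prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                hardware_handler        "1 alua"
                failback                immediate
        }
}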
Comment 84 Rob Evers 2009-08-07 14:56:43 EDT
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
(derived from comment 83 above)

MPIO status with explicit ALUA

When using ALUA mode, these additions must be made in /etc/multipath.conf:

hardware_handler "1 alua"  (this is necessary to detect ALUA mode and do an explicit trespass)

prio_callout "/sbin/mpath_prio_emc /dev/%n"   (this is necessary so LUNs do not trespass during a reboot)

failback immediate (this is necessary so LUNs will fail back to their default SP after a failed path has been restored)
Comment 88 Tom Coughlan 2009-08-23 23:07:50 EDT
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,8 +1,18 @@
-(derived from comment 83 below)
+This is a "new feature" release note. 
 
-MPIO Status with explicit Alua
+I am including some background info., so customers will understand what is new in 5.4. The detailed instructions follow that. 
 
-Using ALUA mode. these additions must be made in the /etc/multipath.conf
+Does this look okay Mike?
+ 
+-------------
+
+Support for Explicit ALUA (Asymmetric Logical Unit Access) in device-mapper-multipath. 
+
+Earlier versions of RHEL 5 support implicit ALUA. This means that the operating system is not aware of which storage device paths have optimized performance and which have non-optimized performance. If the operating system consistently sends I/O on a non-optimized path, then the storage device may transparently make that path optimized, improving performance, and causing  idle paths to become non-optimized.  
+
+RHEL 5.4 introduces explicit ALUA support for Clariion storage. This means that the operating system exchanges information with the storage device and is able to select the paths that have optimized performance.
+
+To make use of ALUA mode. these additions must be made in the /etc/multipath.conf
 
 hardware_handler “1 alua”  (this is necessary to detect alua mode and do an explicit trespass)
Comment 90 Ryan Lerch 2009-08-26 01:21:30 EDT
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,21 +1,3 @@
-This is a "new feature" release note. 
+Asymmetric Logical Unit Access (ALUA) support in device-mapper-multipath has been updated, adding explicit ALUA support for Clariion storage. Earlier versions of Red Hat Enterprise Linux 5 added support for implicit ALUA (i.e. the operating system is not aware of which storage device paths have optimized performance and which have non-optimized performance). If the operating system consistently sends I/O on a non-optimized path, then the storage device may transparently make that path optimized, improving performance and causing idle paths to become non-optimized.
 
-I am including some background info., so customers will understand what is new in 5.4. The detailed instructions follow that. 
+Red Hat Enterprise Linux 5.4 introduces explicit ALUA support for Clariion storage (i.e. the operating system exchanges information with the storage device and is able to select the paths that have optimized performance).-
-Does this look okay Mike?
- 
--------------
-
-Support for Explicit ALUA (Asymmetric Logical Unit Access) in device-mapper-multipath. 
-
-Earlier versions of RHEL 5 support implicit ALUA. This means that the operating system is not aware of which storage device paths have optimized performance and which have non-optimized performance. If the operating system consistently sends I/O on a non-optimized path, then the storage device may transparently make that path optimized, improving performance, and causing  idle paths to become non-optimized.  
-
-RHEL 5.4 introduces explicit ALUA support for Clariion storage. This means that the operating system exchanges information with the storage device and is able to select the paths that have optimized performance.
-
-To make use of ALUA mode. these additions must be made in the /etc/multipath.conf
-
-hardware_handler “1 alua”  (this is necessary to detect alua mode and do an explicit trespass)
-
-prio_callout "/sbin/mpath_prio_emc /dev/n"   (this is necessary so luns do not trespass during a reboot)
-
-failback immediate (this is necessary so luns will failback to to thier default SP after a failed path has been restored)
Comment 91 errata-xmlrpc 2009-09-02 04:30:56 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
