Created attachment 471134 [details]
Screenshot of boot errors showing multipath device failing to be created.

Description of problem:

When an EMC CX120 SAN is configured in Active-Active mode, device-mapper-multipath fails to correctly create multipath devices. This SAN can configure paths in Active-Active or Active-Passive mode as required.

On a boot-from-SAN host, the following console log is observed on boot when using default multipath settings for an EMC RAID5 SAN, kernel 2.6.18-194.el5, and Active-Active mode:

--------
Red Hat nash version 5.1.19.6 starting
device-mapper: table: 253:0: multipath: error attaching hardware handler
device-mapper: table: 253:0: multipath: error attaching hardware handler
device-mapper: table: 253:0: multipath: error attaching hardware handler
device-mapper: table: 253:0: multipath: error attaching hardware handler
No devices found
Unable to access resume device (/dev/mapper/mpath0p2)
mount: could not find filesystem '/dev/root'
setuproot: moving /dev failed: No such file or directory
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: no such file or directory
switchroot: mount failed: No such file or directory
Kernel panic - not syncing: Attempted to kill init!
---------

In rescue mode, "multipath -ll" reports the path as:

mpath0 () dm-0 DGC,RAID 5
[size=25G][features=0][hwhandler=0][rw]

On RHEL 5.5 with an Active-Passive configuration, the device appears as expected:

mpath0 (3600601601680290014bf991a3d01e011) dm-0 DGC,RAID 5
[size=25G][features=1 queue_if_no_path][hwhandler=1 emc][rw]

Everything works correctly with RHEL 5.4 kernels (2.6.18-164.*), or with RHEL 5.5 kernels (2.6.18-194.*) in an Active-Passive configuration.

Version-Release number of selected component (if applicable):
- Red Hat Enterprise Linux 5.5 (kernel 2.6.18-194.*)
- EMC CX120 SAN configured in Active-Active mode
- Tested with device-mapper-multipath-0.4.7-34.el5.x86_64 (but the hardware handler lives in the kernel)

How reproducible:

Very reproducible with access to hardware.

Steps to Reproduce:
1. Install a RHEL 5.5 server in a boot-from-SAN multipathed configuration from an EMC CX120 SAN in Active-Passive mode.
2. Ensure dm-multipath is configured using the defaults for EMC RAID5 devices.
3. Change the LUN configuration on the EMC SAN to "Active-Active" mode.
4. Reboot the host.

Actual results:
- Device mapper fails to create the multipath device, leaving errors:
  device-mapper: table: 253:0: multipath: error attaching hardware handler
- The system fails to boot, as the root filesystem cannot be found due to the missing multipath device.

Expected results:
- device-mapper-multipath correctly creates the multipath device, as it does in Active-Passive mode.
- The system boots up as it does in Active-Passive mode, without panicking.

Additional info:
- This is reproducible without boot-from-SAN, but our test environment used it.
- RHEL 5.4 kernels boot fine in either A-A or A-P mode.
- RHEL 5.5 kernels fail in A-A mode but succeed in A-P mode.
- As a workaround, we configured dm-multipath not to use the hardware handler, and were then able to boot with RHEL 5.5 in A-A mode. Configuration:

device {
        vendor                  "DGC"
        product                 ".*"
        product_blacklist       "LUNZ"
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/sbin/mpath_prio_emc /dev/%n"
        features                "1 queue_if_no_path"
        # Disable hardware handler for active-active mode
        # hardware_handler      "1 emc"
        hardware_handler        "0"
        path_grouping_policy    group_by_prio
        failback                immediate
        rr_weight               uniform
        no_path_retry           60
        rr_min_io               1000
        path_checker            emc_clariion
}
A new hardware handler, scsi_dh_emc, was added in RHEL 5.5, and it appears to be at the heart of the issue:

http://bugzilla.redhat.com/show_bug.cgi?id=437107
http://people.redhat.com/mchristi/scsi/alua/5.4/v4/0002-RHEL-5.4-Add-scsi_dh_emc.patch

Also, changing the component to kernel, as that is where the hardware handler lives.
Can you check that the scsi_dh_emc module is in the initramfs? Is dm_emc in there?
And is the box in ALUA mode or Trespass?
Created attachment 481007 [details] init filesystem archive
Created attachment 481009 [details] The requested initrd
Created attachment 482082 [details] Screenshot showing alua errors
Created attachment 486790 [details] Video of booting
This bug is also reproducible on RHEL 5.6:

- EMC CX4-40 LUN configured in active-active (ALUA) mode
- server kickstarted using RHEL 5.6 distribution media
- 'mpath' passed on the kernel append line to make this kickstart create a SAN boot disk
- kickstart config file has "part ... --ondisk mapper/mpath0" lines to configure the system partitions
- install proceeds and completes as it did in 5.4
- in /etc/fstab, the system partition devices to be mounted as /boot, "/" or /var are /dev/mapper/mpath0pX

When the system boots it throws errors:

table: 253:0: multipath: error attaching hardware handler
ioctl: error adding target to table

We managed to boot it up using partition labels. When the system comes up, the system partitions are /dev/sdXY, not the dm devices we would expect. Of course this is wrong and not acceptable.

A quick /etc/init.d/multipathd restart produces these in dmesg:

device-mapper: multipath: version 1.0.5 loaded
device-mapper: multipath round-robin: version 1.0.0 loaded
device-mapper: multipath emc: version 0.0.3 loaded
device-mapper: dm-raid45: initialized v0.2594l
device-mapper: multipath: Using scsi_dh module scsi_dh_emc for failover/failback and device management.
device-mapper: table: 253:0: multipath: error attaching hardware handler
device-mapper: ioctl: error adding target to table

and it keeps repeating.

# uname -a
Linux ******* 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
(In reply to comment #44)
> /etc/init.d/multipathd restart
>
> produces these in dmesg:
>
> device-mapper: multipath: version 1.0.5 loaded
> device-mapper: multipath round-robin: version 1.0.0 loaded
> device-mapper: multipath emc: version 0.0.3 loaded

Could you retry the multipathd restart test, but make sure dm-emc does not get loaded? To do this, run:

rmmod dm-emc
mv /lib/modules/$YOUR_KERNEL_VERSION/kernel/drivers/md/dm-emc.ko some-tmp-place

then rerun the multipathd restart test.

> device-mapper: dm-raid45: initialized v0.2594l
> device-mapper: multipath: Using scsi_dh module scsi_dh_emc for
> failover/failback and device management.
> device-mapper: table: 253:0: multipath: error attaching hardware handler
> device-mapper: ioctl: error adding target to table

Could you send all of /var/log/messages? Send the messages for the failed restart test and the messages for the restart with dm-emc not loaded. Send it all. I want to see the dm messages like those above, and also the messages from the SCSI layer when the SCSI devices are first discovered.
Looks like we are just missing this patch:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=6c10db72c94818573552fd71c89540da325efdfb

which fixes the scsi_dh_detach function. This should allow dm-multipath to detach scsi_dh_alua and use scsi_dh_emc's compat mode.

I think for a long-term fix we might want to have multipathd determine dynamically whether it should use emc or alua. It can do this by sending an inquiry and checking whether the device supports ALUA; if it does, use that.

Ben, what do you think about the multipathd ALUA detection? Do you want to clone this bz?
Created attachment 487713 [details] fix scsi_dh_detach Allow kernel to detach scsi_dh module.
Here is a test kernel with the patch:

http://people.redhat.com/mchristi/kernel-2.6.18-249.el5.scsi_dh_detach0.x86_64.rpm
http://people.redhat.com/mchristi/kernel-2.6.18-249.el5.scsi_dh_detach0.i686.rpm
http://people.redhat.com/mchristi/kernel-PAE-2.6.18-249.el5.scsi_dh_detach0.i686.rpm

MikeB, have your customer try this kernel instead of the test suggested in comment 51.
I have performed some simple tests with the new test kernel. I see no difference; see the attached text file.
Created attachment 488216 [details] test kernel 2.6.18-249.el5.scsi_dh_detach0 test results
Thanks for testing, Greg. Mike B, hold off on testing. I think I know what went wrong. Will update the bz later today.
Here is an updated kernel:

http://people.redhat.com/mchristi/kernel-2.6.18-249.el5.detach2.x86_64.rpm

I got access to a Clariion here and tested the patch, so it should work.
No go for me with 2.6.18-249.el5.detach2:

- initrd boot phase is clean; the boot disk seems to be detected as /dev/sdd (root partition sdd2)
- the switch to the root fs as per the 'kernel' line in grub.conf seems to take place, with root identified by disk label
- here I can see udev start with OK status, then a little pause, and then a bunch of errors complaining about the fs superblock and other fs parameters as if the fs were corrupted. I personally think that at this point it expects the root fs to come from /dev/sdd2 while it cannot see this device. Active path changed to one of the other /dev/sdX???
- the strange thing, though, is that it starts executing rc.sysinit, throws more errors from there, and then gets stuck
- there are no other disks visible to the system at this point, only the boot disk, but with 4 paths and therefore 4 SCSI devices, /dev/sd[a-d]

A possible hint; it does not tell me anything but it may to Chris: on the original kickstart build with the 'mpath' directive, it results in a kernel panic on the switch from the miniroot, as it could not find root=/dev/mapper/mpath0p2 as defined in grub.conf. We labeled the disk partitions and changed to root=LABEL=/root_mp, and it booted this way. I noticed that the initrd for the 'mpath' build was twice the size of a regular build on local disk, so I decided to remake the initrd once I was able to boot with labels and see what happens. On an attempt to boot with the newly remade initrd, the system behaved exactly the same way as with the 2.6.18-249.el5.detach2 kernel.
Greg, Could you send the log like you did in comment #56?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 488825 [details]
console screenshots for kernel-2.6.18-249.el5.detach2

I have re-created the entire build process, to take no chances. Using stock RHEL 5.6 media, I performed a kickstart with the 'mpath' directive. In the post-install section I applied 'rpm -i kernel-2.6.18-249.el5.detach2.rpm'. The system at this point sees only one disk, the SAN boot disk.

The system boots up into kernel-2.6.18-249.el5.detach2 without errors. The system partition references in grub.conf and fstab are /dev/mapper/mpath0pX; the reason is that that is what the anaconda/installer kernel makes available and what is used in the 'part' directive of the kickstart. Long term, using /dev/mapper/mpath0pX is not sustainable: on the very same system I added another LUN, and on the next mock-up install the SAN boot disk (LUN) became the mpath1 device.

I attempted to switch to labels by creating custom labels for the "/", boot, var, and tmp partitions: san_root, san_boot, san_var, and san_tmp respectively. I changed the references in grub.conf and fstab to labels and rebooted. The attached screenshots document the relevant part of the boot.

Something still seems to be wrong with device-mapper, as when it attempts to fsck the root fs it goes and grabs /dev/sdd2 (one of the SCSI paths to the root partition) and fails to the repair prompt. In repair mode I remounted the root fs rw and tried a few things; see the screenshots. Interesting are the attempts to mount /boot and /tmp: it looks like device-mapper is hitting an inactive path and fails to mount. That, I suppose, is when it tries to use the LABEL from fstab to get to the partition. If the /dev/mapper device is given, the mount succeeds. PUZZLED :(
Greg, thanks for the testing. I am going to send the fix to the kernel code to get merged in RHEL 5.7. For the label issue, I think we should open a new bz: we have to do a bz per issue, since the bz scripts will automatically close this one when the kernel patch gets merged and released in a new kernel.
As much as merging this patch removes certain conditions resulting in errors, we still have a non-functioning device-mapper, IMHO; see below:

1. I restored the grub.conf and fstab references to the system partitions back to using /dev/mapper/mpath0pX. Again, this notation is really not sustainable long term, as the boot disk may become another mpathX should more LUNs be presented to the system; however, with one LUN, just for testing, I can accept this for now.

2. The system boots clean. Here, however, are outputs that really concern me:

root@hpwbls025:/root# multipath -ll
mpath0 (360060160c6b0220058e85640c553e011) dm-0 DGC,VRAID
[size=35G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][active]
 \_ 0:0:0:0 sda 8:0  [active][ready]
 \_ 1:0:1:0 sdd 8:48 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:1:0 sdb 8:16 [active][ready]
 \_ 1:0:0:0 sdc 8:32 [active][ready]

root@hpwbls025:/root# for i in a b c d
> do
> fdisk -l /dev/sd$i
> done

Disk /dev/sda: 37.5 GB, 37580963840 bytes
255 heads, 63 sectors/track, 4568 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End    Blocks    Id  System
/dev/sda1   *         1        65     522081   83  Linux
/dev/sda2            66      2615   20482875   83  Linux
/dev/sda3          2616      3635    8193150   83  Linux
/dev/sda4          3636      4568   7494322+    5  Extended
/dev/sda5          3636      3896    2096451   83  Linux

Disk /dev/sdd: 37.5 GB, 37580963840 bytes
255 heads, 63 sectors/track, 4568 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End    Blocks    Id  System
/dev/sdd1   *         1        65     522081   83  Linux
/dev/sdd2            66      2615   20482875   83  Linux
/dev/sdd3          2616      3635    8193150   83  Linux
/dev/sdd4          3636      4568   7494322+    5  Extended
/dev/sdd5          3636      3896    2096451   83  Linux

aha!
fdisk cannot see the boot disk down paths 'b' and 'c'. Confirmation of the same from dmesg:

Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdb, logical block 9175039
Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdc, logical block 0
Buffer I/O error on device sdc, logical block 9175039
Buffer I/O error on device sdc, logical block 0

In my opinion, the BZ we are attempting to close maybe technically fixes something, but device-mapper is still left broken. How do I open a BZ against something that does not exist? :) I obviously cannot open one against this test kernel, can I?
Hey, I think I goofed on the bz management. Originally we were debugging the active/active ALUA problem, but I think I ended up fixing a slightly different problem with this bz.

We are getting the "multipath: error attaching hardware handler" message and then boot failure because of a kernel bug: if we have scsi_dh_alua attached but userspace requests emc, due to that being in multipath.conf, then dm-multipath setup fails when we try to detach alua and attach emc. That is fixed with the patch in this bz, so booting up should work again.

For the case where we are in active-active ALUA mode, and scsi_dh_emc attaches and detects we are in ALUA mode but incorrectly fails IO because it thinks we are in active-passive mode, let's use
https://bugzilla.redhat.com/show_bug.cgi?id=692239

And for making the ALUA and trespass mode detection dynamic, so that the multipath tools just automatically use the correct one, Ben can make a bz for multipath-tools.
(In reply to comment #65) > For the case where we are in active active alua mode, and when scsi_dh_emc > attaches and detects we are in alua mode, but incorrectly fails IO because it > thinks we are in active passive mode then lets use > https://bugzilla.redhat.com/show_bug.cgi?id=692239 > > And for making the alua and tresspass mode detection dynamic so multipath tools > just automatically use the correct one, Ben can make a bz for multipath-tools. Actually, Ben, did you want to just use bz 692239? I am not sure if there is anything I can do. If userspace tells us to use the emc module when we are in alua active active mode, then it will screw us up, so we want userspace to do the right thing.
Patch(es) available in kernel-2.6.18-255.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
(In reply to comment #68)
> Patch(es) available in kernel-2.6.18-255.el5
> You can download this test kernel (or newer) from
> http://people.redhat.com/jwilson/el5
> Detailed testing feedback is always welcomed.

The SAN boot disk from the CLARiiON with ALUA active-active and multipath works fine, under the condition that the hardware handler is changed from the default to 'alua':

devices {
        device {
                vendor                  "DGC"
                product                 ".*"
                product_blacklist       LUNZ
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                path_selector           "round-robin 0"
                path_checker            emc_clariion
                features                "1 queue_if_no_path"
                hardware_handler        "1 alua"
                prio_callout            "/sbin/mpath_prio_alua /dev/%n"
                failback                immediate
                rr_weight               uniform
                no_path_retry           60
                rr_min_io               1000
        }
}

I have not tested anything else, but noticed that once online the system spews this:

be2net 0000:04:00.0: Error in cmd completion - opcode 121, compl 2, extd 30

It does that with the frequency of a human heartbeat :)
(In reply to comment #66) > Actually, Ben, did you want to just use bz 692239? I am not sure if there is > anything I can do. Ben, ignore this for now. I think we are going to try to make the kernel attachment more dynamic which should solve the problem here.
Hi Mike, are you thinking along the lines of auto-discovery/configuration?
(In reply to comment #72)
> Hi Mike, are you thinking along the lines of auto-discovery/configuration?

Sort of. I think the idea is that the scsi_dh modules will try to figure out what is best. This would replace the current behavior, where it just depends on which scsi_dh module gets loaded first. So I think the patches would do something like this:

1. Send an inquiry, and if ALUA is supported, use that.
2. Try device-specific handlers. For EMC, basically call clariion_bus_attach.

Do you see any problems?
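A minimal sketch of that selection order, purely for illustration: the device object and helper names (supports_alua, FakeDevice) are hypothetical stand-ins for the kernel-side probing, not the actual scsi_dh interfaces.

```python
# Sketch of the proposed handler-selection order (assumption: helper
# names are invented for illustration; the real work happens in-kernel).

def pick_handler(device) -> str:
    # 1. Send an inquiry; if the target reports ALUA support, prefer scsi_dh_alua.
    if device.supports_alua():
        return "scsi_dh_alua"
    # 2. Otherwise fall back to a device-specific handler; for EMC CLARiiON
    #    this would be the equivalent of calling clariion_bus_attach().
    if device.vendor == "DGC":
        return "scsi_dh_emc"
    return "none"

class FakeDevice:
    """Stand-in device used only to exercise the decision logic."""
    def __init__(self, vendor, alua):
        self.vendor = vendor
        self._alua = alua
    def supports_alua(self):
        return self._alua
```

With this ordering, a CLARiiON in ALUA mode would get scsi_dh_alua, while an older PNR-only array would still get scsi_dh_emc, instead of the outcome depending on module load order.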
I'd suggest checking PNR first, and if not supported, then use ALUA. That would eliminate any weirdness which might occur if older arrays (which don't support ALUA) are probed.
(In reply to comment #74) > I'd suggest checking PNR first, and if not supported, then use ALUA. That would > eliminate any weirdness which might occur if older arrays (which don't support > ALUA) are probed. Ok. Makes sense. If the device indicates it supports both like in this bz would we want to use ALUA?
(In reply to comment #75)
> (In reply to comment #74)
> > I'd suggest checking PNR first, and if not supported, then use ALUA. That would
> > eliminate any weirdness which might occur if older arrays (which don't support
> > ALUA) are probed.
>
> Ok. Makes sense.
>
> If the device indicates it supports both like in this bz would we want to use
> ALUA?

To give some more info here: in this bz we are loading scsi_dh_emc first, so clariion_bus_attach runs and determines it is OK for it to attach to the device. It does, however, hit that ALUA check in parse_sp_info_reply and sees that the device supports ALUA, but scsi_dh_emc decides to stay in PNR mode.

The problem is then that the device is in active-active ALUA mode, and scsi_dh_emc basically does not know what that is. scsi_dh_emc sees one path as CLARIION_LUN_OWNED and the other as CLARIION_LUN_BOUND. In clariion_prep_fn, if the lun state is not CLARIION_LUN_OWNED, it will fail the IO, and so users are seeing failed IO.
So I guess I am saying I am not sure whether there is a bug in scsi_dh_emc, or whether scsi_dh_emc should be interpreting something in the inquiry data differently for this case.
I can check into the inquiry returns on Monday. Let me get a look at the logic in clariion_bus_attach and see -- we might have to do something like "if we see we're already ALUA-qualified, then warn the user and go for ALUA". But that of course might be interesting if the user has set up a custom config thinking they're in PNR mode... ugh.
In the standard inquiry return, byte 5, bits 4 and 5 (zero-relative), contain the TPGS status. They'll be 00 on a PNR array and 11 on an ALUA box.

The lun_state information can be ambiguous, as 0x01 and 0x02 are returned in both ALUA and PNR modes but mean different things:

0x01 (PNR)  LUN bound and assigned to the OTHER SP
0x01 (ALUA) LUN bound and using the non-optimized path
0x02 (PNR)  LUN bound and assigned to THIS SP
0x02 (ALUA) LUN bound and using the optimized path

So I'd say we check TPGS first, and then we can interpret the lun_state accordingly based on that info...
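That interpretation can be sketched as follows. This is illustrative only, not the scsi_dh_emc code: the inquiry buffers are fabricated stand-ins, and only byte 5 (the TPGS field) is meaningful here.

```python
# Sketch: decode the TPGS field from standard INQUIRY data (byte 5,
# bits 4-5, zero-relative) and interpret the CLARiiON lun_state per mode,
# following the table in the comment above.

def tpgs_mode(inquiry: bytes) -> int:
    """Return the TPGS field: 0b00 on a PNR array, 0b11 on an ALUA box."""
    return (inquiry[5] >> 4) & 0x3

def describe_lun_state(inquiry: bytes, lun_state: int) -> str:
    """lun_state 0x01/0x02 means different things in ALUA vs PNR mode."""
    alua = tpgs_mode(inquiry) != 0
    if lun_state == 0x01:
        return ("bound, using the non-optimized path" if alua
                else "bound, assigned to the OTHER SP")
    if lun_state == 0x02:
        return ("bound, using the optimized path" if alua
                else "bound, assigned to THIS SP")
    return "unbound or unknown"

# Fabricated 6-byte inquiry prefixes for illustration; only byte 5 matters.
pnr_inq = bytes([0, 0, 0, 0, 0, 0x00])   # TPGS = 00 -> PNR
alua_inq = bytes([0, 0, 0, 0, 0, 0x30])  # bits 4 and 5 set -> TPGS = 11 -> ALUA
```

Checking TPGS first, as suggested, removes the ambiguity: the same lun_state byte is then mapped to the right meaning for the mode the array is actually in.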
(In reply to comment #75)

I should caution you that an array can support either ALUA or PNR _and_ could be user-configured to either. Arrays prior to the newer VNX series were configured to PNR by default and user-selectable to ALUA. The new VNX series array is now the opposite: configured to ALUA by default and user-selectable to PNR.
Correct. The TPGS inquiry return is set per initiator (it changes with the failovermode setting of 1 or 4). So it really does depend on correct user setup, but if we probe it up front we should always return the right data.
Thanks Jerry and Wayne.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html