Created attachment 471134 [details]
Screenshot of boot errors showing multipath device failing to be created.

Description of problem:

When an EMC CX120 SAN is configured in Active-Active mode, device-mapper-multipath fails to correctly create multipath devices. This SAN can configure paths in Active-Active or Active-Passive mode as required.

On a boot-from-SAN host, the following console log is observed on boot when using default multipath settings for an EMC RAID5 SAN, kernel 2.6.18-194.el5, and Active-Active mode:

--------
Red Hat nash version 5.1.19.6 starting
device-mapper: table: 253:0: multipath: error attaching hardware handler
device-mapper: table: 253:0: multipath: error attaching hardware handler
device-mapper: table: 253:0: multipath: error attaching hardware handler
device-mapper: table: 253:0: multipath: error attaching hardware handler
No devices found
Unable to access resume device (/dev/mapper/mpath0p2)
mount: could not find filesystem '/dev/root'
setuproot: moving /dev failed: No such file or directory
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: no such file or directory
switchroot: mount failed: No such file or directory
Kernel panic - not syncing: Attempted to kill init!
---------

In rescue mode, "multipath -ll" reports the path as:

mpath0 () dm-0 DGC,RAID 5
[size=25G][features=0][hwhandler=0][rw]

On RHEL 5.5 with an Active-Passive configuration, the device appears as expected:

mpath0 (3600601601680290014bf991a3d01e011) dm-0 DGC,RAID 5
[size=25G][features=1 queue_if_no_path][hwhandler=1 emc][rw]

Everything works correctly with RHEL 5.4 kernels (2.6.18-164.*), or with RHEL 5.5 kernels (2.6.18-194.*) in an Active-Passive configuration.

Version-Release number of selected component (if applicable):
- Red Hat Enterprise Linux 5.5 (kernel 2.6.18-194.*)
- EMC CX120 SAN configured in Active-Active mode
- Tested with device-mapper-multipath-0.4.7-34.el5.x86_64 (but the hardware handler lives in the kernel)

How reproducible:

Very reproducible with access to hardware.

Steps to Reproduce:
1. Install a RHEL 5.5 server in a boot-from-SAN multipathed configuration from an EMC CX120 SAN in Active-Passive mode.
2. Ensure dm-multipath is configured using the defaults for EMC RAID5 devices.
3. Change the LUN configuration on the EMC SAN to "Active-Active" mode.
4. Reboot the host.

Actual results:
- Device mapper fails to create the multipath device, leaving errors:
  device-mapper: table: 253:0: multipath: error attaching hardware handler
- The system fails to boot, as the root filesystem cannot be found due to the missing multipath device.

Expected results:
- device-mapper-multipath correctly creates the multipath device, as it does in Active-Passive mode.
- The system boots up as it does in Active-Passive mode, without panicking.

Additional info:
- This is reproducible without boot-from-SAN, but our test environment used it.
- RHEL 5.4 kernels boot fine in either A-A or A-P mode.
- RHEL 5.5 kernels fail in A-A mode but succeed in A-P mode.
- As a workaround, we configured dm-multipath not to use the hardware handler, and were then able to boot with RHEL 5.5 in A-A mode. Configuration:

device {
        vendor                  "DGC"
        product                 ".*"
        product_blacklist       "LUNZ"
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/sbin/mpath_prio_emc /dev/%n"
        features                "1 queue_if_no_path"
        # Disable hardware handler for active-active mode
        # hardware_handler      "1 emc"
        hardware_handler        "0"
        path_grouping_policy    group_by_prio
        failback                immediate
        rr_weight               uniform
        no_path_retry           60
        rr_min_io               1000
        path_checker            emc_clariion
}
A new hardware handler, scsi_dh_emc, was added in RHEL 5.5, and it appears to be at the heart of the issue:

http://bugzilla.redhat.com/show_bug.cgi?id=437107
http://people.redhat.com/mchristi/scsi/alua/5.4/v4/0002-RHEL-5.4-Add-scsi_dh_emc.patch

Also, changing the component to kernel, as that is where the hardware handler lives.
Can you check that the scsi_dh_emc module is in the initramfs? Is dm_emc in there?
And is the box in ALUA mode or Trespass?
Created attachment 481007 [details] init filesystem archive
Created attachment 481009 [details] The requested initrd
Created attachment 482082 [details] Screenshot showing alua errors
Created attachment 486790 [details] Video of booting
This bug is also reproducible on RHEL 5.6:

- EMC CX4-40 LUN configured in active-active (ALUA) mode
- server kickstarted using RHEL 5.6 distribution media
- 'mpath' passed on the kernel append line to make this kickstart create a SAN boot disk
- kickstart config file has "part ... --ondisk mapper/mpath0" lines to configure the system partitions
- install proceeds and completes as it did in 5.4
- in /etc/fstab, the system partition devices to be mounted as /boot, "/" or /var are /dev/mapper/mpath0pX

When the system boots it throws errors:

table: 253:0: multipath: error attaching hardware handler
ioctl: error adding target to table

We managed to boot it up using partition labels. When the system comes up, the system partitions are /dev/sdXY, not the dm devices we would expect. Of course this is wrong and not acceptable.

A quick /etc/init.d/multipathd restart produces these in dmesg:

device-mapper: multipath: version 1.0.5 loaded
device-mapper: multipath round-robin: version 1.0.0 loaded
device-mapper: multipath emc: version 0.0.3 loaded
device-mapper: dm-raid45: initialized v0.2594l
device-mapper: multipath: Using scsi_dh module scsi_dh_emc for failover/failback and device management.
device-mapper: table: 253:0: multipath: error attaching hardware handler
device-mapper: ioctl: error adding target to table

and it keeps repeating.

# uname -a
Linux ******* 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
(In reply to comment #44)
> /etc/init.d/multipathd restart
>
> produces these in dmesg:
>
> device-mapper: multipath: version 1.0.5 loaded
> device-mapper: multipath round-robin: version 1.0.0 loaded
> device-mapper: multipath emc: version 0.0.3 loaded

Could you retry the multipathd restart test, but make sure dm-emc does not get loaded? To do this, run:

rmmod dm-emc
mv /lib/modules/$YOUR_KERNEL_VERSION/kernel/drivers/md/dm-emc.ko some-tmp-place

then rerun the multipathd restart test.

> device-mapper: dm-raid45: initialized v0.2594l
> device-mapper: multipath: Using scsi_dh module scsi_dh_emc for
> failover/failback and device management.
> device-mapper: table: 253:0: multipath: error attaching hardware handler
> device-mapper: ioctl: error adding target to table

Could you send all of /var/log/messages? Send the messages for the failed restart test and the messages for the restart with dm-emc not loaded. Send it all. I want to see the dm messages like those above, and also the messages from the SCSI layer when the SCSI devices are first discovered.
Looks like we are just missing this patch:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=6c10db72c94818573552fd71c89540da325efdfb

which fixes the scsi_dh_detach function. This should allow dm-multipath to detach scsi_dh_alua and use scsi_dh_emc's compat mode.

I think for a long-term fix we might want to have multipathd determine dynamically whether it should use emc or alua. It can do this by sending an inquiry and checking whether the device supports ALUA; if it does, use that.

Ben, what do you think about the multipathd ALUA detection? Do you want to clone this bz?
Created attachment 487713 [details] fix scsi_dh_detach Allow kernel to detach scsi_dh module.
Here is a test kernel with the patch:

http://people.redhat.com/mchristi/kernel-2.6.18-249.el5.scsi_dh_detach0.x86_64.rpm
http://people.redhat.com/mchristi/kernel-2.6.18-249.el5.scsi_dh_detach0.i686.rpm
http://people.redhat.com/mchristi/kernel-PAE-2.6.18-249.el5.scsi_dh_detach0.i686.rpm

MikeB, have your customer try this kernel instead of the test suggested in comment 51.
I have performed some simple tests with the new test kernel. I see no difference; see the attached text file.
Created attachment 488216 [details] test kernel 2.6.18-249.el5.scsi_dh_detach0 test results
Thanks for testing, Greg. Mike B, hold off on testing. I think I know what went wrong. Will update the bz later today.
Here is an updated kernel:

http://people.redhat.com/mchristi/kernel-2.6.18-249.el5.detach2.x86_64.rpm

I got access to a Clariion here and tested the patch, so it should work.
No go for me with 2.6.18-249.el5.detach2:

- initrd boot phase is clean; the boot disk seems to be detected as /dev/sdd (root partition sdd2)
- the switch to the root fs as per the 'kernel' line in grub.conf seems to take place, with root identified by disk label
- here I can see udev start with OK status, then a little pause, and then a bunch of errors complaining about the fs superblock and other fs parameters as if the fs were corrupted. I personally think that at this point it expects the root fs to come from /dev/sdd2 while it cannot see this device. Active path changed to one of the other /dev/sdX???
- the strange thing, though, is that it starts executing rc.sysinit, throws more errors from there, and then gets stuck
- there are no other disks visible to the system at this point, only the boot disk, but with 4 paths and therefore 4 SCSI devices, /dev/sd[a-d]

A possible hint; it does not tell me anything but it may to Chris: on the original kickstart build with the 'mpath' directive, it results in a kernel panic on the switch from the miniroot, as it could not find root=/dev/mapper/mpath0p2 as defined in grub.conf. We labeled the disk partitions and changed to root=LABEL=/root_mp, and it booted this way. I noticed that the initrd for the 'mpath' build was twice the size of a regular build on local disk, so I decided to remake the initrd once I was able to boot with labels and see what happens. On an attempt to boot with the newly remade initrd, the system behaved exactly the same way as with the 2.6.18-249.el5.detach2 kernel.
Greg, Could you send the log like you did in comment #56?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 488825 [details]
console screenshots for kernel-2.6.18-249.el5.detach2

I have re-created the entire build process, to take no chances. Using stock RHEL 5.6 media, I performed a kickstart with the 'mpath' directive. In the post-install section I applied 'rpm -i kernel-2.6.18-249.el5.detach2.rpm'. The system at this point sees only one disk, the SAN boot disk.

The system boots up into kernel-2.6.18-249.el5.detach2 without errors. The system partition references in grub.conf and fstab are /dev/mapper/mpath0pX; the reason is that that is what the anaconda/installer kernel makes available and what is used in the 'part' directive of the kickstart. Long term, using /dev/mapper/mpath0pX is not sustainable: on the very same system I added another LUN, and on the next mock-up install the SAN boot disk (LUN) became the mpath1 device.

I attempted to switch to labels by creating custom labels for the "/", boot, var, and tmp partitions: san_root, san_boot, san_var, and san_tmp respectively. I changed the references in grub.conf and fstab to labels and rebooted. The attached screenshots document the relevant part of the boot.

Something still seems to be wrong with device-mapper, as when it attempts to fsck the root fs it goes and grabs /dev/sdd2 (one of the SCSI paths to the root partition) and fails to the repair prompt. In repair mode I remounted the root fs rw and tried a few things; see the screenshots. Interesting are the attempts to mount /boot and /tmp: it looks like device-mapper is hitting an inactive path and fails to mount. That, I suppose, is when it tries to use the LABEL from fstab to get to the partition. If the /dev/mapper device is given, the mount succeeds. PUZZLED :(
Greg, thanks for the testing. I am going to send the fix to the kernel code to get merged in RHEL 5.7. For the label issue, I think we should open a new bz: we have to do a bz per issue, since the bz scripts will automatically close this one when the kernel patch gets merged and released in a new kernel.
As much as merging this patch removes certain conditions resulting in errors, we still have a non-functioning device-mapper, IMHO; see below:

1. I restored the grub.conf and fstab references to the system partitions back to using /dev/mapper/mpath0pX. Again, this notation is really not sustainable long term, as the boot disk may become another mpathX should more LUNs be presented to the system; however, with one LUN, just for testing, I can accept this for now.

2. The system boots clean. Here, however, are outputs that really concern me:

root@hpwbls025:/root# multipath -ll
mpath0 (360060160c6b0220058e85640c553e011) dm-0 DGC,VRAID
[size=35G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][active]
 \_ 0:0:0:0 sda 8:0  [active][ready]
 \_ 1:0:1:0 sdd 8:48 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:1:0 sdb 8:16 [active][ready]
 \_ 1:0:0:0 sdc 8:32 [active][ready]

root@hpwbls025:/root# for i in a b c d
> do
> fdisk -l /dev/sd$i
> done

Disk /dev/sda: 37.5 GB, 37580963840 bytes
255 heads, 63 sectors/track, 4568 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End    Blocks    Id  System
/dev/sda1   *         1        65     522081   83  Linux
/dev/sda2            66      2615   20482875   83  Linux
/dev/sda3          2616      3635    8193150   83  Linux
/dev/sda4          3636      4568   7494322+    5  Extended
/dev/sda5          3636      3896    2096451   83  Linux

Disk /dev/sdd: 37.5 GB, 37580963840 bytes
255 heads, 63 sectors/track, 4568 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End    Blocks    Id  System
/dev/sdd1   *         1        65     522081   83  Linux
/dev/sdd2            66      2615   20482875   83  Linux
/dev/sdd3          2616      3635    8193150   83  Linux
/dev/sdd4          3636      4568   7494322+    5  Extended
/dev/sdd5          3636      3896    2096451   83  Linux

aha!
fdisk cannot see the boot disk down paths 'b' and 'c'. Confirmation of the same from dmesg:

Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdb, logical block 9175039
Buffer I/O error on device sdb, logical block 0
Buffer I/O error on device sdc, logical block 0
Buffer I/O error on device sdc, logical block 9175039
Buffer I/O error on device sdc, logical block 0

In my opinion, the BZ we are attempting to close maybe technically fixes something, but device-mapper is still left broken. How do I open a BZ against something that does not exist? :) I obviously cannot open one against this test kernel, can I?
Hey, I think I goofed on the bz management. Originally we were debugging the active/active ALUA problem, but I think I ended up fixing a slightly different problem with this bz.

We are getting the "multipath: error attaching hardware handler" message and then boot failure because of a kernel bug: if we have scsi_dh_alua attached but userspace requests emc, due to that being in multipath.conf, then dm-multipath setup fails when we try to detach alua and attach emc. That is fixed with the patch in this bz, so booting up should work again.

For the case where we are in active-active ALUA mode, and scsi_dh_emc attaches and detects we are in ALUA mode but incorrectly fails IO because it thinks we are in active-passive mode, let's use
https://bugzilla.redhat.com/show_bug.cgi?id=692239

And for making the ALUA and trespass mode detection dynamic, so that the multipath tools just automatically use the correct one, Ben can make a bz for multipath-tools.
(In reply to comment #65) > For the case where we are in active active alua mode, and when scsi_dh_emc > attaches and detects we are in alua mode, but incorrectly fails IO because it > thinks we are in active passive mode then lets use > https://bugzilla.redhat.com/show_bug.cgi?id=692239 > > And for making the alua and tresspass mode detection dynamic so multipath tools > just automatically use the correct one, Ben can make a bz for multipath-tools. Actually, Ben, did you want to just use bz 692239? I am not sure if there is anything I can do. If userspace tells us to use the emc module when we are in alua active active mode, then it will screw us up, so we want userspace to do the right thing.
Patch(es) available in kernel-2.6.18-255.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
(In reply to comment #68)
> Patch(es) available in kernel-2.6.18-255.el5
> You can download this test kernel (or newer) from
> http://people.redhat.com/jwilson/el5
> Detailed testing feedback is always welcomed.

The SAN boot disk from the CLARiiON with ALUA active-active and multipath works fine, under the condition that the hardware handler is changed from the default to 'alua':

devices {
        device {
                vendor                  "DGC"
                product                 ".*"
                product_blacklist       LUNZ
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                path_selector           "round-robin 0"
                path_checker            emc_clariion
                features                "1 queue_if_no_path"
                hardware_handler        "1 alua"
                prio_callout            "/sbin/mpath_prio_alua /dev/%n"
                failback                immediate
                rr_weight               uniform
                no_path_retry           60
                rr_min_io               1000
        }
}

I have not tested anything else, but noticed that once online the system spews this:

be2net 0000:04:00.0: Error in cmd completion - opcode 121, compl 2, extd 30

It does that with the frequency of a human heartbeat :)
(In reply to comment #66) > Actually, Ben, did you want to just use bz 692239? I am not sure if there is > anything I can do. Ben, ignore this for now. I think we are going to try to make the kernel attachment more dynamic which should solve the problem here.
Hi Mike, are you thinking along the lines of auto-discovery/configuration?
(In reply to comment #72)
> Hi Mike, are you thinking along the lines of auto-discovery/configuration?

Sort of. I think the idea is that the scsi_dh modules will try to figure out what is best. This would replace the current behavior, where it just depends on which scsi_dh module gets loaded first. So I think the patches would do something like this:

1. Send an inquiry, and if ALUA is supported, use that.
2. Try device-specific handlers. For EMC, basically call clariion_bus_attach.

Do you see any problems?
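A minimal sketch of that selection order, purely for illustration: the device object and helper names (supports_alua, FakeDevice) are hypothetical stand-ins for the kernel-side probing, not the actual scsi_dh interfaces.

```python
# Sketch of the proposed handler-selection order (assumption: helper
# names are invented for illustration; the real work happens in-kernel).

def pick_handler(device) -> str:
    # 1. Send an inquiry; if the target reports ALUA support, prefer scsi_dh_alua.
    if device.supports_alua():
        return "scsi_dh_alua"
    # 2. Otherwise fall back to a device-specific handler; for EMC CLARiiON
    #    this would be the equivalent of calling clariion_bus_attach().
    if device.vendor == "DGC":
        return "scsi_dh_emc"
    return "none"

class FakeDevice:
    """Stand-in device used only to exercise the decision logic."""
    def __init__(self, vendor, alua):
        self.vendor = vendor
        self._alua = alua
    def supports_alua(self):
        return self._alua
```

With this ordering, a CLARiiON in ALUA mode would get scsi_dh_alua, while an older PNR-only array would still get scsi_dh_emc, instead of the outcome depending on module load order.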
I'd suggest checking PNR first, and if not supported, then use ALUA. That would eliminate any weirdness which might occur if older arrays (which don't support ALUA) are probed.
(In reply to comment #74) > I'd suggest checking PNR first, and if not supported, then use ALUA. That would > eliminate any weirdness which might occur if older arrays (which don't support > ALUA) are probed. Ok. Makes sense. If the device indicates it supports both like in this bz would we want to use ALUA?
(In reply to comment #75)
> (In reply to comment #74)
> > I'd suggest checking PNR first, and if not supported, then use ALUA. That would
> > eliminate any weirdness which might occur if older arrays (which don't support
> > ALUA) are probed.
>
> Ok. Makes sense.
>
> If the device indicates it supports both like in this bz would we want to use
> ALUA?

To give some more info here: in this bz we are loading scsi_dh_emc first, so clariion_bus_attach runs and determines it is OK for it to attach to the device. It does, however, hit that ALUA check in parse_sp_info_reply and sees that the device supports ALUA, but scsi_dh_emc decides to stay in PNR mode.

The problem is then that the device is in active-active ALUA mode, and scsi_dh_emc basically does not know what that is. scsi_dh_emc sees one path as CLARIION_LUN_OWNED and the other as CLARIION_LUN_BOUND. In clariion_prep_fn, if the lun state is not CLARIION_LUN_OWNED, it will fail the IO, and so users are seeing failed IO.
So I guess I am saying I am not sure whether there is a bug in scsi_dh_emc, or whether scsi_dh_emc should be interpreting something in the inquiry data differently for this case.
I can check into the inquiry returns on Monday. Let me get a look at the logic in clariion_bus_attach and see -- we might have to do something like "if we see we're already ALUA-qualified, then warn the user and go for ALUA". But that of course might be interesting if the user has set up a custom config thinking they're in PNR mode... ugh.
In the standard inquiry return, byte 5, bits 4 and 5 (zero-relative), contain the TPGS status. They'll be 00 on a PNR array and 11 on an ALUA box.

The lun_state information can be ambiguous, as 0x01 and 0x02 are returned in both ALUA and PNR modes but mean different things:

0x01 (PNR)  LUN bound and assigned to the OTHER SP
0x01 (ALUA) LUN bound and using the non-optimized path
0x02 (PNR)  LUN bound and assigned to THIS SP
0x02 (ALUA) LUN bound and using the optimized path

So I'd say we check TPGS first, and then we can interpret the lun_state accordingly based on that info...
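That interpretation can be sketched as follows. This is illustrative only, not the scsi_dh_emc code: the inquiry buffers are fabricated stand-ins, and only byte 5 (the TPGS field) is meaningful here.

```python
# Sketch: decode the TPGS field from standard INQUIRY data (byte 5,
# bits 4-5, zero-relative) and interpret the CLARiiON lun_state per mode,
# following the table in the comment above.

def tpgs_mode(inquiry: bytes) -> int:
    """Return the TPGS field: 0b00 on a PNR array, 0b11 on an ALUA box."""
    return (inquiry[5] >> 4) & 0x3

def describe_lun_state(inquiry: bytes, lun_state: int) -> str:
    """lun_state 0x01/0x02 means different things in ALUA vs PNR mode."""
    alua = tpgs_mode(inquiry) != 0
    if lun_state == 0x01:
        return ("bound, using the non-optimized path" if alua
                else "bound, assigned to the OTHER SP")
    if lun_state == 0x02:
        return ("bound, using the optimized path" if alua
                else "bound, assigned to THIS SP")
    return "unbound or unknown"

# Fabricated 6-byte inquiry prefixes for illustration; only byte 5 matters.
pnr_inq = bytes([0, 0, 0, 0, 0, 0x00])   # TPGS = 00 -> PNR
alua_inq = bytes([0, 0, 0, 0, 0, 0x30])  # bits 4 and 5 set -> TPGS = 11 -> ALUA
```

Checking TPGS first, as suggested, removes the ambiguity: the same lun_state byte is then mapped to the right meaning for the mode the array is actually in.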
(In reply to comment #75)

I should caution you that an array can support either ALUA or PNR _and_ could be user-configured to either. Arrays prior to the newer VNX series were configured to PNR by default and user-selectable to ALUA. The new VNX series array is now the opposite: configured to ALUA by default and user-selectable to PNR.
Correct. The TPGS inquiry return is set per initiator (it changes with the failovermode setting of 1 or 4). So it really does depend on correct user setup, but if we probe it up front we should always return the right data.
Thanks Jerry and Wayne.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html