Bug 452897 - [RHEL 5.1] multipath -ll output shows mix of failover and multibus
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.1
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Assignee: Ben Marzinski
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 494961
Reported: 2008-06-25 17:35 UTC by Issue Tracker
Modified: 2018-10-20 01:27 UTC
CC List: 20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 11:46:14 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2009:1377 0 normal SHIPPED_LIVE device-mapper-multipath bug-fix and enhancement update 2009-09-01 12:41:23 UTC

Comment 1 Issue Tracker 2008-06-25 17:35:41 UTC
Occasionally seeing mixed output in multipath -ll as shown below

mpath5 (36006048000028746115853594d304545) dm-5 EMC,SYMMETRIX
[size=8.4G][features=0][hwhandler=0]
\_ round-robin 0 [prio=2][enabled]
 \_ 1:0:0:53 sdl 8:176 [active][ready]
 \_ 0:0:0:53 sdf 8:80  [active][ready]
mpath4 (36006048000028746115853594d304537) dm-4 EMC,SYMMETRIX
[size=8.4G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
 \_ 0:0:0:52 sde 8:64  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 1:0:0:52 sdk 8:160 [active][ready]

This event sent from IssueTracker by breeves  [Support Engineering Group]
 issue 159229

Comment 4 Issue Tracker 2008-06-25 17:35:47 UTC
Bryn - I think we might be seeing the problem right now on
san-1.gsslab.rdu.redhat.com:

[root@san-1 ~]# multipath -ll
mpath1 (3600a0b80001327510000012946f2504e) dm-3 IBM,1742-900
[size=29G][features=0][hwhandler=0]
\_ round-robin 0 [prio=6][active]
 \_ 2:0:0:1 sdc 8:32  [active][ready]
 \_ 3:0:0:1 sde 8:64  [active][ready]
mpath0 (3600a0b80001327d8000000114663f555) dm-2 IBM,1742-900
[size=4.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
 \_ 2:0:0:0 sdb 8:16  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 3:0:0:0 sdd 8:48  [active][ready]
[root@san-1 ~]# multipath -v5 -ll
dm-0: blacklisted
dm-1: blacklisted
dm-2: blacklisted
dm-3: blacklisted
dm-4: blacklisted
dm-5: blacklisted
dm-6: blacklisted
fd0: blacklisted
hdc: blacklisted
md0: blacklisted
ram0: blacklisted
ram10: blacklisted
ram11: blacklisted
ram12: blacklisted
ram13: blacklisted
ram14: blacklisted
ram15: blacklisted
ram1: blacklisted
ram2: blacklisted
ram3: blacklisted
ram4: blacklisted
ram5: blacklisted
ram6: blacklisted
ram7: blacklisted
ram8: blacklisted
ram9: blacklisted
sda: blacklisted
sdb: not found in pathvec
sdb: mask = 0x5
sdb: bus = 1
sdb: dev_t = 8:16
sdb: size = 8388608
sdb: vendor = IBM
sdb: product = 1742-900
sdb: rev = 0520
sdb: h:b:t:l = 2:0:0:0
sdb: tgt_node_name = 0x200200a0b8132751
sdb: path checker = tur (controller setting)
sdb: state = 2
sdc: not found in pathvec
sdc: mask = 0x5
sdc: bus = 1
sdc: dev_t = 8:32
sdc: size = 60609789
sdc: vendor = IBM
sdc: product = 1742-900
sdc: rev = 0520
sdc: h:b:t:l = 2:0:0:1
sdc: tgt_node_name = 0x200200a0b8132751
sdc: path checker = tur (controller setting)
sdc: state = 2
sdd: not found in pathvec
sdd: mask = 0x5
sdd: bus = 1
sdd: dev_t = 8:48
sdd: size = 8388608
sdd: vendor = IBM
sdd: product = 1742-900
sdd: rev = 0520
sdd: h:b:t:l = 3:0:0:0
sdd: tgt_node_name = 0x200200a0b8132751
sdd: path checker = tur (controller setting)
sdd: state = 2
sde: not found in pathvec
sde: mask = 0x5
sde: bus = 1
sde: dev_t = 8:64
sde: size = 60609789
sde: vendor = IBM
sde: product = 1742-900
sde: rev = 0520
sde: h:b:t:l = 3:0:0:1
sde: tgt_node_name = 0x200200a0b8132751
sde: path checker = tur (controller setting)
sde: state = 2
===== paths list =====
uuid hcil    dev dev_t pri dm_st  chk_st  vend/prod/rev
     2:0:0:0 sdb 8:16  0   [undef][ready] IBM,1742-900 
     2:0:0:1 sdc 8:32  0   [undef][ready] IBM,1742-900 
     3:0:0:0 sdd 8:48  0   [undef][ready] IBM,1742-900 
     3:0:0:1 sde 8:64  0   [undef][ready] IBM,1742-900 
params = 0 0 1 1 round-robin 0 2 1 8:32 1000 8:64 1000 
status = 1 0 0 1 1 A 0 2 0 8:32 A 0 8:64 A 1 
*word = 0, len = 1
*word = 0, len = 1
*word = 1, len = 1
*word = 1, len = 1
*word = round-robin, len = 11
*word = 0, len = 1
*word = 2, len = 1
*word = 1, len = 1
*word = 8:32, len = 4
*word = 1000, len = 4
*word = 8:64, len = 4
*word = 1000, len = 4
sdc: mask = 0x8
sdc: getprio = /sbin/mpath_prio_tpc /dev/%n (controller setting)
sdc: prio = 3
sde: mask = 0x8
sde: getprio = /sbin/mpath_prio_tpc /dev/%n (controller setting)
sde: prio = 3
*word = 1, len = 1
*word = 0, len = 1
*word = 1, len = 1
*word = A, len = 1
*word = 2, len = 1
*word = 0, len = 1
*word = A, len = 1
*word = 0, len = 1
*word = A, len = 1
*word = 1, len = 1
mpath1 (3600a0b80001327510000012946f2504e) dm-3 IBM,1742-900
[size=29G][features=0][hwhandler=0]
\_ round-robin 0 [prio=6][active]
 \_ 2:0:0:1 sdc 8:32  [active][ready]
 \_ 3:0:0:1 sde 8:64  [active][ready]
params = 0 0 2 1 round-robin 0 1 1 8:16 1000 round-robin 0 1 1 8:48 1000 
status = 1 0 0 2 1 E 0 1 0 8:16 A 0 E 0 1 0 8:48 A 1 
*word = 0, len = 1
*word = 0, len = 1
*word = 2, len = 1
*word = 1, len = 1
*word = round-robin, len = 11
*word = 0, len = 1
*word = 1, len = 1
*word = 1, len = 1
*word = 8:16, len = 4
*word = 1000, len = 4
*word = 1, len = 1
*word = 1, len = 1
*word = 8:48, len = 4
*word = 1000, len = 4
sdb: mask = 0x8
sdb: getprio = /sbin/mpath_prio_tpc /dev/%n (controller setting)
sdb: prio = 1
sdd: mask = 0x8
sdd: getprio = /sbin/mpath_prio_tpc /dev/%n (controller setting)
sdd: prio = 1
*word = 1, len = 1
*word = 0, len = 1
*word = 2, len = 1
*word = E, len = 1
*word = 1, len = 1
*word = 0, len = 1
*word = A, len = 1
*word = 0, len = 1
*word = E, len = 1
*word = 1, len = 1
*word = 0, len = 1
*word = A, len = 1
*word = 1, len = 1
mpath0 (3600a0b80001327d8000000114663f555) dm-2 IBM,1742-900
[size=4.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
 \_ 2:0:0:0 sdb 8:16  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 3:0:0:0 sdd 8:48  [active][ready]
[root@san-1 ~]# 




Comment 5 Issue Tracker 2008-06-25 17:35:48 UTC
I spent some time in the lab on this yesterday. I was not able to reproduce
exactly what the customer saw, but I did see something interesting that
might be related:

1) multipath -ll shows failover mode as expected
2) service multipathd stop
3) Pull one cable
4) service multipathd start
5) multipath -ll now shows multibus mode for both devices

So, we didn't get mix/match, but we did get a total change. Not sure if
there was debugging code loaded, but I reproduced this a couple of times.
Attaching messages file.



Comment 6 RHEL Program Management 2008-07-14 19:13:37 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Bryn M. Reeves 2009-02-05 18:57:34 UTC
The conditions under which this problem triggers were not really understood previously - I spent some time yesterday repeatedly getting this to happen under 5.3 with a Symm and lpfc HBAs. The problem has been observed "in the wild" when booting systems with a path disabled (either for testing or due to hw failures) and then later re-instating that path. These steps avoid the need for a reboot.

Steps to reproduce:

1. Disable one path of two (I disabled a port on the fc switch for this).
2. Wait for DEVLOSSTMO (1m by default)
3. Delete the disabled path via sysfs if entries persist
4. run multipath
   - discovers mpath with a single path & creates one PG
5. /etc/init.d/multipathd start
6. re-enable the switch port
7. Wait for device registration messages
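
The host-side commands for steps 3-5 might look like the following sketch (sdX is a hypothetical placeholder for the disabled path's device node; steps 1-2 and 6-7 happen on the FC switch, not the host). The commands are echoed rather than executed so the sketch is safe to run on a box without multipath hardware; drop the echo wrappers on a real system:

```shell
# Sketch of steps 3-5 (sdX is a placeholder, not a name from this bug).
disabled=sdX
echo "echo 1 > /sys/block/${disabled}/device/delete"   # step 3: drop the stale path via sysfs
echo "multipath"                                       # step 4: re-discover; creates a one-path PG
echo "/etc/init.d/multipathd start"                    # step 5: start the daemon
```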

At this point there seemed to be about a 50/50 chance of hitting the problem in the environment I was using yesterday. You can tell the problem happened because multipathd will log a strange "%s: failed to access path %s" message from libmultipath/structs_vec.c:verify_paths(), e.g.:

Feb  2 11:13:14 foo multipathd: mpath2: failed to access path

Notice that the 2nd %s is an empty string. This should be devname. The key point that I hadn't realised before is that the problem seems to lie in the ev_add_path code paths. We *never* see this problem when re-instating a previously failed path (I suspect it can't happen) - the problem only occurs when hot-adding a new path to an mpath from which it was absent when the map was created.

Comment 13 Bryn M. Reeves 2009-02-05 19:02:32 UTC
A bit more digging shows that multipathd really is trying to add a path with the name "". We were seeing 1m hangs during this period (same period as libmultipath/discovery.c:wait_for_file() uses when waiting for sysfs/dev entries to appear). Adding some printfs confirms this:

Feb  4 14:29:59 foo multipathd: timed out waiting for sysfs file
(/sys/block//dev) 
          ^^^
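
A one-line sketch shows where that doubled slash comes from: the sysfs file name is built from the path's device name, and for the half-initialized path that name is the empty string (the `devname` variable here is an assumption standing in for the unfilled field, not code from libmultipath):

```shell
# Sketch: building the sysfs path that multipathd waits on.
devname=""                            # never filled in for the hot-added path
sysfs_file="/sys/block/${devname}/dev"
echo "$sysfs_file"                    # -> /sys/block//dev
```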

Dumping the corresponding struct path:

multipathd: mpath1: failed to access path  
multipathd: struct path { 
multipathd:     dev =  
multipathd:     dev_t = 8:64 
multipathd:     scsi_idlun = (nil)      
multipathd:     sg_id = (nil) 
multipathd:     wwid =
36006048000028746115853594d304537
multipathd:     vendor_id =  
multipathd:     product_id =  
multipathd:     rev =  
multipathd:     serial =  
multipathd:     tgt_node_name =         
multipathd:     size = 0 
multipathd:     checkint = 0 
multipathd:     tick = 0 
multipathd:     bus = 0 
multipathd:     state = 0 
multipathd:     dmstate = 2 
multipathd:     failcount = 0 
multipathd:     priority = 0 
multipathd:     pgindex = 1 
multipathd:     getuid = (null)         
multipathd:     getprio = (null)        
multipathd:     getprio_selected = 0    
multipathd:     multipath = (nil)       
multipathd:     fd = -1 
multipathd:     hwe = (nil) 
multipathd: }

Comment 14 Bryn M. Reeves 2009-02-05 19:06:16 UTC
Upstream killed all the hokey libsysfs stuff a while ago:

commit 8ccef7c766a4891140488d12d93e5eb930271bf2
Author: Christophe Varoqui <cvaroqui>
Date:   Thu Jun 7 22:32:50 2007 +0200

    [libmultipath] Remove libsysfs
    
    libsysfs is deprecated and doesn't work with recent kernels.
    Copied over stuff from udev and implemented our own sysfs handling.
    Much saner now.
    
    Signed-off-by: Hannes Reinecke <hare>
    Signed-off-by: Guido Guenther <agx>

Will do a test build with these changes in to see if we can narrow this down a bit more.

Comment 15 Bryn M. Reeves 2009-02-05 19:08:04 UTC
The other way to reproduce this btw is:

1. Shutdown host
2. Disable one path of two
3. Boot host (with multipathd enabled)
4. Wait for everything to come up
5. Verify multipaths
6. Re-enable 2nd path
7. Wait for device registration messages

Comment 16 Ben Marzinski 2009-03-24 20:24:56 UTC
I am able to recreate this using the steps in comment #12.  This seems to be a race between multipath and multipathd modifying the device.  Currently, both multipath and multipathd will try to create/modify multipath devices when new paths are added.  I can work around the issue simply by commenting out the following line in /etc/udev/rules.d/40-multipath.rules

KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'", RUN+="/sbin/multipath -v0 %M:%m"

so it looks like

# KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'", RUN+="/sbin/multipath -v0 %M:%m"

This keeps udev from running multipath to create the device. Multipath is still called at system startup, so boot-time device creation is unaffected. The only case where this change would cause problems is if you need multipath devices created automatically but don't want to run multipathd (and I can't think of a reason why you wouldn't want multipathd running).
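
One non-interactive way to apply that edit is a sed one-liner. This is a sketch demonstrated against a scratch copy of the rule; on a real system the target is /etc/udev/rules.d/40-multipath.rules (run as root and keep a backup):

```shell
# Sketch of the workaround: comment out the rule that has udev run
# multipath when a new block device is added.
rules=$(mktemp)
cat > "$rules" <<'EOF'
KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'", RUN+="/sbin/multipath -v0 %M:%m"
EOF
# Prefix the multipath-invoking rule with "# " so udev ignores it
sed -i '/RUN+=.\/sbin\/multipath -v0/s/^[^#]/# &/' "$rules"
cat "$rules"
```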

In Fedora and RHEL 6, this rule will not be present at all, and only multipathd will create/modify multipath devices for paths that are added after bootup. I'll figure out a way to get this working without the rule change, but in the meantime, can you please try the workaround and verify that it solves the problem.

Comment 17 Ben Marzinski 2009-03-26 03:35:35 UTC
Like I mentioned in the previous comment, this bug is caused by a race between multipath and multipathd when a new path is added. If multipath modifies the map before the path is added by multipathd, two things happen. First, you get some annoying messages:

multipathd: mpath2: failed to access path

These don't cause any harm. But worse, in some situations, multipath will clear the map's hardware table entry, which means that the device specific configuration options won't be applied.  I've committed a fix for both of these issues.

Comment 18 Chris Van Hoof 2009-03-27 00:57:25 UTC
(In reply to comment #17)
[...]  
> I've committed a fix for both of these issues.  

Is that fix udev related, or is the udev change only a workaround, until the actual fix is released?

--chris

Comment 19 Chris Van Hoof 2009-03-27 01:01:06 UTC
(In reply to comment #16)
[...]
> I'll figure out a way to get this working without the rule change, but can you
> please try the workaround to verify that it solves the problem.  

We'll test this out ASAP, and I'll let you know what we find.

--chris

Comment 20 Ben Marzinski 2009-03-27 14:42:43 UTC
The udev fix is a workaround.  The actual fix that I committed is in the multipathd code.

Comment 21 Chris Van Hoof 2009-03-30 16:41:37 UTC
(In reply to comment #20)
> The udev fix is a workaround.  The actual fix that I committed is in the
> multipathd code.  

Thanks!
Just so we know, when is the intended release for this?  5.4, 5.3.z?

--chris

Comment 22 Ben Marzinski 2009-03-30 18:09:34 UTC
5.4. If the workaround doesn't fix it and it's causing serious problems, 5.3.z is
possible. But please try the workaround.

Comment 28 Chris Ward 2009-07-03 18:04:03 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe them here and set the bug to NEEDINFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 30 errata-xmlrpc 2009-09-02 11:46:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1377.html

