Bug 445268

Summary: [NetApp 4.8 bug] multipath may create WWID named multipath maps even when user_friendly_names is set
Product: Red Hat Enterprise Linux 4
Reporter: Dave Wysochanski <dwysocha>
Component: device-mapper-multipath
Assignee: Dave Wysochanski <dwysocha>
Status: CLOSED ERRATA
QA Contact: Corey Marthaler <cmarthal>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.7
CC: agk, andriusb, bmarzins, christophe.varoqui, cward, dwysocha, edamato, egoggin, heinzm, junichi.nomura, kueda, lmb, mbroz, naveenr, prockai, tranlan, xdl-redhat-bugzilla
Target Milestone: beta
Keywords: OtherQA
Target Release: 4.8
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* multipath was sometimes unable to get a lock on the bindings file before it reached its timeout limit for doing so (3 seconds). If this happened, multipath could not create devices with user-friendly names and would create multipath maps named with WWIDs even when the user_friendly_names option was set. The timeout for multipath to get a lock on the bindings file is now 30 seconds, which ensures that multipath always honors the user_friendly_names option.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-05-18 20:31:46 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 445585, 450897    

Description Dave Wysochanski 2008-05-05 20:48:58 UTC
Description of problem:

While debugging another problem with dm-mp and a large number of paths, I came
across this one and I'm debugging now.  I only noticed it because I was counting
paths and multipath maps and noticed a discrepancy.

Sometimes "multipath -ll" will not display the alias in the output but just the
wwid like this (note the second entry):

mpath412 (360a980006e424f36475a437076336651)
[size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=4][active]
 \_ 2:0:0:29 sdbg 67:160 [active][ready]
 \_ 3:0:0:29 sdbh 67:176 [active][ready]

360a980006e424f36475a4343552d3739
[size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=4][active]
 \_ 2:0:0:7  sdp  8:240  [active][ready]
 \_ 3:0:0:7  sdo  8:224  [active][ready]

mpath360 (360a980006e424f36475a4343552d386e)
[size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=4][active]
 \_ 2:0:0:9  sds  65:32  [active][ready]
 \_ 3:0:0:9  sdt  65:48  [active][ready]




Version-Release number of selected component (if applicable):


How reproducible:
On my current system it happens every time with the same path, so it should be
debuggable/fixable.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

[root@rhel4u5-node1-dwysocha multipath-tools]# rpm -q device-mapper-multipath
device-mapper-multipath-0.4.5-31.el4
[root@rhel4u5-node1-dwysocha multipath-tools]# uname -a
Linux rhel4u5-node1-dwysocha 2.6.9-67.0.7.ELxenU #1 SMP Wed Feb 27 05:07:29 EST
2008 x86_64 x86_64 x86_64 GNU/Linux

Comment 1 Dave Wysochanski 2008-05-05 20:59:09 UTC
Looks like in some cases the dm map name is actually UUID-based, even with
user_friendly_names set.  Race somewhere?

Comment 2 Dave Wysochanski 2008-05-06 20:22:57 UTC
I believe this is the result of a failure to get a lock on the bindings file.  I
think what happens is this:
- select_alias() is called and fails at get_user_friendly_alias()
- get_user_friendly_alias() fails at open_bindings_file()
- open_bindings_file() fails at lock_bindings_file()

So far I've been unable to prove this though as the problem seems intermittent.
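The suspected contention can be sketched outside multipath itself.  multipath's C code takes an fcntl()-style advisory lock on /var/lib/multipath/bindings; the Python sketch below uses flock() instead (POSIX record locks do not conflict between descriptors within one process, flock() locks do), but the two-opener contention it shows is analogous.  File names and variables here are illustrative only:

```python
import fcntl
import os
import tempfile

# Two independent opens of the same file contend for the same advisory
# flock() lock -- analogous to two multipath instances racing for the
# bindings file, where the loser gives up and falls back to the WWID name.
path = os.path.join(tempfile.mkdtemp(), "bindings")
open(path, "w").close()

fd1 = os.open(path, os.O_RDWR)
fd2 = os.open(path, os.O_RDWR)

fcntl.flock(fd1, fcntl.LOCK_EX)  # first instance holds the lock
try:
    # second instance tries a non-blocking lock, as a failing opener would
    fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    locked_second = True
except BlockingIOError:
    locked_second = False  # "Bindings file is locked. Giving up."

print(locked_second)  # False on Linux: contention reproduced
os.close(fd1)
os.close(fd2)
```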

Comment 3 Dave Wysochanski 2008-05-06 21:56:08 UTC
Confirmed this is the case.  It was hard to reproduce, but this is definitely
what is happening.  A workaround (if we want one) would be to retry the fcntl()
if it fails.

What I did to track this down was patch /etc/dev.d/block/multipath.dev to dump
output to a file rather than /dev/null.  I also patched the code to print the
WWID when it failed in the aforementioned path (should probably submit that as a
patch for debugging since the code just silently fails there now).  Then as
below, I just take dmsetup output, grep out the maps that are WWID based, and
then look through the log file for these maps.  Sure enough, the maps that are
WWID based are the same ones that fail in that code path.

So now I know the precise failure - just need to think about the options and
come up with code to fix this.

[root@rhel4u5-node1-dwysocha ~]# dmsetup table | grep -v ^mpath | grep -v
^VolGroup | awk '{ print substr($1, 1, length($1)-1) }' | xargs -n 1 -I '{}'
grep '{}' /var/lib/multipath/bindings 
mpath357 360a980006e424f36475a4343552d3554
mpath361 360a980006e424f36475a4343552d364a
mpath374 360a980006e424f36475a437076336343
[root@rhel4u5-node1-dwysocha ~]# grep 360a980006e424f36475a437076336343
/tmp/multipathlog.txt 
Cannot open bindings file for WWID [360a980006e424f36475a437076336343]
[root@rhel4u5-node1-dwysocha ~]# grep 360a980006e424f36475a4343552d364a
/tmp/multipathlog.txt 
Cannot open bindings file for WWID [360a980006e424f36475a4343552d364a]
[root@rhel4u5-node1-dwysocha ~]# grep 360a980006e424f36475a4343552d3554
/tmp/multipathlog.txt 
Cannot open bindings file for WWID [360a980006e424f36475a4343552d3554]


Comment 4 Dave Wysochanski 2008-05-06 21:58:25 UTC
I am also seeing other failure messages in the multipath log file ("lsetfilecon"
failures, etc).  I have not tracked these down yet.

<snip of log>
Cannot open bindings file for WWID [360a980006e424f36475a437076336563]
Bindings file is locked. Giving up.
Err = -1
Cannot open bindings file for WWID [360a980006e424f36475a4343552d3554]
Bindings file is locked. Giving up.
Err = -1
Cannot open bindings file for WWID [360a980006e424f36475a4343552d364a]
Bindings file is locked. Giving up.
Err = -1
Cannot open bindings file for WWID [360a980006e424f36475a437076336343]
device-mapper: table ioctl failed: No such device or address
/dev/mapper/mpath413: lsetfilecon failed: No such file or directory
/dev/mapper/mpath392: lsetfilecon failed: No such file or directory
/dev/mapper/mpath396: lsetfilecon failed: No such file or directory
/dev/mapper/mpath383: lsetfilecon failed: No such file or directory


Comment 5 Dave Wysochanski 2008-05-06 22:05:33 UTC
I should clarify.  I am running xen domU with iscsi underneath.  The number of
paths are not too large - I think there are 50 maps with 2 paths each for a
total of 100 paths.

Comment 6 Dave Wysochanski 2008-05-07 15:19:27 UTC
A simple way to reproduce this is to do the following.

Setup a multipathed LUN with at least 2 paths.  Make sure no map exists for the
LUN.  Then:
1. Run first instance of multipath in gdb.  Set a breakpoint at the end of
lock_bindings_file().  Run it with "-v0 /dev/path1"
2. Run second instance of multipath with "-v0 /dev/path2".  This should fail
with a "Bindings file is locked" error.

At this point the map will have been created with only the WWID, not the
user_friendly_name.

I have not tried upstream yet but at first glance it at least appears to have
the same code paths.

This is worse than just a display problem since the multipath map is created
with the wrong name - the WWID - instead of the user_friendly_name.  I'm not
sure this warrants an exception for the release but it should be fixed.

Some ideas:
1) Use read/shared locks unless we know we need to update the bindings file.  In
many cases this would reduce lock contention so we would not have to retry.  We
still need to address the retry problem, though, when new LUNs are first seen by
the system.
2) Retry the fcntl() if we fail to obtain the lock.  This is the must-fix (the
first idea is just an optimization and reduces, but does not eliminate, the
probability you'd see this problem).
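Idea 2 amounts to wrapping the non-blocking lock attempt in a retry loop.  The sketch below is an illustrative Python analogue, not the actual C fix; lock_with_retry() and its parameters are invented for the example:

```python
import errno
import fcntl
import os
import tempfile
import time

def lock_with_retry(fd, attempts=10, delay=0.1):
    """Retry a non-blocking exclusive lock instead of giving up after
    the first failure (sketch of idea 2 above)."""
    for _ in range(attempts):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError as err:
            if err.errno not in (errno.EACCES, errno.EAGAIN):
                raise  # real error, not contention
            time.sleep(delay)  # another instance holds the lock; retry
    return False

# Uncontended case: the lock is granted on the first attempt.
path = os.path.join(tempfile.mkdtemp(), "bindings")
open(path, "w").close()
fd = os.open(path, os.O_RDWR)
got = lock_with_retry(fd)
print(got)
os.close(fd)
```

Retrying narrows the window but, as noted above, bounded retries still fail under long-held locks, which is why the eventual fix raised the timeout instead.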

Comment 7 Rajashekhar M A 2008-06-24 19:44:50 UTC
1. I am hitting this issue on RHEL 4.7 Snapshot 2.
I booted the machine after I mapped 256 LUNs to the host with 4 paths each (256
* 4 = 1024 paths):

mpath262 (360a9800050334c66465a4a3648554e67)
[size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=8][enabled]
 \_ 3:0:0:101 sdcy  70:96    [active][ready]
 \_ 4:0:0:101 sdmu  70:352   [active][ready]
\_ round-robin 0 [prio=2][enabled]
 \_ 3:0:1:101 sdwq  70:608   [active][ready]
 \_ 4:0:1:101 sdagm 70:864   [active][ready]

360a9800050334c66465a4a3648527a6a
[size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=8][enabled]
 \_ 3:0:0:63  sdbm  68:0     [active][ready]
 \_ 4:0:0:63  sdli  68:256   [active][ready]
\_ round-robin 0 [prio=2][enabled]
 \_ 3:0:1:63  sdve  68:512   [active][ready]
 \_ 4:0:1:63  sdafa 68:768   [active][ready]

mpath159 (360a9800050334c4d614a4a3648627947)
[size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=8][enabled]
 \_ 3:0:1:236 sdabv 134:720  [active][ready]
 \_ 4:0:1:236 sdalr 134:976  [active][ready]
\_ round-robin 0 [prio=2][enabled]
 \_ 3:0:0:236 sdid  134:208  [active][ready]
 \_ 4:0:0:236 sdrz  134:464  [active][ready]

Will this bug be fixed in 4.7 release?

2. I also saw hald segfaulting while booting. This issue is intermittent.

Below are extracts from syslog -

Jun 25 00:26:09 lnx199-115 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jun 25 00:26:09 lnx199-115 kernel: i2c /dev entries driver
Jun 25 00:26:09 lnx199-115 kernel: lp: driver loaded but no devices found
Jun 25 00:26:09 lnx199-115 kernel: hald[16527]: segfault at 0000000000000000 rip
0000003b34221b26 rsp 0000007fbfffee50 error 4
Jun 25 00:28:24 lnx199-115 sshd(pam_unix)[22123]: session opened for user root
by (uid=0)


Comment 8 RHEL Program Management 2008-09-05 17:22:23 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Dave Wysochanski 2008-09-23 02:24:13 UTC
I just noticed there's a timeout of 3 seconds on the fcntl that obtains the lock.  As a simple test I raised this to a larger value and this fixed the problem on my particular setup with 50 luns.  Not clear at this point if there's a better solution that is reasonable to implement - need to explore more.

diff --git a/libmultipath/alias.h b/libmultipath/alias.h
index 3d64b9b..fe1191b 100644
--- a/libmultipath/alias.h
+++ b/libmultipath/alias.h
@@ -1,4 +1,4 @@
-#define BINDINGS_FILE_TIMEOUT 3
+#define BINDINGS_FILE_TIMEOUT 30
 #define BINDINGS_FILE_HEADER \
 "# Multipath bindings, Version : 1.0\n" \
 "# NOTE: this file is automatically maintained by the multipath program.\n" \
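For illustration, the timeout semantics this patch adjusts can be sketched as an alarm-bounded blocking lock.  This is a hypothetical Python analogue of the bounded lock acquisition in libmultipath, not the real C code; the function name and constant only loosely mirror the patch:

```python
import fcntl
import os
import signal
import tempfile

BINDINGS_FILE_TIMEOUT = 30  # raised from 3 seconds, per the patch above

def lock_bindings_file(fd, timeout=BINDINGS_FILE_TIMEOUT):
    """Blocking exclusive lock bounded by an alarm (illustrative
    analogue of multipath's timed bindings-file lock)."""
    def on_alarm(signum, frame):
        raise TimeoutError("Bindings file is locked. Giving up.")
    old = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until lock or alarm
        return True
    except TimeoutError:
        return False  # timed out; caller falls back to the WWID name
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)

# Uncontended case: the lock is granted well before the timeout.
path = os.path.join(tempfile.mkdtemp(), "bindings")
open(path, "w").close()
fd = os.open(path, os.O_RDWR)
ok = lock_bindings_file(fd, timeout=2)
print(ok)
os.close(fd)
```

The one-line patch keeps this structure intact and only widens the window, trading a longer worst-case wait for never falling back to a WWID-named map.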

Comment 10 Dave Wysochanski 2008-09-23 03:49:02 UTC
I'm attempting to profile multipath to see exactly where it's taking so much time but so far no luck.  I do know that if I run a script that basically fires off many multipath instances (pretty close to what happens when I do an "iscsi start"), then run "multipath -v0" manually myself, the manual run of multipath takes about 11 seconds on my testbed.

Looking at the history of this code - originally the lock-obtaining part had a retry mechanism, and later (2005) it was changed to a timeout.  My only question is whether there's a bug somewhere such that a process holds the advisory lock unnecessarily, or whether the hold is truly necessary and we just need to live with a larger timeout.  What a reasonable "large value" is remains subject to debate, though.  Since I'm not entirely confident in my understanding of the code here, it's probably worth a little more looking to see if we can improve the efficiency.

Comment 11 Dave Wysochanski 2008-09-23 18:06:54 UTC
Talked with Ben and posted the timeout patch to dm-devel.  Will look for better fixes to the code but timeout patch seems reasonable for now, esp considering this is 4.8 release.

Comment 12 Dave Wysochanski 2008-09-24 17:02:59 UTC
Timeout patch test rpms are here:
http://people.redhat.com/dwysocha/test/RHEL4.8/device-mapper-multipath/

I've asked netapp to verify on their setup.

Comment 13 Dave Wysochanski 2008-10-09 17:32:28 UTC
Upstream in maintainer's tree, commit: 669b5c11d8c82252f9697e35d183a0c840386261

Just heard back from Netapp - preliminary tests are ok:

"I just took the rpms and tested with some 255 Luns mapped to my host.
Several reboots and several flush-and-recreate maps.
Looks like this fix works. I didn't see any map with wwid name.

However, we would test this again once the Betas are out.

Best Regards,

Rajashekhar M A
SAN Linux Engineering
NetApp
Direct +91 80 4184 3538
Mobile +91 98457 72007
Rajashekhar.A
www.netapp.com"

Comment 14 Dave Wysochanski 2008-10-09 19:06:35 UTC
Checked in to RHEL4_FC5 cvs:
libmultipath/alias.h
Checking in libmultipath/alias.h;
/cvs/dm/multipath-tools/libmultipath/alias.h,v  <--  alias.h
new revision: 1.1.2.1; previous revision: 1.1
done

Comment 16 Ruediger Landmann 2009-02-02 05:46:38 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
* multipath was sometimes unable to get a lock on the bindings file before it reached its timeout limit for doing so (3 seconds). If this happened, multipath could not create devices with user-friendly names and would create multipath maps named with WWIDs even when the user_friendly_names option was set. The timeout for multipath to get a lock on the bindings file is now 30 seconds, which ensures that multipath always honors the user_friendly_names option.

Comment 17 Chris Ward 2009-02-20 12:09:07 UTC
~~ Attention Partners!  ~~
RHEL 4.8 Partner Alpha has been released on partners.redhat.com. There should be a fix present in the Beta, which addresses this URGENT priority bug. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure that only the highest possible quality bits are shipped in the upcoming public Beta drop.

If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager.

Thanks, more information about Beta testing to come.
 - Red Hat QE Partner Management

Comment 18 Chris Ward 2009-02-20 13:30:32 UTC
~~ Attention Partners!  ~~
RHEL 4.8 Partner Alpha has been released on partners.redhat.com. There should
be a fix present in the Beta, which addresses this bug. If you have already completed testing your other URGENT priority bugs, and you still haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure that only the highest possible quality bits are shipped in the upcoming public Beta drop.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. Further questions can be directed to your
Red Hat Partner Manager.

Thanks, more information about Beta testing to come.
 - Red Hat QE Partner Management

Comment 19 Chris Ward 2009-03-13 14:02:58 UTC
~~ Attention Partners!  ~~
RHEL 4.8 Beta has been released on partners.redhat.com. There should
be a fix present which addresses this bug. Please test and report back results on this OtherQA Partner bug at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe any issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you've encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above and leave a comment with your test result details, including which arches were tested, the package version, and any applicable logs.

 - Red Hat QE Partner Management

Comment 20 Naveen Reddy 2009-03-17 08:44:32 UTC
Verified this issue in RHEL4.8 Beta. 
This issue is not seen anymore.

Comment 22 errata-xmlrpc 2009-05-18 20:31:46 UTC
An advisory has been issued which should help the problem
described in this bug report.  This report is therefore being
closed with a resolution of ERRATA.  For more information
on the solution and/or where to find the updated files,
please follow the link below.  You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1011.html