Bug 445268
Summary: | [NetApp 4.8 bug] multipath may create WWID-named multipath maps even when user_friendly_names is set | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Dave Wysochanski <dwysocha> |
Component: | device-mapper-multipath | Assignee: | Dave Wysochanski <dwysocha> |
Status: | CLOSED ERRATA | QA Contact: | Corey Marthaler <cmarthal> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.7 | CC: | agk, andriusb, bmarzins, christophe.varoqui, cward, dwysocha, edamato, egoggin, heinzm, junichi.nomura, kueda, lmb, mbroz, naveenr, prockai, tranlan, xdl-redhat-bugzilla |
Target Milestone: | beta | Keywords: | OtherQA |
Target Release: | 4.8 | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
* multipath was sometimes unable to get a lock on the bindings file before it reached its timeout limit for doing so (3 seconds). If this happened, multipath could not create devices with user-friendly names and would create multipath maps named with WWIDs even when the user_friendly_names option was set. The timeout for multipath to get a lock on the bindings file is now 30 seconds, which ensures that multipath always honors the user_friendly_names option.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2009-05-18 20:31:46 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 445585, 450897 |
Description
Dave Wysochanski
2008-05-05 20:48:58 UTC
Looks like in some cases the dm map name is actually UUID based, even with user friendly names set. Race somewhere?

I believe this is the result of a failure to get a lock on the bindings file. I think what happens is this:

- select_alias() is called and fails at get_user_friendly_alias()
- get_user_friendly_alias() fails at open_bindings_file()
- open_bindings_file() fails at lock_bindings_file()

So far I've been unable to prove this, though, as the problem seems intermittent.

Confirmed this is the case. It was hard to reproduce, but this is definitely what is happening. A workaround (if we want one) would be to retry the fcntl() if it fails.

What I did to track this down was patch /etc/dev.d/block/multipath.dev to dump output to a file rather than /dev/null. I also patched the code to print the WWID when it failed in the aforementioned path (should probably submit that as a debugging patch, since the code just silently fails there now). Then, as below, I take the dmsetup output, grep out the maps that are WWID based, and look through the log file for those maps. Sure enough, the maps that are WWID based are the same ones that fail in that code path. So now I know the precise failure - just need to think about the options and come up with code to fix this.
```
[root@rhel4u5-node1-dwysocha ~]# dmsetup table | grep -v ^mpath | grep -v ^VolGroup | awk '{ print substr($1, 1, length($1)-1) }' | xargs -n 1 -I '{}' grep '{}' /var/lib/multipath/bindings
mpath357 360a980006e424f36475a4343552d3554
mpath361 360a980006e424f36475a4343552d364a
mpath374 360a980006e424f36475a437076336343
[root@rhel4u5-node1-dwysocha ~]# grep 360a980006e424f36475a437076336343 /tmp/multipathlog.txt
Cannot open bindings file for WWID [360a980006e424f36475a437076336343]
[root@rhel4u5-node1-dwysocha ~]# grep 360a980006e424f36475a4343552d364a /tmp/multipathlog.txt
Cannot open bindings file for WWID [360a980006e424f36475a4343552d364a]
[root@rhel4u5-node1-dwysocha ~]# grep 360a980006e424f36475a4343552d3554 /tmp/multipathlog.txt
Cannot open bindings file for WWID [360a980006e424f36475a4343552d3554]
```

I am also seeing other failure messages in the multipath log file ("lsetfilecon" failures, etc). I have not tracked these down yet.

<snip of log>

```
Cannot open bindings file for WWID [360a980006e424f36475a437076336563]
Bindings file is locked. Giving up. Err = -1
Cannot open bindings file for WWID [360a980006e424f36475a4343552d3554]
Bindings file is locked. Giving up. Err = -1
Cannot open bindings file for WWID [360a980006e424f36475a4343552d364a]
Bindings file is locked. Giving up. Err = -1
Cannot open bindings file for WWID [360a980006e424f36475a437076336343]
device-mapper: table ioctl failed: No such device or address
/dev/mapper/mpath413: lsetfilecon failed: No such file or directory
/dev/mapper/mpath392: lsetfilecon failed: No such file or directory
/dev/mapper/mpath396: lsetfilecon failed: No such file or directory
/dev/mapper/mpath383: lsetfilecon failed: No such file or directory
```

I should clarify: I am running xen domU with iscsi underneath.
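The fcntl() retry workaround mentioned above could be sketched as follows. This is a minimal illustration under assumptions, not the actual multipath-tools code: lock_bindings_file_retry() is a hypothetical name, and the one-second backoff is an arbitrary choice for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the proposed workaround: instead of giving up the first time
 * the non-blocking lock attempt fails, retry a bounded number of times.
 * lock_bindings_file_retry() is a hypothetical name, not the function in
 * libmultipath. */
int lock_bindings_file_retry(int fd, int max_retries)
{
	struct flock lock = {
		.l_type = F_WRLCK,	/* exclusive lock, as for updating the bindings file */
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,		/* 0 = lock the whole file */
	};
	int i;

	for (i = 0; i <= max_retries; i++) {
		if (fcntl(fd, F_SETLK, &lock) == 0)
			return 0;	/* lock acquired */
		if (i < max_retries)
			sleep(1);	/* back off before retrying */
	}
	return -1;			/* still locked after all retries */
}
```

As noted in the comment, this only reduces the window: with enough concurrent multipath instances a caller can still exhaust its retries.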
The number of paths is not too large - I think there are 50 maps with 2 paths each, for a total of 100 paths.

A simple way to reproduce this is the following. Set up a multipathed LUN with at least 2 paths. Make sure no map exists for the LUN. Then:

1. Run the first instance of multipath in gdb. Set a breakpoint at the end of lock_bindings_file(). Run it with "-v0 /dev/path1".
2. Run a second instance of multipath with "-v0 /dev/path2". This should fail with a "Bindings file is locked" error. At this point the map will have been created with only the WWID, not the user_friendly_name.

I have not tried upstream yet, but at first glance it at least appears to have the same code paths. This is worse than just a display problem, since the multipath map is created with the wrong name - the WWID - instead of the user_friendly_name. I'm not sure this warrants an exception for the release, but it should be fixed.

Some ideas:

1) Use read/shared locks unless we know we need to update the bindings file. In many cases this would reduce lock contention, so we would not have to retry. We still need to address the retry problem, though, when new LUNs are first seen by the system.
2) Retry the fcntl if we fail to obtain the lock. This is the must-fix (the first idea is just an optimization and reduces, but does not eliminate, the probability you'd see this problem).

1. I am hitting this issue on RHEL 4.7 Snapshot 2.
I booted the machine after I mapped 256 LUNs to the host with 4 paths each (256 * 4 = 1024 paths):

```
mpath262 (360a9800050334c66465a4a3648554e67) [size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=8][enabled]
 \_ 3:0:0:101 sdcy  70:96  [active][ready]
 \_ 4:0:0:101 sdmu  70:352 [active][ready]
\_ round-robin 0 [prio=2][enabled]
 \_ 3:0:1:101 sdwq  70:608 [active][ready]
 \_ 4:0:1:101 sdagm 70:864 [active][ready]
360a9800050334c66465a4a3648527a6a [size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=8][enabled]
 \_ 3:0:0:63 sdbm  68:0   [active][ready]
 \_ 4:0:0:63 sdli  68:256 [active][ready]
\_ round-robin 0 [prio=2][enabled]
 \_ 3:0:1:63 sdve  68:512 [active][ready]
 \_ 4:0:1:63 sdafa 68:768 [active][ready]
mpath159 (360a9800050334c4d614a4a3648627947) [size=10 MB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=8][enabled]
 \_ 3:0:1:236 sdabv 134:720 [active][ready]
 \_ 4:0:1:236 sdalr 134:976 [active][ready]
\_ round-robin 0 [prio=2][enabled]
 \_ 3:0:0:236 sdid  134:208 [active][ready]
 \_ 4:0:0:236 sdrz  134:464 [active][ready]
```

Will this bug be fixed in the 4.7 release?

2. I also saw hald segfaulting while booting. This issue is intermittent. Below are extracts from syslog:

```
Jun 25 00:26:09 lnx199-115 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jun 25 00:26:09 lnx199-115 kernel: i2c /dev entries driver
Jun 25 00:26:09 lnx199-115 kernel: lp: driver loaded but no devices found
Jun 25 00:26:09 lnx199-115 kernel: hald[16527]: segfault at 0000000000000000 rip 0000003b34221b26 rsp 0000007fbfffee50 error 4
Jun 25 00:28:24 lnx199-115 sshd(pam_unix)[22123]: session opened for user root by (uid=0)
```

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products.
This request is not yet committed for inclusion in an Update release.

I just noticed there's a timeout of 3 seconds on the fcntl that obtains the lock. As a simple test I raised this to a larger value, and this fixed the problem on my particular setup with 50 LUNs. It's not clear at this point whether there's a better solution that is reasonable to implement - need to explore more.

```
diff --git a/libmultipath/alias.h b/libmultipath/alias.h
index 3d64b9b..fe1191b 100644
--- a/libmultipath/alias.h
+++ b/libmultipath/alias.h
@@ -1,4 +1,4 @@
-#define BINDINGS_FILE_TIMEOUT 3
+#define BINDINGS_FILE_TIMEOUT 30
 #define BINDINGS_FILE_HEADER \
 "# Multipath bindings, Version : 1.0\n" \
 "# NOTE: this file is automatically maintained by the multipath program.\n" \
```

I'm attempting to profile multipath to see exactly where it's taking so much time, but so far no luck. I do know that if I run a script that basically fires off many multipath instances (pretty close to what happens when I do an "iscsi start"), and then run "multipath -v0" manually myself, the manual run of multipath takes about 11 seconds on my testbed.

Looking at the history of this code: originally the lock-obtaining part had a retry mechanism, and later (2005) it was changed to a timeout. My only question is whether there's a bug there somewhere - a process holding the advisory lock unnecessarily - or whether the long hold is truly necessary and we just need to live with a larger timeout. What a reasonable "large value" is remains subject to debate, though. Since I'm not entirely confident in my understanding of the code here, it's probably worth a little more looking to see if we can improve the efficiency.

Talked with Ben and posted the timeout patch to dm-devel. Will look for better fixes to the code, but the timeout patch seems reasonable for now, especially considering this is the 4.8 release.
Timeout patch test rpms are here: http://people.redhat.com/dwysocha/test/RHEL4.8/device-mapper-multipath/

I've asked NetApp to verify on their setup.

Upstream in maintainer's tree, commit: 669b5c11d8c82252f9697e35d183a0c840386261

Just heard back from NetApp - preliminary tests are ok:

"I just took the rpms and tested with some 255 LUNs mapped to my host. Several reboots and several flush-and-recreate maps. Looks like this fix works. I didn't see any map with a WWID name. However, we would test this again once the Betas are out.

Best Regards,
Rajashekhar M A
SAN Linux Engineering
NetApp
Direct +91 80 4184 3538
Mobile +91 98457 72007
www.netapp.com"

Checked in to RHEL4_FC5 cvs:

```
Checking in libmultipath/alias.h;
/cvs/dm/multipath-tools/libmultipath/alias.h,v  <--  alias.h
new revision: 1.1.2.1; previous revision: 1.1
done
```

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

* multipath was sometimes unable to get a lock on the bindings file before it reached its timeout limit for doing so (3 seconds). If this happened, multipath could not create devices with user-friendly names and would create multipath maps named with WWIDs even when the user_friendly_names option was set. The timeout for multipath to get a lock on the bindings file is now 30 seconds, which ensures that multipath always honors the user_friendly_names option.

~~ Attention Partners! ~~

RHEL 4.8 Partner Alpha has been released on partners.redhat.com. There should be a fix present in the Beta, which addresses this URGENT priority bug. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure that only the highest possible quality bits are shipped in the upcoming public Beta drop.
If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager. Thanks, more information about Beta testing to come.

- Red Hat QE Partner Management

~~ Attention Partners! ~~

RHEL 4.8 Partner Alpha has been released on partners.redhat.com. There should be a fix present in the Beta, which addresses this bug. If you have already completed testing your other URGENT priority bugs and you still haven't had a chance to test this bug, please do so at your earliest convenience, to ensure that only the highest possible quality bits are shipped in the upcoming public Beta drop. If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager. Thanks, more information about Beta testing to come.

- Red Hat QE Partner Management

~~ Attention Partners! ~~

RHEL 4.8 Beta has been released on partners.redhat.com. There should be a fix present which addresses this bug. Please test and report back results on this OtherQA Partner bug at your earliest convenience. If you encounter any issues, please set the bug back to the ASSIGNED state and describe any issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you've encountered. Further questions can be directed to your Red Hat Partner Manager. If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above and leave a comment with your test result details, including which arches were tested, the package version, and any applicable logs.

- Red Hat QE Partner Management

Verified this issue in RHEL 4.8 Beta. This issue is not seen anymore.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1011.html