Bug 181309

Summary: update 'no_path_retry' with 'multipath' command fails to update dm table
Product: Red Hat Enterprise Linux 4 Reporter: Lan Tran <tranlan>
Component: device-mapper-multipathAssignee: Alasdair Kergon <agk>
Status: CLOSED ERRATA QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 4.3CC: agk, bmarzins, christophe.varoqui, egoggin, junichi.nomura, kueda, lmb, mbroz, tranlan, xdl-redhat-bugzilla
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: RHEA-2006-0513 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-08-10 21:45:35 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 181409    

Description Lan Tran 2006-02-13 13:05:35 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050729 Fedora/1.0.6-3.EL (CK-IBM) Firefox/1.0.6

Description of problem:
If change no_path_retry setting by updating /etc/multipath.conf and running 'multipath' to reconfigure, the dm tables are not updated to reflect the new no_path_retry settings. 

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.5-12.0.RHEL4

How reproducible:
Always

Steps to Reproduce:
1. configure with 'multipath' and 'no_path_retry queue' option

# dmsetup table mpath11
0 1953152 multipath 1 queue_if_no_path 0 1 1 round-robin 0 4 1 68:112 10 68:240 10 8:112 10 8:240 10

2. edit /etc/multipath.conf with new 'no_path_retry fail' setting to disable queueing 
3. run 'multipath -v3'

Found matching wwid [1IBM_____2105____________73FFCA30] in bindings file.
Setting alias to mpath11
pgpolicy = group_by_serial (controler setting)
selector = round-robin 0 (controler setting)
features = 0 (internal default)
hwhandler = 0 (controler setting)
rr_weight = 1 (internal default)
no_path_retry = -1 (config file default)
pg_timeout = NONE (internal default)
0 1953152 multipath 0 0 1 1 round-robin 0 4 1 68:112 10 68:240 10 8:112 10 8:240 10
set ACT_NOTHING: map unchanged

 

Actual Results:  ACTUAL: table did not get updated

# dmsetup table mpath11
0 1953152 multipath 1 queue_if_no_path 0 1 1 round-robin 0 4 1 68:112 10 68:240 10 8:112 10 8:240 10

Expected Results:  EXPECTED: queue_if_no_path feature to be disabled

e.g. same table as printed during 'multipath -v3'

0 1953152 multipath 0 0 1 1 round-robin 0 4 1 68:112 10 68:240 10 8:112 10 8:240 10


Additional info:

Looks like the maps are not getting updated because the check for the no_path_retry setting during multipath configuration that determines if table needs to be updated or not is bypassed because of incorrect return value.
                                                                                               
                                                                 
--- multipath-tools-0.4.5-12.0/multipath/main.c 2005-12-12 18:05:53.000000000 -0800
+++ multipath-tools-0.4.5-12.0-patched/multipath/main.c 2006-02-13 04:05:12.000000000 -0800
@@ -646,7 +646,7 @@
        switch (mpp->action) {
        case ACT_NOTHING:
        case ACT_RELOADFEATURES:
-               return 0;
+               return 2;
                                                                                
        case ACT_SWITCHPG:
                dm_switchgroup(mpp->alias, mpp->bestpg);

Comment 1 Kiyoshi Ueda 2006-02-16 15:44:36 UTC
As you know, 'no_path_retry' is mainly a daemon side parameter.
So, the right way to reflect the updated multipath.conf
is to notify the daemon.  e.g. multipathd -k"reconfigure".

The patch in the original report isn't a complete fix.
I think that returning 2 for ACT_RELOADFEATURES is correct,
but ACT_NOTHING should return 0 because it should mean
the map doesn't exist and dm ioctl can not be issued,
or dm ioctl isn't really needed.

For example, if 'prio_callout' fail when you create a new map,
the map isn't created, but dm_queue_if_no_path() will be executed
and reported error like below.

------------------------------------------------------------------
# multipath
error calling out /bin/false
error calling out /bin/false
device-mapper: message ioctl failed: No such device or address
------------------------------------------------------------------

Following patch sets ACT_RELOADFEATURES when dm_queue_if_no_path()
is really needed.


--- 0.4.5.56.org/multipath/main.c	2005-12-12 21:05:53.000000000 -0500
+++ nopathretry-fix/multipath/main.c	2006-02-16 09:33:32.000000000 -0500
@@ -502,10 +502,33 @@ select_action (struct multipath * mpp, v
 		condlog(3, "set ACT_CREATE: map wwid change");
 		return;
 	}
-		
+
+	if (strstr(cmpp->features, "queue_if_no_path")) {
+		if (mpp->no_path_retry == NO_PATH_RETRY_FAIL) {
+			mpp->action = ACT_RELOADFEATURES;
+			condlog(3, "set ACT_RELOADFEATURES: no_path_retry change");
+			/* 
+			 * Don't return. ACT_RELOADFEATURES only applies
+			 * if you require no other action
+			 */
+		}
+	} else {
+		if (mpp->no_path_retry == NO_PATH_RETRY_QUEUE
+				|| mpp->no_path_retry > 0) {
+			mpp->action = ACT_RELOADFEATURES;
+			condlog(3, "set ACT_RELOADFEATURES: no_path_retry change");
+			/* 
+			 * Don't return. ACT_RELOADFEATURES only applies
+			 * if you require no other action
+			 */
+		}
+	}
+
 	if (pathcount(mpp, PATH_UP) == 0) {
-		mpp->action = ACT_NOTHING;
-		condlog(3, "set ACT_NOTHING: no usable path");
+		if (mpp->action != ACT_RELOADFEATURES) {
+			mpp->action = ACT_NOTHING;
+			condlog(3, "set ACT_NOTHING: no usable path");
+		}
 		return;
 	}
 	if (strncmp(cmpp->features, mpp->features, strlen(mpp->features))) {
@@ -517,7 +540,7 @@ select_action (struct multipath * mpp, v
 			condlog(3, "set ACT_RELOAD: features change");
 			return;
 		}
-		/* no_path_retry can never trigger an ACT_RELOADFEATURES */
+
 		if (mpp->pg_timeout){
 			ptr = strstr(cmpp->features, "pg_timeout=");
 			if ((!ptr && mpp->pg_timeout > 0) ||
@@ -645,9 +668,11 @@ domap (struct multipath * mpp)
 
 	switch (mpp->action) {
 	case ACT_NOTHING:
-	case ACT_RELOADFEATURES:
 		return 0;
 
+	case ACT_RELOADFEATURES:
+		return 2;
+
 	case ACT_SWITCHPG:
 		dm_switchgroup(mpp->alias, mpp->bestpg);
 		/*
@@ -813,10 +838,9 @@ coalesce_paths (vector curmp, vector pat
 		if (r < 0)
 			return r;
 
-
-		if (r || mpp->action == ACT_RELOADFEATURES) {
+		if (r) {
 			condlog(3, "reloading features");
-			if (r && mpp->no_path_retry != NO_PATH_RETRY_UNDEF) {
+			if (mpp->no_path_retry != NO_PATH_RETRY_UNDEF) {
 				if (mpp->no_path_retry == NO_PATH_RETRY_FAIL)
 					dm_queue_if_no_path(mpp->alias, 0);
 				else


Comment 2 Ben Marzinski 2006-02-28 20:53:51 UTC
Personally, I think that the code is correct as it currently is. Unfortunately,
this area of the code is ugly for two reasons

1. The multipath command currently does much more than it should (and hopefully
much more than it will in future releases). multipath and multipathd duplicate
a lot of the same functionality, when in truth, multipath should simply pass the
request on to multipathd whenever that function is up and running.

2. U3 depricated the U2 way of handling the "queue_if_no_path" feature (or maybe
U2 depricated it.. I don't quite remeber). At any rate, there are two methods,
One is to include "feature queue_if_no_path", the other is to include
"no_path_retry queue". These work differently. Specifically, the first method
doesn't allow for a limited number or retrys, while the second one does.

However, the intent of the code is to force multipath to defer to multipathd for
the "queue_if_no_path" feature if it was set via the second method.  It has to
do this because of the way no_path_retry handles the "limited number of retrys"
mode. This mode is handled by having multipathd set the "queue_if_no_path"
feature for the device, and then keep a counter of the number of consecutive
retry failures. If that counter reaches the limit set by the no_path_retry
option, multipathd removes the "queue_if_no_path" feature for the device. The
result of this is that the "queue_if_no_path" value visible to dm, and thus to
multipath may not actually reflect the actual value of no_path_retry that is
being tracked by multipathd.

Say for instanace that you have a device with no_paths_retry set to 10. All
paths fail, 10 retrys happen, and so the device fails.. All future IO fails
until a path comes back.  With the patches proposed above, if you run multipath,
that device will start queuing IO again, maybe for 10 more retrys... maybe
forever, since multipathd might not realize that you changed the value out from
under it, and start retry countdown again. I'm not sure which. Either way,
it's not the desired behavior.

The real solution to this problem is to run
# multipathd -k reconfigure
after updating your multipath.conf, instead of rerunning multipath.

In the future, multipath will eventually be changed so that running multipath
will simply tell multipathd to reconfigure.

Comment 3 Ben Marzinski 2006-02-28 21:04:52 UTC
Oops. The correct command is

# multipathd -k"reconfigure"


Comment 4 Lan Tran 2006-03-01 19:58:54 UTC
Yeah, using 'multipathd -k' for now is cool with me. My major concern was that
users may not realize that and expect that by editing multipath.conf and running
'multipath', all the setting changes will be persistent (which is not currently
the case for any multipathd used parameters like no_path_retry). So, until code
cleanup takes place, it may be a good idea to document this in RedHat
multipath-tools documentation? And especially, since multipathd CLI is not
actually officially documented to distro users either, perhaps, they should be
asked to just restart multipathd instead of reconfiguring (i.e.
/etc/init.d/multipathd restart) ?  
Thanks!

Comment 5 Ben Marzinski 2006-04-19 21:16:00 UTC
The multipath documentation now states that multipathd should be restarted after
changes to the configuration file.

Comment 11 Red Hat Bugzilla 2006-08-10 21:45:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2006-0513.html