Bug 247579

Summary:	LUN (configured with multipath) removal on Clariion storage causes lv commands to freeze for 10 minutes
Product:	Red Hat Enterprise Linux 5	Reporter:	Jian Wang <jiawang>
Component:	lvm2	Assignee:	Ben Marzinski <bmarzins>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	5.0	CC:	agk, bmarzins, bmr, coughlan, dwysocha, heinzm, mbroz, prockai
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-09-14 14:49:11 UTC	Type:	---

Description Jian Wang 2007-07-10 02:53:40 UTC

Description of problem:
When a LUN that was configured with multipath is removed from Clariion storage,
lv commands freeze for 10 minutes. Even restoring the LUN does not
stop the freeze.



Version-Release number of selected component (if applicable):
RHEL5

How reproducible:
Very Reproducible

Steps to Reproduce:
1. Configure multipath on a Clariion storage device
2. Remove one LUN
3. Run lv commands like "lvs" or "lvdisplay"
  
Actual results:
LV commands freeze for 10 minutes, and Ctrl+C cannot stop them.
After 10 minutes, the LV command shows the correct output.

Expected results:
Should not freeze for 10 minutes.  

Additional info:
The same procedure on HP storage works without any problem.

We found this bug while trying to reproduce bug 238421.

Comment 1 Jian Wang 2007-07-10 13:45:27 UTC

After 10 minutes, when the command resumes and shows the result,
/var/log/messages also contains the following log entry:
"Jun 4 14:31:47 xeon3 multipathd: mpath2: Disable queueing" 

It seems queueing is causing the display to wait for such a long time.
So I changed multipath.conf:
      no_path_retry          300   ==>   no_path_retry          fail
This had no effect on the waiting time. Instead I had to change
the source file /multipath-tools-0.4.7.rhel5.2/libmultipath/hwtable.c line 176
from 60 to NO_PATH_RETRY_UNDEF, to make it fail directly when there are
no other paths to retry. This solves the problem temporarily.

It looks like a bug that the configuration file does not take effect for the
no_path_retry entry.
It seems to me that multipath.conf entries are overridden by "hwtable" in
config.c (function load_config) in this case.

I guess loading "hwtable" first and then loading the user configuration would solve
this problem too, but I am not sure whether it would have other side effects.

Comment 2 Jian Wang 2007-07-20 18:58:53 UTC

I changed multipath.conf from 
#               no_path_retry          300
to
                no_path_retry          fail
and then ran
/etc/init.d/multipathd restart

The change did not affect the output of the following commands:
[root@xeon3 ~]# dmsetup table mpath11
0 2179072 multipath 1 queue_if_no_path 1 emc 1 1 round-robin 0 1 1 71:704 1000 

[root@xeon3 ~]# multipath -v3 |grep no_path_retry
mpath77: no_path_retry = 60 (controller setting)
mpath79: no_path_retry = 60 (controller setting)
mpath524: no_path_retry = 60 (controller setting)
  ....
mpath89: no_path_retry = 60 (controller setting)
mpath529: no_path_retry = 60 (controller setting)
mpath91: no_path_retry = 60 (controller setting)
mpath530: no_path_retry = 60 (controller setting)


Analysing the multipath-tools code
in libmultipath/propsel.c:
select_no_path_retry(struct multipath *mp)
{
	if (mp->mpe && mp->mpe->no_path_retry != NO_PATH_RETRY_UNDEF) {
		mp->no_path_retry = mp->mpe->no_path_retry;
		condlog(3, "%s: no_path_retry = %i (multipath setting)",
			mp->alias, mp->no_path_retry);
		return 0;
	}
	if (mp->hwe && mp->hwe->no_path_retry != NO_PATH_RETRY_UNDEF) {
		mp->no_path_retry = mp->hwe->no_path_retry;
		condlog(3, "%s: no_path_retry = %i (controller setting)",
			mp->alias, mp->no_path_retry);
		return 0;
	}
	if (conf->no_path_retry != NO_PATH_RETRY_UNDEF) {
		mp->no_path_retry = conf->no_path_retry;
		condlog(3, "%s: no_path_retry = %i (config file default)",
			mp->alias, mp->no_path_retry);
		return 0;
	}
	mp->no_path_retry = NO_PATH_RETRY_UNDEF;
	condlog(3, "%s: no_path_retry = NONE (internal default)",
		mp->alias);
	return 0;
}

We found that multipath first selects the controller default setting from
hwtable.c and only then the configuration file's default value.
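
The selection order in select_no_path_retry() can be modelled in isolation. The following is only a sketch: the struct layout, the enum, the function name pick_no_path_retry and the numeric values of RETRY_UNDEF and RETRY_FAIL are assumptions made for illustration, not the real libmultipath definitions. It shows why a controller value of 60 wins over "fail" in the defaults section.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; the real constants are defined in libmultipath. */
#define RETRY_UNDEF 0     /* this level does not set no_path_retry */
#define RETRY_FAIL  (-1)  /* "fail": do not queue when all paths are gone */

/* Which configuration level supplied the effective value. */
enum origin { FROM_MULTIPATH, FROM_CONTROLLER, FROM_DEFAULTS, FROM_INTERNAL };

/* One configuration level, reduced to the single field of interest. */
struct level {
	int no_path_retry;
};

struct mpath_model {
	const struct level *mpe;   /* per-multipath entry, may be NULL */
	const struct level *hwe;   /* controller entry, may be NULL */
	const struct level *defs;  /* defaults section, may be NULL */
};

/* Same order as the quoted function: multipath, controller, defaults. */
int pick_no_path_retry(const struct mpath_model *mp, enum origin *from)
{
	assert(mp != NULL && from != NULL);

	if (mp->mpe && mp->mpe->no_path_retry != RETRY_UNDEF) {
		*from = FROM_MULTIPATH;
		return mp->mpe->no_path_retry;
	}
	if (mp->hwe && mp->hwe->no_path_retry != RETRY_UNDEF) {
		*from = FROM_CONTROLLER;
		return mp->hwe->no_path_retry;
	}
	if (mp->defs && mp->defs->no_path_retry != RETRY_UNDEF) {
		*from = FROM_DEFAULTS;
		return mp->defs->no_path_retry;
	}
	*from = FROM_INTERNAL;
	return RETRY_UNDEF;
}
```

With a controller entry of 60 and a defaults entry of RETRY_FAIL, the function returns 60 from the controller level; the defaults value is only reached when the controller entry is absent or unset.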

Multipath works this way for all configurable parameters, such as
get_prio/get_uid/pgpolicy etc.
So my question is:
Is this loading order reasonable?

By the way, the default value of no_path_retry is NO_PATH_RETRY_UNDEF
for all controllers except COMPAQ/HP HSV and EMC Clariion.
So HP HSV might have the same problem as well, with HP HSV's no_path_retry
default value being 60.

Comment 3 Ben Marzinski 2007-10-29 20:49:04 UTC

I have a few questions about this.  Is there only one path to the device?  The
no_path_retry parameter should only affect operation when the last path is
removed. If there are still active paths, and you are queuing IO until the
no_path_retry limit is reached, then that is a problem all by itself.

However, the configuration loading order is not a problem. First multipath tries
to load from the multipath-specific parameters, then the controller-specific
parameters, then the parameters specified in the defaults section of the config
file. If none of these set a value for the parameter, then it uses a sensible
compiled-in default.

hwtable.c sets some compiled-in controller-specific defaults.  These are checked
along with the user-defined controller-specific parameters. However, user-
supplied parameters are given priority. So, user-supplied parameters in the
devices section of the config file override the compiled-in ones for that
controller.  I am assuming that you set no_path_retry in the defaults section of
the config file.
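
The priority described above, where a user-supplied devices entry beats the compiled-in hwtable.c entry for the same controller, can be sketched as follows. This is a simplified model under stated assumptions: struct hw_entry, lookup and controller_no_path_retry are hypothetical names, the vendor and product strings are placeholders, and exact string comparison stands in for whatever matching multipath-tools actually performs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative values only; the real constants are defined in libmultipath. */
#define RETRY_UNDEF 0     /* entry leaves no_path_retry unset */
#define RETRY_FAIL  (-1)  /* "fail": do not queue when all paths are gone */

/* Hypothetical, reduced controller entry. */
struct hw_entry {
	const char *vendor;
	const char *product;
	int no_path_retry;
};

/* Value from the first matching entry that sets it, else RETRY_UNDEF. */
int lookup(const struct hw_entry *tab, size_t n,
	   const char *vendor, const char *product)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(tab[i].vendor, vendor) == 0 &&
		    strcmp(tab[i].product, product) == 0 &&
		    tab[i].no_path_retry != RETRY_UNDEF)
			return tab[i].no_path_retry;
	return RETRY_UNDEF;
}

/* User-supplied devices entries are consulted before compiled-in ones. */
int controller_no_path_retry(const struct hw_entry *user, size_t nuser,
			     const struct hw_entry *builtin, size_t nbuiltin,
			     const char *vendor, const char *product)
{
	int value;

	assert(vendor != NULL && product != NULL);

	value = lookup(user, nuser, vendor, product);
	if (value != RETRY_UNDEF)
		return value;
	return lookup(builtin, nbuiltin, vendor, product);
}
```

In this model a compiled-in value of 60 is returned when no user entry exists for the controller, and a user entry that sets RETRY_FAIL for the same vendor and product replaces it. A value placed only in the defaults section never reaches this lookup, which is consistent with the behaviour reported earlier in this bug.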

Comment 5 Tom Coughlan 2010-03-05 18:31:46 UTC

No reply for over 2 years, and no other reports like this. Closing.