Bug 593009

Summary: [NetApp 4.9 feat] Path switching inefficiency in DM-Multipath
Product: Red Hat Enterprise Linux 4
Reporter: Martin George <marting>
Component: device-mapper-multipath
Assignee: LVM and device-mapper development team <lvm-team>
Status: CLOSED WONTFIX
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 4.9
CC: agk, andriusb, bmarzins, bmr, christophe.varoqui, coughlan, dwysocha, edamato, egoggin, heinzm, iannis, junichi.nomura, kueda, lmb, mbroz, prockai, tranlan, xdl-redhat-bugzilla
Target Milestone: beta
Target Release: 4.9
Keywords: FutureFeature, OtherQA
Hardware: All
OS: Linux
Doc Type: Enhancement
Clone Of: 570513
Last Closed: 2010-05-20 02:08:20 UTC
Bug Depends On: 570513

Description Martin George 2010-05-17 15:29:22 UTC
+++ This bug was initially created as a clone of Bug #570513 +++

Description of problem:
Currently dm-multipath chooses paths for IO failover/failback on the basis of path group priority alone. This can lead to scenarios where IO is inefficiently routed through lower-performance paths even when higher-performance paths are available.

For example, on an ALUA-enabled setup, dm-multipath assigns a path priority of 50 to an active/optimized path and 10 to an active/non-optimized path. Suppose you have two path groups: the first comprising one active/optimized path at 100 GB/sec (for a group priority of 50), and the second comprising six active/non-optimized paths at 1 GB/sec each (for a group priority of 60). dm-multipath would still choose the second group for failover/failback since it has the higher group priority, i.e. it would forgo the additional 94 GB/sec that would have been available through the single active/optimized path - not ideal behavior.
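
As an illustration of the sum-based calculation in the scenario above (numbers repeated from the example; this only restates the current behavior):

  PG1 (1 active/optimized path):       1 x 50 = 50   (aggregate bandwidth ~100 GB/sec)
  PG2 (6 active/non-optimized paths):  6 x 10 = 60   (aggregate bandwidth ~6 GB/sec)  <-- selected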

Ideally, dm-multipath should switch paths on the basis of access state alone. Several lower-performance paths should not be considered better than fewer higher-performance paths; i.e. the active/non-optimized paths should be chosen for IO switching if and only if no active/optimized paths are available at that point in time.

Version-Release number of selected component (if applicable):
RHEL 5.4 GA (2.6.18-164.el5)
device-mapper-multipath-0.4.7-30.el5

How reproducible:
Always

--- Additional comment from bmarzins on 2010-03-08 17:55:55 EST ---

device-mapper-multipath will always choose the path group based on the path group priority alone.  The question is how we should determine the path group priority. In RHEL5, the path group priority is equal to the sum of the priorities of all the paths in the group.  In RHEL6, the path group priority is equal to the average of the priorities of all the paths in the group.  These two methods cause different problems.
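
As a rough sketch only (not the actual multipath-tools code; function names here are illustrative), the two policies differ only in how the per-path priorities are combined into a group priority:

  /* Illustrative sketch; compile with e.g. gcc -std=c99 pg_prio.c */
  #include <stdio.h>

  static int pg_prio_sum(const int *prios, int n)
  {
      int total = 0;
      for (int i = 0; i < n; i++)
          total += prios[i];
      return total;
  }

  static int pg_prio_avg(const int *prios, int n)
  {
      return n ? pg_prio_sum(prios, n) / n : 0;
  }

  int main(void)
  {
      int opt[]    = { 50 };                      /* 1 active/optimized path */
      int nonopt[] = { 10, 10, 10, 10, 10, 10 };  /* 6 active/non-optimized paths */

      /* RHEL5-style (sum):     opt=50, nonopt=60 -> non-optimized group wins */
      printf("sum: opt=%d nonopt=%d\n", pg_prio_sum(opt, 1), pg_prio_sum(nonopt, 6));

      /* RHEL6-style (average): opt=50, nonopt=10 -> optimized group wins */
      printf("avg: opt=%d nonopt=%d\n", pg_prio_avg(opt, 1), pg_prio_avg(nonopt, 6));
      return 0;
  }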

In RHEL5, you can have the issue mentioned in this bugzilla, where many bad paths are picked over a few good paths.

In RHEL6, you can have the issue where a few slightly better paths get picked over many slightly worse paths.

There are two places where things can be changed to affect how path groups are chosen.  The path priorities can be adjusted in the prioritizers (which are callouts in RHEL5 and functions loaded from DSOs in RHEL6), or multipath can change the way it calculates the path group priorities from the individual path priorities.

For RHEL5, I'm not sure we want to change how the path group priorities are calculated this late in its life.  There are customers who have written their own priority callouts, so we're not sure what effect changing the path group priority calculations would have on all of our customers.

Since we clearly have a problem with path groups containing ALUA active/optimized paths not being weighted highly enough, we can simply change how much weight the mpath_prio_alua callout assigns to active/optimized paths to something high enough that we won't hit this problem anymore.  That way, we can reasonably believe that everyone who is affected by this change will see a performance improvement because of it.
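
For instance (the value here is purely illustrative, not a committed choice): if mpath_prio_alua returned, say, 100 instead of 50 for an active/optimized path, the optimized group in the example above would sum to 100 and would win over the six non-optimized paths summing to 60.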

Any objections?

--- Additional comment from marting on 2010-03-10 07:03:56 EST ---

Well, I find the RHEL6 design of choosing the path group priority as the average (as opposed to the sum) of the individual path priorities to be the best approach here to avoid hitting the above issue.

It would be ideal if this made it to RHEL5 as well. How about making this a multipath.conf setting? You could have a new parameter like "group_priority" which would be set to "sum" or "average" as the case may be. In RHEL5, this could default to "sum", while in RHEL6 it would default to "average".
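
For illustration only - "group_priority" as shown below is the hypothetical keyword proposed in this comment, not an existing multipath.conf option - the setting might look something like:

  defaults {
          # proposed (hypothetical) option; "sum" would preserve current RHEL5 behavior
          group_priority "average"
  }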

I'm really not in favor of increasing path weights as an alternative - there would still be corner cases where the above issue is hit. That's not a reliable workaround.

--- Additional comment from coughlan on 2010-03-15 18:24:36 EDT ---

(In reply to comment #2)

> I'm really not in favor of increasing path weights as an alternative - there
> would still be corner cases where the above issue is hit. That's not a reliable
> workaround.    

I wonder about that. 

The problem here is that the performance difference between optimized and non-optimized paths is hard-coded (50 vs. 10). It seems to me that a user should be able to specify this difference (maybe with a parameter to mpath_prio_alua?), based on the characteristics of their hardware.  So, if I have a box where the difference between optimized and non-optimized is small, I can set this number low. This would allow a switch to non-optimized paths to happen when their combined performance is likely to be better than that of the optimized paths. It might also be useful for a possible future load balancer that includes non-optimized paths in the mix with optimized paths that offer only slightly better performance.
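
Just to sketch that idea (the "-g" argument below is hypothetical - mpath_prio_alua has no such option today - and the vendor/product values are only placeholders), a per-device callout entry in RHEL5 multipath.conf might then look roughly like:

  device {
          vendor "NETAPP"
          product "LUN"
          # hypothetical gap argument: priority difference between optimized
          # and non-optimized paths, instead of the hard-coded 50 vs. 10
          prio_callout "/sbin/mpath_prio_alua -g 20 /dev/%n"
  }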

Just some thoughts.

--- Additional comment from marting on 2010-03-19 06:12:38 EDT ---

Well, that's another approach.

But the above issue has already been addressed in RHEL6 by making the path group priority the average of the individual path priorities. IMO, it would be good to have this feature in RHEL5 as well.

Since Ben raised the concern that backporting this design could affect existing RHEL5 group priority calculations, I suggested implementing this as an optional new multipath.conf setting, which could then cater to both existing and new dm-multipath users.

--- Additional comment from coughlan on 2010-04-19 09:30:30 EDT ---

(In reply to comment #4)
> Well, that's another approach.
> 
> But the above issue has already been addressed in RHEL6 by making the path
> group priority the average of the individual path priorities. IMO, it would
> be good to have this feature in RHEL5 as well.

Yes, you are right, it is probably best for RHEL 5 to track the upstream (and RHEL 6) direction here. As you said, the RHEL 5 default needs to stay the same, with the new behavior selected by a parameter.

Tom

Comment 1 Martin George 2010-05-17 15:29:52 UTC
Tracking this feature for RHEL 4.9.

Comment 2 Andrius Benokraitis 2010-05-20 02:08:20 UTC
There are no features allowed for RHEL 4.9.