Bug 570513 - [NetApp 5.6 feat] Path switching inefficiency in DM-Multipath
Summary: [NetApp 5.6 feat] Path switching inefficiency in DM-Multipath
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 5.6
Assignee: Ben Marzinski
QA Contact: Gris Ge
URL:
Whiteboard:
Duplicates: 656919
Depends On:
Blocks: 557597 570546 593009
 
Reported: 2010-03-04 15:37 UTC by Martin George
Modified: 2011-01-13 23:02 UTC
CC: 20 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
The "pg_prio_calc" option was added to multipath.conf default options. By default, the option is set to "sum" and group priority is calculated as the sum of its path priorities. If set to "avg", multipath calculates priorities using the average priority of the paths in the group.
Clone Of:
Clones: 570546 593009
Environment:
Last Closed: 2011-01-13 23:02:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHEA-2011:0074
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: device-mapper-multipath bug fix and enhancement update
Last Updated: 2011-01-12 17:22:03 UTC

Description Martin George 2010-03-04 15:37:31 UTC
Description of problem:
Currently dm-multipath chooses paths for IO failover/failback on the basis of path group priority alone. This could lead to scenarios where IO is inefficiently routed through lower performance paths, even when higher performance paths are available.

For example, on an ALUA-enabled setup, dm-multipath assigns a path priority of 50 to an active/optimized path and 10 to an active/non-optimized path. So suppose you have 2 path groups: the 1st group comprising 1 active/optimized path at 100 GB/sec (for a group priority of 50), and the 2nd group comprising 6 active/non-optimized paths at 1 GB/sec each (for a group priority of 60). dm-multipath would still choose the 2nd group for failover/failback, since that has the higher group priority, i.e. it would choose not to use the additional 94 GB/sec that would be available had it used the single active/optimized path - not ideal behavior.

Ideally, dm-multipath should switch paths on the basis of access state alone. Several lower-performance paths are not necessarily better than fewer higher-performance paths, i.e. the active/non-optimized paths should be chosen for IO switching if and only if there are no active/optimized paths available at that point in time.
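
For illustration only (a rough sketch, not multipath source), here is how the two possible group-priority rules play out for the scenario above, assuming path priorities of 50 and 10:

    # One group with a single active/optimized path (priority 50), one group
    # with six active/non-optimized paths (priority 10 each).
    optimized_group = [50]
    non_optimized_group = [10] * 6

    # "Sum" rule (current behavior): 50 vs. 60 -> the non-optimized group wins.
    print(sum(optimized_group), sum(non_optimized_group))

    # "Average" rule: 50.0 vs. 10.0 -> the optimized group wins.
    print(sum(optimized_group) / len(optimized_group),
          sum(non_optimized_group) / len(non_optimized_group))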

Version-Release number of selected component (if applicable):
RHEL 5.4 GA (2.6.18-164.el5)
device-mapper-multipath-0.4.7-30.el5

How reproducible:
Always

Comment 1 Ben Marzinski 2010-03-08 22:55:55 UTC
device-mapper-multipath will always choose the path group based on the path group priority alone.  The question is how we should determine the path group priority. In RHEL5, the path group priority is equal to the sum of the priorities of all the paths in the group.  In RHEL6, the path group priority is equal to the average of the priorities of all the paths in the group.  These two methods cause different problems.

In RHEL5, you can have the issue mentioned in this bugzilla, where many bad paths are picked over a few good paths.

In RHEL6, you can have the issue where a few slightly better paths get picked over many slightly worse paths.

There are two places where things can be changed to affect how path groups are chosen.  The path priorities can be adjusted in the prioritizers (which are callouts in RHEL5 and functions loaded from DSOs in RHEL6), or multipath can change the way it calculates the path group priorities based on the individual path priorities.

For RHEL5, I'm not sure we want to change how the path group priorities are calculated this late in its life.  There are customers who have written their own priority callouts, so we're not sure what effect changing the path group priority calculations would have on all of our customers.

Since we clearly have a problem with path groups containing ALUA active/optimized paths not being weighted highly enough, we can simply change how much weight the mpath_prio_alua callout assigns to active/optimized paths to something high enough that we won't hit this problem anymore.  This way, we can reasonably believe that everyone who is affected by this change will see a performance improvement because of it.
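
A rough sketch of that idea with a purely hypothetical value (the actual priority the callout would assign is not settled here):

    # Hypothetical: give active/optimized paths a priority so high that no
    # realistic number of non-optimized paths (priority 10 each) can out-sum it.
    OPTIMIZED_PRIO = 200        # hypothetical value, not the current 50
    NON_OPTIMIZED_PRIO = 10
    non_optimized_paths = 6

    # Under the existing "sum" rule the optimized group now wins (200 > 60).
    print(OPTIMIZED_PRIO > NON_OPTIMIZED_PRIO * non_optimized_paths)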

Any objections?

Comment 2 Martin George 2010-03-10 12:03:56 UTC
Well, I find the RHEL6 design, where the path group priority is the average (as opposed to the sum) of the individual path priorities, to be the best approach here for avoiding the above issue.

It would be ideal if this made it to RHEL5 as well. How about making this a multipath.conf setting? You could have a new parameter like "group_priority", which would be set to "sum" or "average" as the case may be. In RHEL5 this could default to "sum", while in RHEL6 it would default to "average".

I'm really not in favor of increasing path weights as an alternative - there would still be corner cases where the above issue is hit. That's not a reliable workaround.

Comment 3 Tom Coughlan 2010-03-15 22:24:36 UTC
(In reply to comment #2)

> I'm really not in favor of increasing path weights as an alternative - there
> would still be corner cases where the above issue is hit. That's not a reliable
> workaround.    

I wonder about that. 

The problem here is that the performance difference between optimized and non-optimized is hard-coded (50 vs. 10). It seems to me that a user should be able to specify this difference (maybe with a parameter to mpath_prio_alua?), based on the characteristics of their hardware. So, if I have a box where the difference between optimized and non-optimized is small, I can set this number low. This would allow a switch to non-optimized paths to happen when their combined performance is likely to be better than that of the optimized paths. It might also be useful for a possible future load balancer that mixes non-optimized paths with optimized paths that offer only slightly better performance.
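
As a rough illustration with hypothetical numbers (no such tunable exists today):

    # Hypothetical small optimized/non-optimized gap configured by the admin
    # for hardware where the optimized paths are only slightly faster.
    optimized = 12
    non_optimized = 10
    non_optimized_paths = 6

    # Under the "sum" rule the six nearly-as-fast paths outrank the single
    # optimized path (60 > 12), which is desirable if their combined
    # throughput really is higher.
    print(non_optimized * non_optimized_paths > optimized)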

Just some thoughts.

Comment 4 Martin George 2010-03-19 10:12:38 UTC
Well, that's another approach.

But the above issue has already been addressed in RHEL6 - by making the group path priority the average of the individual path priorities. IMO, it would be good to have this feature in RHEL5 as well.

Since Ben raised the concern that back porting this design could affect existing RHEL5 group priority calculations, I suggested implementing this as an optional new multipath.conf setting, which could then cater to both existing & new dm-multipath users.

Comment 6 Tom Coughlan 2010-04-19 13:30:30 UTC
(In reply to comment #4)
> Well, that's another approach.
> 
> But the above issue has already been addressed in RHEL6 - by making the group
> path priority as the average of the individual path priorities. IMO, it would
> be good to have this feature in RHEL5 as well.

Yes, you are right, it is probably best for RHEL 5 to track the upstream (and RHEL 6)  direction here. As you said, the RHEL 5 default needs to stay the same, with the new behavior selected by a parameter. 

Tom

Comment 12 Ben Marzinski 2010-09-03 21:01:33 UTC
There is now a new multipath.conf defaults option, "pg_prio_calc".  By default it is set to "sum". Setting it to "avg" causes multipath to calculate priorities using the average priority of the paths in the group, just like multipath does in rhel 6.
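
For reference, a minimal multipath.conf fragment selecting the new behavior (assuming the usual defaults-section syntax):

    defaults {
            pg_prio_calc avg
    }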

Comment 13 Andrius Benokraitis 2010-09-03 22:20:43 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
There is now a new multipath.conf defaults option, "pg_prio_calc".  By default it is set to "sum". Setting it to "avg" causes multipath to calculate priorities using the average priority of the paths in the group, just like
multipath does in rhel 6.

Comment 15 Chris Ward 2010-11-09 13:37:32 UTC
~~ Attention Customers and Partners - RHEL 5.6 Public Beta is now available on RHN ~~

A fix for this 'OtherQA' BZ should be present and testable in the release. 

If this Bugzilla is verified as resolved, please update the Verified field above with an appropriate value and include a summary of the testing executed and the results obtained.

If you encounter any issues or have questions while testing, please describe them and set this bug into NEED_INFO. 

If you encounter new defects or have additional patches to request for inclusion, promptly escalate the new issues through your support representative.

Finally, future Beta kernels can be found here:
 http://people.redhat.com/jwilson/el5/

Note: Bugs with the 'OtherQA' keyword require Third-Party testing to confirm the request has been properly addressed. See: https://bugzilla.redhat.com/describekeywords.cgi#OtherQA

Comment 16 Ben Marzinski 2010-11-29 18:46:38 UTC
*** Bug 656919 has been marked as a duplicate of this bug. ***

Comment 17 Ben Marzinski 2010-11-29 18:49:12 UTC
There is a problem with this fix. Bug 656919 documents a crash that happens when all paths are down.

Comment 18 Eva Kopalova 2010-11-30 11:50:41 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,2 +1 @@
-There is now a new multipath.conf defaults option, "pg_prio_calc".  By default it is set to "sum". Setting it to "avg" causes multipath to calculate priorities using the average priority of the paths in the group, just like
-multipath does in rhel 6.
+The "pg_prio_calc" option was added to multipath.conf default options. By default, the option is set to "sum" and group priority is calculated as the sum of its path priorities. If set to "avg", multipath calculates priorities using the average priority of the paths in the group.

Comment 20 Chris Ward 2010-12-02 15:29:51 UTC
Reminder! There should be a fix present for this BZ in snapshot 3 -- unless otherwise noted in a previous comment.

Please test and update this BZ with test results as soon as possible.

Comment 22 errata-xmlrpc 2011-01-13 23:02:53 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0074.html

