Bug 1410541 - crm_simulate -s: Provide source of score calculations
Summary: crm_simulate -s: Provide source of score calculations
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: pre-dev-freeze
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-01-05 17:15 UTC by John Ruemker
Modified: 2020-07-05 15:05 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Feature Request
Target Upstream Version:



Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2867361 None None None 2018-08-06 17:37:18 UTC

Description John Ruemker 2017-01-05 17:15:33 UTC
Description of problem: 'crm_simulate -s' is very handy for determining what node pacemaker would consider to be the best target for allocation of a resource or group.  However, in complex configurations with many constraints and large groups, just having the raw allocation score may not be enough to help understand _why_ a node is preferred.  Being able to see additional context for what contributed to a particular score would be useful.  

Being able to determine what will happen with a resource in certain scenarios, or why something was placed the way it was, is quite important to some of our customers.  We often see admins try to predict what will happen in response to actions they are taking, or to conditions they expect to arise, in an effort to prepare for those operations.  But if they see that something didn't or won't go according to their preference, they need a way to understand what to tweak to get the right behavior.

Having crm_simulate offer additional background for its score calculations would help with that.  Something as simple as showing the breakdown of each contribution of points to a score - such as by constraint name, or by some descriptive string representing the reason for that contribution - would make understanding placement much easier.

Version-Release number of selected component (if applicable): pacemaker-1.1.15-11.el7_3.2



Comment 1 Andrew Beekhof 2017-01-09 02:38:01 UTC
I completely agree that this information can be highly useful; any objection is on grounds of practicality, which is why we've generally directed people to tweak values and see how the final scores change with crm_simulate -sx and/or the horrifically verbose debug output.

Consider one place that scores are modified:

    for (gIter = rsc->rsc_cons_lhs; gIter != NULL; gIter = gIter->next) {
        rsc_colocation_t *constraint = (rsc_colocation_t *) gIter->data;

        rsc->allowed_nodes =
            constraint->rsc_lh->cmds->merge_weights(
                constraint->rsc_lh, rsc->id, rsc->allowed_nodes,
                constraint->node_attribute,
                (float)constraint->score / INFINITY,
                pe_weights_rollback);
    }

Since rsc->allowed_nodes is a hashtable of node_t, we'd need to add an ordered list of { score_delta, description } to the node_t struct for tracking changes.

I could imagine maintaining that extra list could get rather costly in exactly the kind of large complex installations that it would be useful in.
Keep in mind too, that since this logic is in a library shared between the cli and cluster, the main pengine process would also incur the overhead. 
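To make the { score_delta, description } idea above concrete, here is a minimal sketch of what such a per-node list might look like. All names in it (score_delta_t, tracked_node_t, node_add_score and so on) are hypothetical and do not exist in Pacemaker; the real node_t has many more fields, and this sketch uses plain integer addition rather than Pacemaker's special handling of INFINITY scores.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical types for illustration only; not part of Pacemaker. */
typedef struct score_delta_s {
    int delta;                     /* points added or removed by one step */
    const char *description;       /* reason for the change; caller owns the string */
    struct score_delta_s *next;    /* next change, in the order applied */
} score_delta_t;

typedef struct {
    const char *uname;             /* node name */
    int weight;                    /* running allocation score */
    score_delta_t *history;        /* first recorded change */
    score_delta_t *history_tail;   /* last recorded change, for O(1) append */
} tracked_node_t;

/* Apply a change to the node's score and remember why it happened.
 * Returns 0 on success, -1 if memory could not be allocated (in which
 * case the score is left untouched). */
static int
node_add_score(tracked_node_t *node, int delta, const char *description)
{
    score_delta_t *entry = malloc(sizeof(*entry));

    if (entry == NULL) {
        return -1;
    }
    entry->delta = delta;
    entry->description = description;
    entry->next = NULL;

    if (node->history_tail == NULL) {
        node->history = entry;
    } else {
        node->history_tail->next = entry;
    }
    node->history_tail = entry;
    node->weight += delta;
    return 0;
}

/* Print the final score followed by every contribution, oldest first. */
static void
node_print_history(const tracked_node_t *node, FILE *out)
{
    fprintf(out, "%s: %d\n", node->uname, node->weight);
    for (const score_delta_t *e = node->history; e != NULL; e = e->next) {
        fprintf(out, "  %+d  %s\n", e->delta, e->description);
    }
}

/* Release all recorded changes. */
static void
node_free_history(tracked_node_t *node)
{
    score_delta_t *e = node->history;

    while (e != NULL) {
        score_delta_t *next = e->next;
        free(e);
        e = next;
    }
    node->history = NULL;
    node->history_tail = NULL;
}
```

Every score change costs one allocation in this sketch, which is the overhead being discussed: it grows with the number of nodes multiplied by the number of constraint evaluations.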

Another potential wrinkle: the score we bring across from constraint->rsc_lh will itself have been calculated from a number of different constraints. Should one include a rolled-up value { -40, "anti-colocation with X" }, or each of the changes that summed to the -40?  I can imagine each being useful in different situations.
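One way to picture the rolled-up versus itemized choice is a contribution that optionally carries its own parts, with the caller deciding at print time whether to expand them. This is only a sketch under that assumption: contribution_t and its functions are hypothetical names, not Pacemaker code, and the sub-entries in the usage note below are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical type for illustration only; not part of Pacemaker. */
typedef struct contribution_s {
    int delta;                           /* rolled-up value of this entry */
    const char *description;             /* reason for the change */
    const struct contribution_s *parts;  /* changes that summed to delta, or NULL */
    size_t n_parts;                      /* number of elements in parts */
} contribution_t;

/* Recompute an entry's value from its parts; entries without parts
 * simply report their own delta. */
static int
contribution_total(const contribution_t *c)
{
    int sum = 0;

    if (c->n_parts == 0) {
        return c->delta;
    }
    for (size_t i = 0; i < c->n_parts; i++) {
        sum += contribution_total(&c->parts[i]);
    }
    return sum;
}

/* Print one entry, and its parts (indented) when expand is non-zero.
 * Returns the number of lines written. */
static int
contribution_print(const contribution_t *c, int expand, int depth, FILE *out)
{
    int lines = 1;

    fprintf(out, "%*s{ %d, \"%s\" }\n", depth * 2, "", c->delta, c->description);
    if (expand) {
        for (size_t i = 0; i < c->n_parts; i++) {
            lines += contribution_print(&c->parts[i], expand, depth + 1, out);
        }
    }
    return lines;
}
```

With a parent entry { -40, "anti-colocation with X" } built from two hypothetical parts of -30 and -10, calling contribution_print with expand set to 0 writes only the rolled-up line, while expand set to 1 writes the parent followed by both parts.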

Another source of problems is that if location constraints are in use, the scores will change based on where things are running.
If A is colocated with B, then the user might see { -4000, "anti-colocation with X on Y" } if B is running on its most preferred node, but { -40, "anti-colocation with X on Z" } on another node.

Comment 2 Ken Gaillot 2017-01-09 17:46:45 UTC
Andrew's first mentioned complication (the cost of tracking this) could potentially be addressed by having a flag somewhere to "track score details", only set if a new option is passed to crm_simulate.
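As a rough illustration of the "track score details" flag, the sketch below records a reason for each score change only when the flag is set, so callers that leave it unset pay only for one bit test. Everything here is an assumption made for the example: SKETCH_TRACK_SCORES, sketch_working_set_t and apply_score are hypothetical names, not Pacemaker identifiers, and the fixed-size notes array stands in for whatever storage a real implementation would use.

```c
#include <stdio.h>

#define SKETCH_TRACK_SCORES 0x01u  /* hypothetical "track score details" flag */
#define SKETCH_MAX_NOTES    16     /* arbitrary cap to keep the sketch simple */

/* Hypothetical per-run context; not a Pacemaker type. */
typedef struct {
    unsigned int flags;            /* options requested by the caller */
    int n_notes;                   /* number of recorded changes */
    struct {
        int delta;                 /* points added or removed */
        const char *why;           /* reason; caller owns the string */
    } notes[SKETCH_MAX_NOTES];
} sketch_working_set_t;

/* Return score + delta. The reason is recorded only when tracking is
 * enabled and there is room left; otherwise it is silently dropped. */
static int
apply_score(sketch_working_set_t *ws, int score, int delta, const char *why)
{
    if ((ws->flags & SKETCH_TRACK_SCORES) && ws->n_notes < SKETCH_MAX_NOTES) {
        ws->notes[ws->n_notes].delta = delta;
        ws->notes[ws->n_notes].why = why;
        ws->n_notes++;
    }
    return score + delta;
}
```

The scheduler would leave the flag clear and compute the same scores as before, while a command-line tool could set it when the user asks for details. Note that every function that changes a score needs access to the context holding the flag.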

However the other complications are inherent. At some level, Pacemaker's policy engine is a rudimentary AI, and teasing out how it gets from point A to point B is impractical. I'm not sure a user-friendly crm_simulate output for this is a reasonable goal. Perhaps instead we could add more detail to log messages at the debug or trace level, but that's difficult for end users to follow.

Definitely not 7.4 timeframe

Comment 3 Andrew Beekhof 2017-01-09 23:31:16 UTC
(In reply to Ken Gaillot from comment #2)
> Andrew's first mentioned complication (the cost of tracking this) could
> potentially be addressed by having a flag somewhere to "track score
> details", only set if a new option is passed to crm_simulate.

Agreed, though I'm still a bit worried about all the functions that would need to change their parameter lists to support this ('this' being the overall feature, and access to the flag specifically). Not impossible, just invasive.

> 
> However the other complications are inherent. At some level, Pacemaker's
> policy engine is a rudimentary AI, and teasing out how it gets from point A
> to point B is impractical. I'm not sure a user-friendly crm_simulate output
> for this is a reasonable goal. Perhaps instead we could add more detail to
> log messages at the debug or trace level, but that's difficult for end users
> to follow.

Potentially we could add a tag to the relevant log messages so that they alone can be selectively enabled.  The challenge there is that the PE does a lot of walking up and down the resource stack, and the admin is likely to see a bunch of calculations that eventually get thrown out.

> 
> Definitely not 7.4 timeframe

definitely :)

Comment 5 Ken Gaillot 2017-08-01 15:34:45 UTC
Not 7.5 either, likely to end up WONTFIX unless we get more development capacity

