1410541 – crm_simulate -s: Provide source of score calculations

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	8.0
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	pre-dev-freeze
Target Release:	---
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-01-05 17:15 UTC by John Ruemker
Modified:	2021-08-30 13:19 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-01 07:28:03 UTC
Type:	Feature Request
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Cluster Labs	5458	0	None	None	None	2020-12-01 15:32:41 UTC
Red Hat Knowledge Base (Solution)	2867361	0	None	None	None	2018-08-06 17:37:18 UTC

Description John Ruemker 2017-01-05 17:15:33 UTC

Description of problem: 'crm_simulate -s' is very handy for determining what node pacemaker would consider to be the best target for allocation of a resource or group. However, in complex configurations with many constraints and large groups, just having the raw allocation score may not be enough to help understand _why_ a node is preferred. Being able to see additional context for what contributed to a particular score would be useful.

Being able to determine what will happen with a resource in certain scenarios, or why something did get placed the way it did, is quite important to some of our customers. We often see admins try to predict what will happen in response to actions they are taking or conditions they expect to arise, in an effort to prepare for those operations. But if they see that something didn't or won't go according to their preference, they need a way to understand what to tweak to get the right behavior.

Having crm_simulate offer additional background for its score calculations would help with that. Something as simple as showing the breakdown of each contribution of points to a score - such as by constraint name, or by some descriptive string representing the reason for that contribution - would make understanding placement much easier.

Version-Release number of selected component (if applicable): pacemaker-1.1.15-11.el7_3.2

Comment 1 Andrew Beekhof 2017-01-09 02:38:01 UTC

Completely agree that this information can be highly useful. Any objection is on grounds of practicality, which is why we've generally directed people to tweak values and see how the final scores change with crm_simulate -sx and/or the horrifically verbose debug output.

Consider one place that scores are modified:

    for (gIter = rsc->rsc_cons_lhs; gIter != NULL; gIter = gIter->next) {
        rsc_colocation_t *constraint = (rsc_colocation_t *) gIter->data;

        rsc->allowed_nodes =
            constraint->rsc_lh->cmds->merge_weights(
                constraint->rsc_lh, rsc->id, rsc->allowed_nodes,
                constraint->node_attribute,
                (float)constraint->score / INFINITY,
                pe_weights_rollback);
    }

Since rsc->allowed_nodes is a hashtable of node_t, we'd need to add an ordered list of { score_delta, description } to the node_t struct for tracking changes.
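
A minimal sketch of what such a per-node list could look like. Everything here is hypothetical: `score_note_t`, `sketch_node_t`, `sketch_node_add_score()`, and `sketch_node_free_notes()` are illustrative names, not Pacemaker's node_t or any existing API. A plain singly linked list stands in for whatever container the real code would use, and the special handling of INFINITY scores is ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of one change to a node's score. */
typedef struct score_note_s {
    int score_delta;              /* points added (or removed) by this change */
    char *description;            /* e.g. a constraint id; owned by the note */
    struct score_note_s *next;    /* next change, in the order applied */
} score_note_t;

/* Hypothetical stand-in for the node_t fields this sketch needs. */
typedef struct {
    int weight;                   /* running total score */
    score_note_t *notes_head;     /* first recorded change */
    score_note_t *notes_tail;     /* last recorded change, for O(1) append */
} sketch_node_t;

/* Apply a delta to the node's score and remember why.
 * Returns 0 on success, or -1 if memory could not be allocated
 * (in which case the node is left unchanged). */
static int
sketch_node_add_score(sketch_node_t *node, int delta, const char *why)
{
    score_note_t *note = NULL;

    assert(node != NULL && why != NULL);

    note = calloc(1, sizeof(score_note_t));
    if (note == NULL) {
        return -1;
    }
    note->description = malloc(strlen(why) + 1);
    if (note->description == NULL) {
        free(note);
        return -1;
    }
    strcpy(note->description, why);
    note->score_delta = delta;

    /* Append at the tail so the list stays in application order. */
    if (node->notes_tail == NULL) {
        node->notes_head = note;
    } else {
        node->notes_tail->next = note;
    }
    node->notes_tail = note;
    node->weight += delta;
    return 0;
}

/* Free every recorded change on the node. */
static void
sketch_node_free_notes(sketch_node_t *node)
{
    score_note_t *note = NULL;

    assert(node != NULL);
    note = node->notes_head;
    while (note != NULL) {
        score_note_t *next = note->next;

        free(note->description);
        free(note);
        note = next;
    }
    node->notes_head = NULL;
    node->notes_tail = NULL;
}
```

Each score change costs two allocations and a string copy in this form, which illustrates where the overhead would come from in large configurations.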

I could imagine that maintaining that extra list would get rather costly in exactly the kind of large, complex installations where it would be useful.
Keep in mind, too, that since this logic is in a library shared between the CLI and the cluster, the main pengine process would also incur the overhead.

Another potential wrinkle: the score we bring across from constraint->rsc_lh will itself have been calculated from a number of different constraints. Should one include a rolled-up value { -40, "anti-colocation with X" }, or each of the changes that summed up to the -40? I can imagine each would be useful in different situations.
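
One way to keep both views available is to record contributions as a small tree and decide at display time whether to expand it. The sketch below is only an illustration under that assumption: `contribution_t`, `contribution_total()`, and `contribution_print()` are hypothetical names, the split of -40 into -30 and -10 in the usage note is invented for the example, and INFINITY handling is again left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical contribution: either a leaf with its own delta,
 * or a roll-up whose value is the sum of its parts. */
typedef struct contribution_s {
    const char *description;
    int score_delta;                     /* used only when n_parts == 0 */
    const struct contribution_s *parts;  /* array of sub-contributions, or NULL */
    size_t n_parts;
} contribution_t;

/* Total points contributed, summing through any nested parts. */
static int
contribution_total(const contribution_t *c)
{
    int total = 0;

    assert(c != NULL);
    if (c->n_parts == 0) {
        return c->score_delta;
    }
    for (size_t i = 0; i < c->n_parts; i++) {
        total += contribution_total(&c->parts[i]);
    }
    return total;
}

/* Print the rolled-up line and, if expand is true, every nested part
 * indented beneath it. Returns the number of lines printed. */
static int
contribution_print(const contribution_t *c, bool expand, int depth, FILE *out)
{
    int lines = 1;

    assert(c != NULL && out != NULL && depth >= 0);
    fprintf(out, "%*s{ %d, \"%s\" }\n", depth * 2, "",
            contribution_total(c), c->description);
    if (expand) {
        for (size_t i = 0; i < c->n_parts; i++) {
            lines += contribution_print(&c->parts[i], true, depth + 1, out);
        }
    }
    return lines;
}
```

With a roll-up "anti-colocation with X" built from two made-up parts of -30 and -10, calling `contribution_print()` with `expand` false prints the single -40 line, and with `expand` true prints that line followed by the two parts.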

Another source of problems is that if location constraints are in use, the scores will change based on where things are running.
If A is colocated with B, then the user might see { -4000, "anti-colocation with X on Y" } if B is running on its most preferred node, but { -40, "anti-colocation with X on Z" } on another node.

Comment 2 Ken Gaillot 2017-01-09 17:46:45 UTC

Andrew's first mentioned complication (the cost of tracking this) could potentially be addressed by having a flag somewhere to "track score details", only set if a new option is passed to crm_simulate.
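
A rough sketch of the gating idea, assuming the flag lives in some context struct that the scoring code can already reach. The names `sketch_working_set_t`, `track_score_details`, and `sketch_update_score()` are hypothetical, and a simple counter stands in for the real detail list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context shared by the scoring code. */
typedef struct {
    bool track_score_details;   /* assumed to be set only by a new CLI option */
    size_t notes_recorded;      /* stands in for the real detail list */
} sketch_working_set_t;

/* Hypothetical stand-in for the node fields this sketch needs. */
typedef struct {
    int weight;                 /* running total score */
} sketch_node_t;

/* Always update the score; record the reason only when tracking is on,
 * so normal cluster operation pays for one extra branch and nothing more. */
static void
sketch_update_score(sketch_working_set_t *ws, sketch_node_t *node,
                    int delta, const char *why)
{
    assert(ws != NULL && node != NULL && why != NULL);

    node->weight += delta;
    if (!ws->track_score_details) {
        return;                 /* cheap path: no allocation, no bookkeeping */
    }
    ws->notes_recorded++;       /* a real version would store delta and why */
}
```

The sketch also shows the cost of the approach: every function that changes a score needs access to the context struct in order to check the flag.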

However the other complications are inherent. At some level, Pacemaker's policy engine is a rudimentary AI, and teasing out how it gets from point A to point B is impractical. I'm not sure a user-friendly crm_simulate output for this is a reasonable goal. Perhaps instead we could add more detail to log messages at the debug or trace level, but that's difficult for end users to follow.

Definitely not 7.4 timeframe

Comment 3 Andrew Beekhof 2017-01-09 23:31:16 UTC

(In reply to Ken Gaillot from comment #2)
> Andrew's first mentioned complication (the cost of tracking this) could
> potentially be addressed by having a flag somewhere to "track score
> details", only set if a new option is passed to crm_simulate.

Agreed, though I'm still a bit worried about all the functions that would need to change their parameter lists to support this ('this' being the overall feature, and access to the flag specifically). Not impossible, just invasive.

> 
> However the other complications are inherent. At some level, Pacemaker's
> policy engine is a rudimentary AI, and teasing out how it gets from point A
> to point B is impractical. I'm not sure a user-friendly crm_simulate output
> for this is a reasonable goal. Perhaps instead we could add more detail to
> log messages at the debug or trace level, but that's difficult for end users
> to follow.

Potentially we could add a tag to the relevant log messages so that they alone can be selectively enabled. The challenge there is that the PE does a lot of walking up and down the resource stack, and the admin is likely to see a bunch of calculations that eventually get thrown out.
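
The tagging idea could be sketched roughly as below. This is a generic illustration, not Pacemaker's logging API: `sketch_enabled_tag` and `sketch_log_tagged()` are hypothetical names, and only a single enabled tag is supported to keep the example short.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: the one tag currently enabled, or NULL for none. */
static const char *sketch_enabled_tag = NULL;

/* Emit the message to stderr only if its tag is the enabled one.
 * Returns true if the message was written, false if it was filtered out. */
static bool
sketch_log_tagged(const char *tag, const char *fmt, ...)
{
    va_list ap;

    assert(tag != NULL && fmt != NULL);
    if (sketch_enabled_tag == NULL || strcmp(tag, sketch_enabled_tag) != 0) {
        return false;
    }
    fprintf(stderr, "[%s] ", tag);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    return true;
}
```

A filter like this selects messages by tag only. It cannot tell which calculations end up being discarded, so it would not by itself remove the noise from intermediate results.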

> 
> Definitely not 7.4 timeframe

definitely :)

Comment 5 Ken Gaillot 2017-08-01 15:34:45 UTC

Not 7.5 either; likely to end up WONTFIX unless we get more development capacity.

Comment 9 RHEL Program Management 2020-12-01 07:28:03 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 10 Ken Gaillot 2020-12-01 15:32:43 UTC

An equivalent report has been filed upstream.
