Bug 1931023 - A non-promoted clone instance gets relocated when a cloned resource starts on a node with higher promotable score
Summary: A non-promoted clone instance gets relocated when a cloned resource starts on...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.9
Assignee: Reid Wahl
QA Contact: cluster-qe
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks: 2222055
 
Reported: 2021-02-20 06:28 UTC by Reid Wahl
Modified: 2023-11-14 16:51 UTC
CC List: 8 users

Fixed In Version: pacemaker-2.1.6-4.el8
Doc Type: Bug Fix
Doc Text:
.Unpromoted clone instances no longer restart unnecessarily
Previously, promotable clone instances were assigned in numerical order, with promoted instances first. As a result, if a promoted clone instance needed to start, an unpromoted instance in some cases restarted unexpectedly, because the instance numbers changed. With this fix, roles are considered when assigning instance numbers to nodes, and as a result no unnecessary restarts occur.
Clone Of:
Clones: 2222055
Environment:
Last Closed: 2023-11-14 15:32:34 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ClusterLabs pacemaker pull 2313 0 None open Fix: libpacemaker: Don't shuffle anonymous clone instances unnecessarily 2021-02-26 09:07:19 UTC
Red Hat Issue Tracker CLUSTERQE-6691 0 None None None 2023-05-16 13:11:11 UTC
Red Hat Knowledge Base (Solution) 4849731 0 None None None 2021-02-20 06:28:54 UTC
Red Hat Product Errata RHEA-2023:6970 0 None None None 2023-11-14 15:33:21 UTC

Description Reid Wahl 2021-02-20 06:28:54 UTC
Description of problem:

If an instance of a promotable clone is running on node 2, and the clone is about to start and promote on node 1 (which has a higher promotable score), then the non-promoted clone instance relocates. The effect is an unnecessary restart on the non-promoted node.

So far, for simplicity, this has only been tested on a two-node cluster with promotable-max defaulting to 1. I think a demonstration is easier to follow than a description.

Promotable score is 10 on node1 and 5 on node2.
    # crm_attribute -G -t status -n master-stateful -N node1
    scope=status  name=master-stateful value=10
    # crm_attribute -G -t status -n master-stateful -N node2
    scope=status  name=master-stateful value=5
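
For reference, a rough sketch of how these master-stateful values could be seeded by hand on a throwaway test cluster (in a live cluster the resource agent sets the promotable score itself, so the command and values below are illustrative assumptions, not part of the original reproducer). Updating the attribute in the status section should mirror the query above:
    # crm_attribute -t status -n master-stateful -N node1 -v 10
    # crm_attribute -t status -n master-stateful -N node2 -v 5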

Promoted instance is running on node1.
    # crm_resource --locate -r stateful
    resource stateful is running on: node2
    resource stateful is running on: node1 Master

There is no default resource-stickiness. Note: This display seems buggy.
    # pcs resource defaults
    Meta Attrs: rsc_defaults-meta_attributes


Now we simulate the node1 instance being in stopped state by injecting a successful stop operation. This may happen transiently in a live cluster when the promoted instance fails its monitor operation and tries to recover and re-promote in place.

Instead of simply starting and re-promoting the stateful instance on node1, the scheduler relocates node2's instance to node1 and starts a new instance on node2. This requires the resource to stop and restart on node2.
    # crm_simulate -LR --op-inject stateful_stop_0@node1=0
    ...
    Performing requested modifications
     + Injecting stateful_stop_0@node1=0 into the configuration
    
    Transition Summary:
     * Move       stateful:0     ( Slave node2 -> Master node1 )  
     * Start      stateful:1     (                       node2 )  


Now we repeat the test with stickiness greater than or equal to the difference between master scores. Here, we set stickiness so that node2's promotable score + stickiness == node1's promotable score.
    # pcs resource defaults resource-stickiness=5
    # crm_simulate -LR --op-inject stateful_stop_0@node1=0
    ...
    Performing requested modifications
     + Injecting stateful_stop_0@node1=0 into the configuration
    
    Transition Summary:
     * Promote    stateful:1     ( Stopped -> Master node1 ) 


Finally, we reduce the stickiness by one point and test again, so that node2's promotable score + stickiness < node1's promotable score, to demonstrate that this is the source of the issue. The behavior is the same as when we had no stickiness.
    # pcs resource defaults
    Meta Attrs: rsc_defaults-meta_attributes
      resource-stickiness=4
    # crm_simulate -LR --op-inject stateful_stop_0@node1=0
    ...
    Performing requested modifications
     + Injecting stateful_stop_0@node1=0 into the configuration
    
    Transition Summary:
     * Move       stateful:0     ( Slave node2 -> Master node1 )  
     * Start      stateful:1     (                       node2 )  


I believe it's been this way for a long time, possibly forever. We've often seen "Pre-allocation failed" messages in SAP HANA clusters (which use promotable clones), going back to RHEL 7. After such a message, the non-promoted instance relocates, causing a stop and start on the non-promoted node. This can add several minutes of downtime on an SAP HANA cluster, and it can also mess up the state of some node attributes that the SAPHana resource agent sets upon startup.

-----

Version-Release number of selected component (if applicable):

pacemaker-2.0.4-6.el8_3.1 / master

-----

How reproducible:

Always

-----

Steps to Reproduce:

1. Configure a promotable clone resource (e.g., named "stateful") in a two-node cluster (see the configuration sketch after these steps).
2. Set a promotable score for the resource on each node, with each node having a different score.
3. Inject a stop operation for the resource on the promoted node. As long as the promotable scores are unchanged, this will trigger the simulation to try to start and promote the resource in place. For example, if the promoted instance was running on node1, inject the stop operation there, and Pacemaker should schedule the resource to promote again on node1.

    # crm_simulate -LR --op-inject stateful_stop_0@node1=0
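
As a configuration sketch for steps 1 and 2: the promotable clone can be created as in comment 21 below, and the master-stateful scores can then be seeded or checked with crm_attribute as shown in the Description (resource and node names are assumptions for illustration):

    # pcs resource create stateful ocf:pacemaker:Stateful promotable
    # crm_resource --locate -r stateful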

-----

Actual results:

Pacemaker schedules the non-promoted instance to relocate to the node with the higher promotable score and promote there. It schedules a new instance to start on the non-promoted node.

-----

Expected results:

Pacemaker leaves the non-promoted instance in place and schedules another instance to promote on the node with the higher promotable score.

-----

Additional info:

A workaround is to set a resource-stickiness score higher than the difference between promotable scores.
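
For example, with the scores used above (10 on node1, 5 on node2, a difference of 5), setting a default stickiness of 6 with the same pcs command used earlier in this report avoids the relocation; the demonstration above suggests a value equal to the difference (5) is also sufficient:

    # pcs resource defaults resource-stickiness=6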

Comment 1 Reid Wahl 2021-02-20 07:07:52 UTC
I have a basic understanding of the issue right now, but I'm not sure yet of what needs to be done to fix it. It's likely that Ken will beat me to it, even if he doesn't get around to it for a while ;)

There seems to be little differentiation (at best a blurry line) between the promotable score and the rest of the contributors to the allocation score (e.g., stickiness, constraints, etc.).

Assume that the stateful resource is promoted on node1 and non-promoted on node2. As background, before we inject a stop operation on node1, stateful:0 is in Master state on node1, and stateful:1 is in Slave state on node2. After the stop operation, stateful:0 is found in Stopped state on node1 and stateful:0 is found in Slave state on node2; stateful:1 is no longer used since there's only one active instance.

# # Before
(unpack_lrm_resource)   trace: Unpacking lrm_resource for stateful on node1
(find_anonymous_clone)  trace: Resource stateful:0, empty slot
(unpack_find_resource)  debug: Internally renamed stateful on node1 to stateful:0
(process_rsc_state)     trace: Resource stateful:0 is Master on node1: on_fail=ignore

(unpack_lrm_resource)   trace: Unpacking lrm_resource for stateful on node2
(find_anonymous_clone)  trace: Resource stateful:1, empty slot
(unpack_find_resource)  debug: Internally renamed stateful on node2 to stateful:1
(process_rsc_state)     trace: Resource stateful:1 is Slave on node2: on_fail=ignore

# # After
(unpack_lrm_resource)   trace: Unpacking lrm_resource for stateful on node1
(find_anonymous_clone)  trace: Resource stateful:0, empty slot
(unpack_find_resource)  debug: Internally renamed stateful on node1 to stateful:0
(process_rsc_state)     trace: Resource stateful:0 is Stopped on node1: on_fail=ignore

(unpack_lrm_resource)   trace: Unpacking lrm_resource for stateful on node2
(find_anonymous_clone)  trace: Resource stateful:0, empty slot
(process_rsc_state)     trace: Resource stateful:0 is Slave on node2: on_fail=ignore


When there's no stickiness configured, a clone instance gets a default stickiness score of 1 for its current node. In this case, node1's promotable score is 10, node2's promotable score is 5, the resource is stopped (via injected operation) on node1, and the resource is in non-promoted state on node2.

(pcmk__clone_allocate)  trace: pcmk__clone_allocate: stateful:0 allocation score on node1: 10  # promotable score for node1
(pcmk__clone_allocate)  trace: pcmk__clone_allocate: stateful:0 allocation score on node2: 6   # promotable score for node2, plus default stickiness of 1
(pcmk__clone_allocate)  trace: pcmk__clone_allocate: stateful:1 allocation score on node1: 10  # promotable score for node1
(pcmk__clone_allocate)  trace: pcmk__clone_allocate: stateful:1 allocation score on node2: 5   # promotable score for node2


With both instances scoring the same on node1 and stateful:0 appearing before stateful:1 in the sort order, we allocate stateful:0 to node1. We promote it there thanks to node1's higher promotable score.

Recall that stateful:0 was previously active on node2. Since we just allocated stateful:0 to node1, we now have to move stateful:0 from node2 to node1 and then start stateful:1 on node2. As far as I can tell, this is entirely unnecessary since the clone is anonymous (globally-unique=false). One instance is the same as another aside from the rsc->id.

(LogAction)     notice:  * Move       stateful:0     ( Slave node2 -> Master node1 )  
(LogAction)     notice:  * Start      stateful:1     (                       node2 )  


On the other hand, when the default resource-stickiness is set to 6 (or even 5), node2's promotable score plus the stickiness is greater than or equal to node1's promotable score. So if we repeat the scenario shown above, stateful:0's allocation score for node2 is at least its allocation score for node1, and the scheduler doesn't move it. Instead, it leaves stateful:0 on node2, and it starts and promotes stateful:1 on node1. In this case, there's no downtime for the non-promoted stateful resource on node2.

(pcmk__clone_allocate)  trace: pcmk__clone_allocate: stateful:0 allocation score on node1: 10  # promotable score for node1
(pcmk__clone_allocate)  trace: pcmk__clone_allocate: stateful:0 allocation score on node2: 11  # promotable score for node2, plus stickiness of 6
(pcmk__clone_allocate)  trace: pcmk__clone_allocate: stateful:1 allocation score on node1: 10  # promotable score for node1
(pcmk__clone_allocate)  trace: pcmk__clone_allocate: stateful:1 allocation score on node2: 5   # promotable score for node2
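
These per-instance allocation scores do not require trace logging to see; adding crm_simulate's --show-scores option to the same injection command used above should print equivalent allocation-score lines (node and resource names as in this reproducer):
    # crm_simulate -LR --show-scores --op-inject stateful_stop_0@node1=0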


So at a high level, it seems to me that we need to prevent stateful:0 from relocating based on the promotable score (or based on the overall allocation score) -- when we have the option of simply starting a different instance (e.g., stateful:1) on the to-be-started-and-promoted node, instead of disrupting things.

Comment 4 Ken Gaillot 2021-03-10 23:07:33 UTC
(In reply to Reid Wahl from comment #0)
> We've often
> seen "Pre-allocation failed" messages in SAP HANA clusters (which use
> promotable clones), going back to RHEL 7.

As an aside, that message was always unnecessarily alarming. In recent versions, it's info level rather than notice, and looks like "Not pre-allocating rsc1:0 to node1 because node2 is better".

Comment 6 Ken Gaillot 2022-11-22 17:44:29 UTC
Bumping to 8.9 due to QA capacity

Comment 13 Ken Gaillot 2023-07-12 14:15:16 UTC
Fixed in upstream main branch as of commit b561198

Comment 21 Markéta Smazová 2023-08-11 13:59:27 UTC
after fix:
---------
>   [root@virt-026 ~]# rpm -q pacemaker
>   pacemaker-2.1.6-6.el8.x86_64

Configure a promotable clone resource in a two-node cluster:
>   [root@virt-026 ~]# pcs resource create stateful ocf:pacemaker:Stateful promotable

Each node has a different promotable score. Node "virt-026" has promotable score 10 and node "virt-025" has promotable score 5:
>   [root@virt-026 ~]# pcs status --full
>   Cluster name: STSRHTS23384
>   Cluster Summary:
>     * Stack: corosync (Pacemaker is running)
>     * Current DC: virt-025 (1) (version 2.1.6-6.el8-6fdc9deea29) - partition with quorum
>     * Last updated: Fri Aug 11 15:51:16 2023 on virt-025
>     * Last change:  Fri Aug 11 15:51:09 2023 by root via cibadmin on virt-025
>     * 2 nodes configured
>     * 4 resource instances configured

>   Node List:
>     * Node virt-025 (1): online, feature set 3.17.4
>     * Node virt-026 (2): online, feature set 3.17.4

>   Full List of Resources:
>     * fence-virt-025	(stonith:fence_xvm):	 Started virt-025
>     * fence-virt-026	(stonith:fence_xvm):	 Started virt-026
>     * Clone Set: stateful-clone [stateful] (promotable):
>       * stateful	(ocf::pacemaker:Stateful):	 Master virt-026
>       * stateful	(ocf::pacemaker:Stateful):	 Slave virt-025

>   Node Attributes:
>     * Node: virt-025 (1):
>       * master-stateful                 	: 5
>     * Node: virt-026 (2):
>       * master-stateful                 	: 10

>   Migration Summary:

>   Tickets:

>   PCSD Status:
>     virt-025: Online
>     virt-026: Online

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

There are no resource defaults:
>   [root@virt-026 ~]# pcs resource defaults
>   No defaults set

Inject a stop operation simulation for the resource on the promoted node "virt-026":
>   [root@virt-026 ~]# crm_simulate -LR --op-inject stateful_stop_0@virt-026=0
>   Current cluster status:
>     * Node List:
>       * Online: [ virt-025 virt-026 ]

>     * Full List of Resources:
>       * fence-virt-025	(stonith:fence_xvm):	 Started virt-025
>       * fence-virt-026	(stonith:fence_xvm):	 Started virt-026
>       * Clone Set: stateful-clone [stateful] (promotable):
>         * Masters: [ virt-026 ]
>         * Slaves: [ virt-025 ]

>   Performing Requested Modifications:
>     * Injecting stateful_stop_0@virt-026=0 into the configuration

>   Transition Summary:
>     * Promote    stateful:1     ( Stopped -> Master virt-026 )


RESULT: Resource instance did not move. It was re-promoted on node "virt-026" (the node with the higher promotable score).


marking VERIFIED in pacemaker-2.1.6-6.el8

Comment 23 errata-xmlrpc 2023-11-14 15:32:34 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:6970

