Bug 1931023
| Summary: | A non-promoted clone instance gets relocated when a cloned resource starts on a node with higher promotable score |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | pacemaker |
| Version: | 8.3 |
| Hardware: | All |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Reporter: | Reid Wahl <nwahl> |
| Assignee: | Reid Wahl <nwahl> |
| QA Contact: | cluster-qe <cluster-qe> |
| Docs Contact: | Steven J. Levine <slevine> |
| CC: | amemon, cfeist, cluster-maint, jserrano, kgaillot, msmazova, phagara, slevine |
| Target Milestone: | rc |
| Target Release: | 8.9 |
| Keywords: | Triaged |
| Flags: | pm-rhel: mirror+ |
| Fixed In Version: | pacemaker-2.1.6-4.el8 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2023-11-14 15:32:34 UTC |
| Bug Blocks: | 2222055 |

Doc Text:

> .Unpromoted clone instances no longer restart unnecessarily
> Previously, promotable clone instances were assigned in numerical order, with promoted instances first. As a result, if a promoted clone instance needed to start, an unpromoted instance in some cases restarted unexpectedly, because the instance numbers changed. With this fix, roles are considered when assigning instance numbers to nodes, and as a result no unnecessary restarts occur.
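The Doc Text's fix ("roles are considered when assigning instance numbers") can be illustrated with a small, hypothetical Python model. This is not Pacemaker's actual implementation (the real logic is C code in the scheduler); the function names and data shapes here are invented purely for illustration:

```python
# Hypothetical model only -- Pacemaker's real logic is C code in its
# scheduler. "history" is a list of (node, role) pairs from resource
# history; "previous" maps a node to the instance number it last hosted.

def number_instances_positional(history):
    """Pre-fix idea: hand out instance numbers in discovery order, letting
    a stopped instance's number be reclaimed -- so numbers shift between
    nodes when one instance stops."""
    mapping, i = {}, 0
    for node, role in history:
        if role == "Stopped":
            continue
        mapping[node] = f"stateful:{i}"
        i += 1
    return mapping

def number_instances_role_aware(history, previous):
    """Post-fix idea: a node still running an instance keeps its previous
    instance number, so no instance appears to relocate."""
    mapping = {node: previous[node]
               for node, role in history
               if role != "Stopped" and node in previous}
    used = set(mapping.values())
    free = (f"stateful:{i}" for i in range(len(history))
            if f"stateful:{i}" not in used)
    for node, role in history:
        if role != "Stopped" and node not in mapping:
            mapping[node] = next(free)
    return mapping
```

In the scenario traced in the analysis comment below (stateful:0 stopped on node1, an instance still active on node2), the positional scheme renumbers node2's instance to stateful:0, which the scheduler then "moves", while the role-aware scheme leaves node2 holding stateful:1.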
I have a basic understanding of the issue right now, but I'm not sure yet of what needs to be done to fix it. It's likely that Ken will beat me to it, even if he doesn't get around to it for a while ;)

There seems to be little differentiation (at best a blurry line) between the promotable score and the rest of the contributors to the allocation score (e.g., stickiness, constraints, etc.).

Assume that the stateful resource is promoted on node1 and non-promoted on node2. As background, before we inject a stop operation on node1, stateful:0 is in Master state on node1, and stateful:1 is in Slave state on node2. After the stop operation, stateful:0 is found in Stopped state on node1 and stateful:0 is found in Slave state on node2; stateful:1 is no longer used since there's only one active instance.

# Before
(unpack_lrm_resource)  trace: Unpacking lrm_resource for stateful on node1
(find_anonymous_clone) trace: Resource stateful:0, empty slot
(unpack_find_resource) debug: Internally renamed stateful on node1 to stateful:0
(process_rsc_state)    trace: Resource stateful:0 is Master on node1: on_fail=ignore
(unpack_lrm_resource)  trace: Unpacking lrm_resource for stateful on node2
(find_anonymous_clone) trace: Resource stateful:1, empty slot
(unpack_find_resource) debug: Internally renamed stateful on node2 to stateful:1
(process_rsc_state)    trace: Resource stateful:1 is Slave on node2: on_fail=ignore

# After
(unpack_lrm_resource)  trace: Unpacking lrm_resource for stateful on node1
(find_anonymous_clone) trace: Resource stateful:0, empty slot
(unpack_find_resource) debug: Internally renamed stateful on node1 to stateful:0
(process_rsc_state)    trace: Resource stateful:0 is Stopped on node1: on_fail=ignore
(unpack_lrm_resource)  trace: Unpacking lrm_resource for stateful on node2
(find_anonymous_clone) trace: Resource stateful:0, empty slot
(process_rsc_state)    trace: Resource stateful:0 is Slave on node2: on_fail=ignore

When there's no stickiness configured, a clone instance gets a default stickiness score of 1 for its current node. In this case, node1's promotable score is 10, node2's promotable score is 5, the resource is stopped (via injected operation) on node1, and the resource is in non-promoted state on node2.

(pcmk__clone_allocate) trace: pcmk__clone_allocate: stateful:0 allocation score on node1: 10   # promotable score for node1
(pcmk__clone_allocate) trace: pcmk__clone_allocate: stateful:0 allocation score on node2: 6    # promotable score for node2, plus default stickiness of 1
(pcmk__clone_allocate) trace: pcmk__clone_allocate: stateful:1 allocation score on node1: 10   # promotable score for node1
(pcmk__clone_allocate) trace: pcmk__clone_allocate: stateful:1 allocation score on node2: 5    # promotable score for node2

With both instances scoring 10 on node1 and stateful:0 appearing before stateful:1 in the sort order, we allocate stateful:0 to node1. We promote it there thanks to node1's higher promotable score. Recall that stateful:0 was previously active on node2. Since we just allocated stateful:0 to node1, we now have to move stateful:0 from node2 to node1 and then start stateful:1 on node2. As far as I can tell, this is entirely unnecessary since the clone is anonymous (globally-unique=false). One instance is the same as another besides the rsc->id.

(LogAction) notice:  * Move   stateful:0  ( Slave node2 -> Master node1 )
(LogAction) notice:  * Start  stateful:1  ( node2 )

On the other hand, when the default resource-stickiness is set to 6 (or even 5), node2's promotable score plus the stickiness is greater than or equal to node1's promotable score. So if we repeat the scenario shown above, stateful:0's allocation score for node2 is at least its allocation score for node1, and the scheduler doesn't move it. Instead, it leaves stateful:0 on node2, and it starts and promotes stateful:1 on node1. In this case, there's no downtime for the non-promoted stateful resource on node2.
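The two scenarios just described (no stickiness vs. a stickiness of 6) can be reduced to a toy score model. This is a deliberately simplified sketch, not Pacemaker's allocation algorithm; the promotable scores (10 and 5) are the ones used throughout this report:

```python
# Toy model of the allocation scores traced in this comment; this is NOT
# Pacemaker's actual scheduler. A node's score for an instance is its
# promotable score, plus stickiness if the instance already runs there.

PROMOTABLE = {"node1": 10, "node2": 5}

def allocation_score(node, current_node, stickiness):
    return PROMOTABLE[node] + (stickiness if node == current_node else 0)

def assign(instances, nodes, stickiness):
    """Greedily place instances in numerical order on their best-scoring
    free node (ties go to the first-listed node)."""
    taken, placement = set(), {}
    for inst, current in instances:
        best = max((n for n in nodes if n not in taken),
                   key=lambda n: allocation_score(n, current, stickiness))
        placement[inst] = best
        taken.add(best)
    return placement

# stateful:0 currently runs (non-promoted) on node2; stateful:1 is stopped.
instances = [("stateful:0", "node2"), ("stateful:1", None)]
```

With the default stickiness of 1, this model sends stateful:0 to node1 (the unnecessary move) and starts stateful:1 on node2; with a stickiness of 6, stateful:0 stays on node2 and stateful:1 goes to node1, matching the behavior described above.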
(pcmk__clone_allocate) trace: pcmk__clone_allocate: stateful:0 allocation score on node1: 10   # promotable score for node1
(pcmk__clone_allocate) trace: pcmk__clone_allocate: stateful:0 allocation score on node2: 11   # promotable score for node2, plus stickiness of 6
(pcmk__clone_allocate) trace: pcmk__clone_allocate: stateful:1 allocation score on node1: 10   # promotable score for node1
(pcmk__clone_allocate) trace: pcmk__clone_allocate: stateful:1 allocation score on node2: 5    # promotable score for node2

So at a high level, it seems to me that we need to prevent stateful:0 from relocating based on the promotable score (or based on the overall allocation score) -- when we have the option of simply starting a different instance (e.g., stateful:1) on the to-be-started-and-promoted node, instead of disrupting things.

-----

(In reply to Reid Wahl from comment #0)
> We've often seen "Pre-allocation failed" messages in SAP HANA clusters (which use promotable clones), going back to RHEL 7.

As an aside, that message was always unnecessarily alarming. In recent versions, it's info level rather than notice, and looks like "Not pre-allocating rsc1:0 to node1 because node2 is better".

-----

Bumping to 8.9 due to QA capacity.

-----

Fixed in upstream main branch as of commit b561198.

-----

After the fix:

> [root@virt-026 ~]# rpm -q pacemaker
> pacemaker-2.1.6-6.el8.x86_64

Configure a promotable clone resource in a two-node cluster:

> [root@virt-026 ~]# pcs resource create stateful ocf:pacemaker:Stateful promotable

Each node has a different promotable score.
Node "virt-026" has promotable score 10 and node "virt-025" has promotable score 5:

> [root@virt-026 ~]# pcs status --full
> Cluster name: STSRHTS23384
> Cluster Summary:
>   * Stack: corosync (Pacemaker is running)
>   * Current DC: virt-025 (1) (version 2.1.6-6.el8-6fdc9deea29) - partition with quorum
>   * Last updated: Fri Aug 11 15:51:16 2023 on virt-025
>   * Last change:  Fri Aug 11 15:51:09 2023 by root via cibadmin on virt-025
>   * 2 nodes configured
>   * 4 resource instances configured
>
> Node List:
>   * Node virt-025 (1): online, feature set 3.17.4
>   * Node virt-026 (2): online, feature set 3.17.4
>
> Full List of Resources:
>   * fence-virt-025 (stonith:fence_xvm): Started virt-025
>   * fence-virt-026 (stonith:fence_xvm): Started virt-026
>   * Clone Set: stateful-clone [stateful] (promotable):
>     * stateful (ocf::pacemaker:Stateful): Master virt-026
>     * stateful (ocf::pacemaker:Stateful): Slave virt-025
>
> Node Attributes:
>   * Node: virt-025 (1):
>     * master-stateful : 5
>   * Node: virt-026 (2):
>     * master-stateful : 10
>
> Migration Summary:
>
> Tickets:
>
> PCSD Status:
>   virt-025: Online
>   virt-026: Online
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

There are no resource defaults:

> [root@virt-026 ~]# pcs resource defaults
> No defaults set

Inject a stop operation simulation for the resource on the promoted node "virt-026":

> [root@virt-026 ~]# crm_simulate -LR --op-inject stateful_stop_0@virt-026=0
> Current cluster status:
>   * Node List:
>     * Online: [ virt-025 virt-026 ]
>
>   * Full List of Resources:
>     * fence-virt-025 (stonith:fence_xvm): Started virt-025
>     * fence-virt-026 (stonith:fence_xvm): Started virt-026
>     * Clone Set: stateful-clone [stateful] (promotable):
>       * Masters: [ virt-026 ]
>       * Slaves: [ virt-025 ]
>
> Performing Requested Modifications:
>   * Injecting stateful_stop_0@virt-026=0 into the configuration
>
> Transition Summary:
>   * Promote stateful:1 ( Stopped -> Master virt-026 )

RESULT: The resource instance did not move. It was re-promoted on node "virt-026" (the node with the higher promotable score).

Marking VERIFIED in pacemaker-2.1.6-6.el8.

-----

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:6970
Description of problem:

If an instance of a promotable clone is running on node2, and the clone is about to start and promote on node1 (which has a higher promotable score), then the non-promoted clone instance relocates. The effect is an unnecessary restart on the non-promoted node.

So far, for simplicity, this has only been tested on a two-node cluster with promotable-max defaulting to 1. I think a demonstration is easier to follow than a description.

The promotable score is 10 on node1 and 5 on node2.

# crm_attribute -G -t status -n master-stateful -N node1
scope=status  name=master-stateful value=10
# crm_attribute -G -t status -n master-stateful -N node2
scope=status  name=master-stateful value=5

The promoted instance is running on node1.

# crm_resource --locate -r stateful
resource stateful is running on: node2
resource stateful is running on: node1 Master

There is no default resource-stickiness. (Note: This display seems buggy.)

# pcs resource defaults
Meta Attrs: rsc_defaults-meta_attributes

Now we simulate the node1 instance being in stopped state by injecting a successful stop operation. This may happen transiently in a live cluster when the promoted instance fails its monitor operation and tries to recover and re-promote in place.

Instead of simply starting and re-promoting the stateful instance on node1, the scheduler relocates node2's instance to node1 and starts a new instance on node2. This requires the resource to stop and restart on node2.

# crm_simulate -LR --op-inject stateful_stop_0@node1=0
...
Performing requested modifications
 + Injecting stateful_stop_0@node1=0 into the configuration

Transition Summary:
 * Move    stateful:0  ( Slave node2 -> Master node1 )
 * Start   stateful:1  ( node2 )

Now we repeat the test with stickiness greater than or equal to the difference between master scores. Here, we set stickiness so that node2's promotable score + stickiness == node1's promotable score.
# pcs resource defaults resource-stickiness=5
# crm_simulate -LR --op-inject stateful_stop_0@node1=0
...
Performing requested modifications
 + Injecting stateful_stop_0@node1=0 into the configuration

Transition Summary:
 * Promote stateful:1  ( Stopped -> Master node1 )

Finally, we reduce the stickiness by one point and test again, so that node2's promotable score + stickiness < node1's promotable score, to demonstrate that this is the source of the issue. The behavior is the same as when we had no stickiness.

# pcs resource defaults
Meta Attrs: rsc_defaults-meta_attributes
  resource-stickiness=4
# crm_simulate -LR --op-inject stateful_stop_0@node1=0
...
Performing requested modifications
 + Injecting stateful_stop_0@node1=0 into the configuration

Transition Summary:
 * Move    stateful:0  ( Slave node2 -> Master node1 )
 * Start   stateful:1  ( node2 )

I believe it's been this way for a long time, possibly forever. We've often seen "Pre-allocation failed" messages in SAP HANA clusters (which use promotable clones), going back to RHEL 7. After such a message, the non-promoted instance relocates, causing a stop and start on the non-promoted node. This can add several minutes of downtime on an SAP HANA cluster, and it can also mess up the state of some node attributes that the SAPHana resource agent sets upon startup.

-----

Version-Release number of selected component (if applicable):
pacemaker-2.0.4-6.el8_3.1 / master

-----

How reproducible:
Always

-----

Steps to Reproduce:
1. Configure a promotable clone resource (e.g., named "stateful") in a two-node cluster.
2. Set a promotable score for the resource on each node, with each node having a different score.
3. Inject a stop operation for the resource on the promoted node. As long as the promotable scores remain the same, this will trigger the simulation to try to start and promote the resource in place.
For example, if the promoted instance was running on node1, inject the stop operation there, and Pacemaker should schedule the resource to promote again on node1.

# crm_simulate -LR --op-inject stateful_stop_0@node1=0

-----

Actual results:
Pacemaker schedules the non-promoted instance to relocate to the node with the higher promotable score and promote there. It schedules a new instance to start on the non-promoted node.

-----

Expected results:
Pacemaker leaves the non-promoted instance in place and schedules another instance to promote on the node with the higher promotable score.

-----

Additional info:
A workaround is to set a resource-stickiness score at least as large as the difference between promotable scores.
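The workaround in "Additional info" can be stated as a one-line rule. The helper below is hypothetical (it is not part of pcs or Pacemaker); it simply computes the smallest stickiness covering the spread between the configured promotable scores, which the demonstrations above show is enough to prevent the relocation:

```python
# Hypothetical helper, not part of pcs or Pacemaker. The workaround needs
# resource-stickiness >= (highest promotable score - lowest promotable
# score), so a running non-promoted instance's total score cannot be
# beaten by another node's bare promotable score.

def min_safe_stickiness(promotable_scores):
    values = promotable_scores.values()
    return max(values) - min(values)
```

For the scores used throughout this report (10 on node1, 5 on node2) this yields 5, matching the `pcs resource defaults resource-stickiness=5` test above, where the relocation no longer occurred.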