Bug 1876173
| Summary: | A resource in a negatively colocated group can remain stopped if it hits its migration threshold | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | Steven J. Levine <slevine> |
| Priority: | high | | |
| Version: | 8.2 | CC: | cfeist, cluster-maint, jrehova, kgaillot, msmazova, slevine |
| Target Milestone: | rc | Keywords: | Reopened, Triaged |
| Target Release: | 8.9 | Flags: | pm-rhel: mirror+ |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-2.1.6-1.el8 | Doc Type: | Enhancement |
| Doc Text: | Pacemaker's scheduler now tries to satisfy all mandatory colocation constraints before trying to satisfy optional colocation constraints. Previously, colocation constraints were considered one by one regardless of whether they were mandatory or optional. This meant that certain resources could be unable to run even though a node assignment was possible. Pacemaker's scheduler now tries to satisfy all mandatory colocation constraints, including the implicit constraints between group members, before trying to satisfy optional colocation constraints. As a result, resources with a mix of optional and mandatory colocation constraints are now more likely to be able to run. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-11-14 15:32:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | 2.1.6 |
| Embargoed: | | | |
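The Doc Text above distinguishes mandatory from optional colocation constraints. As a hedged illustration of the difference (the resource names rscA and rscB are placeholders), a mandatory constraint uses an INFINITY score, while an optional one uses a finite score that the scheduler may trade off against other placement factors:

```
# Mandatory colocation: rscB may only run on the node where rscA runs.
pcs constraint colocation add rscB with rscA score=INFINITY

# Optional colocation: the scheduler prefers to keep dummyb away from dummya,
# but may place them on the same node if nothing better is available.
pcs constraint colocation add dummyb with dummya score=-5000
```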
I just realized that the cluster is prone to this issue even without migration-threshold, if a start operation fails during recovery. I removed the migration-threshold meta attribute and verified.
```
  * Resource Group: dummya:
    * dummya_1 (ocf::heartbeat:Dummy): Started node1
    * dummya_2 (ocf::heartbeat:Dummy): Started node1
  * Resource Group: dummyb:
    * dummyb_1 (ocf::heartbeat:Dummy): Started node2
    * dummyb_2 (ocf::heartbeat:Dummy): Stopped

Failed Resource Actions:
  * dummyb_2_start_0 on node2 'error' (1): call=245, status='complete', exitreason='', last-rc-change='2020-09-05 20:20:12 -07:00', queued=0ms, exec=10ms
```
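For reference, removing the meta attribute as mentioned above can be done with pcs by passing an empty value; a minimal sketch, assuming the resource names from this report:

```
# An empty value removes the migration-threshold meta attribute from dummyb_2.
pcs resource meta dummyb_2 migration-threshold=
```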
I found a configuration hack that seems to work. This approach uses a placeholder dummy resource, in a technique similar to one that Ken proposed in a separate email thread (http://post-office.corp.redhat.com/archives/cluster-list/2020-May/msg00066.html), except this time to achieve negative colocation with particular behavior. As long as the placeholder resource stays online (which it should, barring user error), I think this will work.

```
[root@fastvm-rhel-8-0-23 pacemaker]# pcs config | egrep '(Group|Resource|Meta Attrs):'
 Group: dummya
  Resource: dummya_1 (class=ocf provider=heartbeat type=Dummy)
  Resource: dummya_2 (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: migration-threshold=1
 Group: dummyb
  Resource: dummyb_1 (class=ocf provider=heartbeat type=Dummy)
  Resource: dummyb_2 (class=ocf provider=heartbeat type=Dummy)
 Resource: placeholder (class=ocf provider=heartbeat type=Dummy)

[root@fastvm-rhel-8-0-23 pacemaker]# pcs constraint colocation
Colocation Constraints:
  placeholder with dummya (score:-INFINITY)
  dummyb with placeholder (score:5000)

[root@fastvm-rhel-8-0-23 pacemaker]# pcs status
...
  * Resource Group: dummya:
    * dummya_1 (ocf::heartbeat:Dummy): Started node1
    * dummya_2 (ocf::heartbeat:Dummy): Started node1
  * Resource Group: dummyb:
    * dummyb_1 (ocf::heartbeat:Dummy): Started node2
    * dummyb_2 (ocf::heartbeat:Dummy): Started node2
  * placeholder (ocf::heartbeat:Dummy): Started node2

[root@fastvm-rhel-8-0-23 pacemaker]# crm_resource --fail --resource dummyb_2 --node node2
#
# then start operation fails during recovery
#

[root@fastvm-rhel-8-0-23 pacemaker]# pcs status
  * Resource Group: dummya:
    * dummya_1 (ocf::heartbeat:Dummy): Started node1
    * dummya_2 (ocf::heartbeat:Dummy): Started node1
  * Resource Group: dummyb:
    * dummyb_1 (ocf::heartbeat:Dummy): Started node1
    * dummyb_2 (ocf::heartbeat:Dummy): Started node1
  * placeholder (ocf::heartbeat:Dummy): Started node2

Failed Resource Actions:
  * dummyb_2_start_0 on node2 'error' (1): call=269, status='complete', exitreason='', last-rc-change='2020-09-05 20:50:06 -07:00', queued=0ms, exec=9ms
```

Placed this workaround in KB 5374451.

(In reply to Reid Wahl from comment #2)
> I found a configuration hack that seems to work.

After talking to the customer about the config hack with a dummy resource, a lot of the counter-intuitive nature of this comes down to the fact that the following two constraints behave differently when a non-base resource in the group reaches its migration threshold:

(a) A constraint of -5000 with ASCS (the original config)
(b) A constraint of 5000 with "not ASCS" (the config that uses a dummy resource as an intermediate)

The group **cannot** fail over with (a) as long as the base resource is **allowed** to run in its current location. The group **can** fail over with (b) as long as the base resource is **allowed** to run in another location (and a non-base resource is **not allowed** to run in the current location after hitting migration-threshold).

And as noted earlier, that might not be changeable within the current constraints scheme.

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
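For clarity, the placeholder workaround shown above could be created with commands along these lines. This is a sketch, not a verbatim excerpt from the report, and it assumes the dummya and dummyb groups already exist:

```
# Placeholder resource that effectively stands for "not dummya".
pcs resource create placeholder ocf:heartbeat:Dummy

# Keep the placeholder strictly away from dummya...
pcs constraint colocation add placeholder with dummya score=-INFINITY

# ...and give dummyb a finite preference for wherever the placeholder runs.
pcs constraint colocation add dummyb with placeholder score=5000

# Contrast with the original configuration (a), which colocated dummyb
# directly with dummya using a finite negative score:
#   pcs constraint colocation add dummyb with dummya score=-5000
```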
Fixed in upstream 2.1 branch as of commit 0eae7d53b

added docs

Version of pacemaker:

> [root@virt-016:~]# rpm -q pacemaker
> pacemaker-2.1.6-2.el8.x86_64

Setting of cluster:

> [root@virt-016:~]# pcs status
> Cluster name: STSRHTS25395
> Cluster Summary:
>   * Stack: corosync (Pacemaker is running)
>   * Current DC: virt-016 (version 2.1.6-2.el8-6fdc9deea29) - partition with quorum
>   * Last updated: Mon Jul 10 22:41:08 2023 on virt-016
>   * Last change: Mon Jul 10 22:38:16 2023 by root via cibadmin on virt-016
>   * 2 nodes configured
>   * 6 resource instances configured
>
> Node List:
>   * Online: [ virt-016 virt-018 ]
>
> Full List of Resources:
>   * fence-virt-016 (stonith:fence_xvm): Started virt-016
>   * fence-virt-018 (stonith:fence_xvm): Started virt-018
>   * Resource Group: dummya:
>     * dummya_1 (ocf::heartbeat:Dummy): Started virt-016
>     * dummya_2 (ocf::heartbeat:Dummy): Started virt-016
>   * Resource Group: dummyb:
>     * dummyb_1 (ocf::heartbeat:Dummy): Started virt-018
>     * dummyb_2 (ocf::heartbeat:Dummy): Started virt-018
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Resources in the cluster:

> [root@virt-016:~]# crm_resource --list
> Full List of Resources:
>   * fence-virt-016 (stonith:fence_xvm): Started
>   * fence-virt-018 (stonith:fence_xvm): Started
>   * Resource Group: dummya:
>     * dummya_1 (ocf::heartbeat:Dummy): Started
>     * dummya_2 (ocf::heartbeat:Dummy): Started
>   * Resource Group: dummyb:
>     * dummyb_1 (ocf::heartbeat:Dummy): Started
>     * dummyb_2 (ocf::heartbeat:Dummy): Started

Setting meta attribute migration-threshold=1 for resource dummyb_2:

> [root@virt-016:~]# pcs resource create dummyb_2 ocf:heartbeat:Dummy meta migration-threshold=1
> [root@virt-016:~]# pcs cluster cib
> ...
>       </primitive>
>       <primitive class="ocf" id="dummyb_2" provider="heartbeat" type="Dummy">
>         <meta_attributes id="dummyb_2-meta_attributes">
>           <nvpair id="dummyb_2-meta_attributes-migration-threshold" name="migration-threshold" value="1"/>
>         </meta_attributes>
>         <operations>
>           <op id="dummyb_2-migrate_from-interval-0s" interval="0s" name="migrate_from" timeout="20s"/>
>           <op id="dummyb_2-migrate_to-interval-0s" interval="0s" name="migrate_to" timeout="20s"/>
>           <op id="dummyb_2-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>           <op id="dummyb_2-reload-interval-0s" interval="0s" name="reload" timeout="20s"/>
>           <op id="dummyb_2-start-interval-0s" interval="0s" name="start" timeout="20s"/>
>           <op id="dummyb_2-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
>         </operations>
>       </primitive>
> ...

Setting colocation constraint to -5000:

> [root@virt-016:~]# pcs constraint colocation add dummyb with dummya score=-5000
> [root@virt-016:~]# pcs constraint colocation
> Colocation Constraints:
>   dummyb with dummya (score:-5000)

Failing resource dummyb_2 on node virt-018:

> [root@virt-016:~]# crm_resource --fail --resource dummyb_2 --node virt-018
> Waiting for 1 reply from the controller
> ... got reply (done)

Checking if group dummyb is moved to another node:

> [root@virt-016:~]# pcs status
> Cluster name: STSRHTS25395
> Cluster Summary:
>   * Stack: corosync (Pacemaker is running)
>   * Current DC: virt-016 (version 2.1.6-2.el8-6fdc9deea29) - partition with quorum
>   * Last updated: Mon Jul 10 22:42:34 2023 on virt-016
>   * Last change: Mon Jul 10 22:38:16 2023 by root via cibadmin on virt-016
>   * 2 nodes configured
>   * 6 resource instances configured
>
> Node List:
>   * Online: [ virt-016 virt-018 ]
>
> Full List of Resources:
>   * fence-virt-016 (stonith:fence_xvm): Started virt-016
>   * fence-virt-018 (stonith:fence_xvm): Started virt-018
>   * Resource Group: dummya:
>     * dummya_1 (ocf::heartbeat:Dummy): Started virt-016
>     * dummya_2 (ocf::heartbeat:Dummy): Started virt-016
>   * Resource Group: dummyb:
>     * dummyb_1 (ocf::heartbeat:Dummy): Started virt-016
>     * dummyb_2 (ocf::heartbeat:Dummy): Started virt-016
>
> Failed Resource Actions:
>   * dummyb_2_asyncmon_0 on virt-018 'error' (1): call=66, status='complete', exitreason='Simulated failure', last-rc-change='Mon Jul 10 22:42:17 2023', queued=0ms, exec=0ms
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Result: Group dummyb was moved from virt-018 to virt-016.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:6970
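As a side note for anyone retesting this scenario: allocation scores like the ones quoted in the original description below can be inspected without disturbing live resources. A minimal sketch using crm_simulate (the temporary file path is arbitrary):

```
# Show the scheduler's allocation scores against the live CIB.
crm_simulate --live-check --show-scores

# Or work from a saved copy of the CIB.
pcs cluster cib > /tmp/cib.xml
crm_simulate --xml-file /tmp/cib.xml --show-scores
```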
Description of problem:

Assume there are two groups with non-INFINITY negative colocation constraints between them, so that they prefer to run on separate nodes but are allowed to run on the same node. Say "dummyb with dummya -5000". Let a resource in the colocated group (dummyb) that is not at the base of the group have a migration threshold of 1. Call that resource dummyb_2. When that resource fails, it remains in the Stopped state. It wants to migrate due to its migration threshold, but it cannot, because the resources closer to the base of the group prefer to remain on the current node.

```
[root@fastvm-rhel-8-0-23 pacemaker]# pcs config | egrep '(Group|Resource|Meta Attrs):'
 Group: dummya
  Resource: dummya_1 (class=ocf provider=heartbeat type=Dummy)
  Resource: dummya_2 (class=ocf provider=heartbeat type=Dummy)
 Group: dummyb
  Resource: dummyb_1 (class=ocf provider=heartbeat type=Dummy)
  Resource: dummyb_2 (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: migration-threshold=1

[root@fastvm-rhel-8-0-23 pacemaker]# pcs constraint colocation
Colocation Constraints:
  dummyb with dummya (score:-5000)

[root@fastvm-rhel-8-0-23 pacemaker]# pcs status
...
  * Resource Group: dummya:
    * dummya_1 (ocf::heartbeat:Dummy): Started node1
    * dummya_2 (ocf::heartbeat:Dummy): Started node1
  * Resource Group: dummyb:
    * dummyb_1 (ocf::heartbeat:Dummy): Started node2
    * dummyb_2 (ocf::heartbeat:Dummy): Started node2

[root@fastvm-rhel-8-0-23 pacemaker]# crm_resource --fail --resource dummyb_2 --node node2
Waiting for 1 reply from the controller. OK

[root@fastvm-rhel-8-0-23 pacemaker]# pcs status
  * Resource Group: dummya:
    * dummya_1 (ocf::heartbeat:Dummy): Started node1
    * dummya_2 (ocf::heartbeat:Dummy): Started node1
  * Resource Group: dummyb:
    * dummyb_1 (ocf::heartbeat:Dummy): Started node2
    * dummyb_2 (ocf::heartbeat:Dummy): Stopped
```

This behavior is not reproducible if migration-threshold=1 is on dummya_2 and we cause resource dummya_2 to fail, since group dummya is placed first.

This isn't obviously a bug, as the behavior makes sense given the constraints and the migration-threshold:

```
pcmk__native_allocate: dummyb_1 allocation score on node1: -5000
pcmk__native_allocate: dummyb_1 allocation score on node2: 0
pcmk__native_allocate: dummyb_2 allocation score on node1: -INFINITY
pcmk__native_allocate: dummyb_2 allocation score on node2: -INFINITY
```

So, irrespective of the difficulty of doing so, I don't know that we would even want to change Pacemaker's behavior. However, it would be really nice to configure the cluster so that the dummyb group migrates if dummyb_2 hits its migration-threshold, while still respecting the negative colocation constraint in general. Right now, dummyb_1 blocks that from happening. Maybe there's a way to rig the configuration to do that.

-----

Version-Release number of selected component (if applicable):

master, and pacemaker-1.1.21-4.el7

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create a configuration like the one in the description.
2. Cause a failure of dummyb_2.

-----

Actual results:

dummyb_2 remains stopped, while dummyb_1 continues running on its original node.

-----

Expected results:

The dummyb group migrates to another node.

-----

Additional info:

This is holding up a customer's SAP NetWeaver deployment. We have a consultant working with them on the configuration. A negative colocation constraint between the ASCS group (placed first) and the ERS group (placed second) is a key part of an SAP NetWeaver Pacemaker configuration.
The new factor introduced in this deployment is the migration-threshold attribute.
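For completeness, the reproduction configuration described above could be created roughly as follows. This is a sketch, not a verbatim excerpt from the report; it assumes a two-node cluster with nodes named node1 and node2, as in the original output:

```
# Two dummy groups; the non-base member of dummyb gets migration-threshold=1.
pcs resource create dummya_1 ocf:heartbeat:Dummy --group dummya
pcs resource create dummya_2 ocf:heartbeat:Dummy --group dummya
pcs resource create dummyb_1 ocf:heartbeat:Dummy --group dummyb
pcs resource create dummyb_2 ocf:heartbeat:Dummy meta migration-threshold=1 --group dummyb

# Finite negative colocation: the groups prefer separate nodes but may share one.
pcs constraint colocation add dummyb with dummya score=-5000

# Trigger a failure of the non-base member of the colocated group.
crm_resource --fail --resource dummyb_2 --node node2
```

Before the fix, dummyb_2 then stays Stopped as shown in the description; with pacemaker-2.1.6, the whole dummyb group moves to the other node, as in the QA verification above.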