Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1757951

Summary: A VirtualDomain resource kept stopping and starting after a failed migration of resource
Product: Red Hat Enterprise Linux 7
Reporter: Shane Bradley <sbradley>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.6
CC: cluster-maint, msmazova, phagara, sbradley
Target Milestone: pre-dev-freeze
Target Release: 7.9
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pacemaker-1.1.22-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: Pacemaker incorrectly considered a resource as being in a "dangling migration" on a node if a migration from that node failed, but a later migration back to, or start on, that node succeeded.
Consequence: A resource in that situation would be unnecessarily stopped at one cluster-recheck-interval and started at the next, looping indefinitely.
Fix: Pacemaker properly considers such a resource as running successfully on the node.
Result: The resource is not restarted unnecessarily.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-29 20:03:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 1 Ken Gaillot 2019-10-02 20:53:44 UTC
Fixed in the upstream 1.1 branch by commit dbe2eb3f

Reproducer for QA:
1. Configure a cluster with at least two nodes.
2. Configure a low cluster-recheck-interval (2 or 3 minutes would be good) so you don't have to wait half an hour to see results.
3. Configure an ocf:pacemaker:Dummy resource with allow-migrate=true and an op_sleep large enough for the next step, and wait for it to start.
4. Add a finite constraint preference for the resource on a node it's not running on, so that it begins to live migrate, but before the migrate_to action can complete on the origin node (i.e. within the configured op_sleep), kill -9 the resource agent process there to fail the action. The cluster should stop the resource then start it on the preferred node.
5. Add a stronger constraint preference for the resource on the original node. (Do not clean up the resource.)
6. The cluster should live-migrate the resource back to the original node. Before the fix, it will stop the resource at the next recheck interval, and start it again at the next recheck interval after that, repeating indefinitely. After the fix, the unnecessary restarts should not occur.
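The steps above can be condensed into a command sequence (a sketch only, assuming a running pcs-managed test cluster; the resource name "dummy1" and node names "nodeA"/"nodeB" are placeholders, and the concrete commands mirror those used in the QA verification later in this bug):

```shell
# Step 2: shorten the recheck interval so the stop/start loop shows up quickly.
pcs property set cluster-recheck-interval=120

# Step 3: a Dummy resource that takes 15s per action, with live migration allowed.
pcs resource create dummy1 ocf:pacemaker:Dummy op_sleep=15 \
    op monitor interval=20 timeout=30 meta allow-migrate=true

# Step 4: prefer the other node so a live migration starts, then (on the origin
# node, within the op_sleep window) kill the agent's migrate_to process.
# The QA transcript uses ps + kill -9 on the PID; pkill -f is an equivalent shortcut.
pcs constraint location dummy1 prefers nodeB=1
pkill -9 -f 'Dummy migrate_to'

# Step 5: prefer the original node more strongly; do NOT clean up the failure.
pcs constraint location dummy1 prefers nodeA=2

# Step 6: watch for unnecessary stop/start cycles at each recheck interval.
watch -n 10 pcs status
```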

Comment 3 Ken Gaillot 2019-10-02 21:00:40 UTC
In case anyone runs into this before a fix is available, the workaround is to clean up the failed migration (pcs resource cleanup).

Comment 4 Patrik Hagara 2020-02-21 17:55:11 UTC
qa_ack+, reproducer in comment#1

Comment 7 Markéta Smazová 2020-06-24 19:42:33 UTC
before fix
--------------

>   [root@virt-134 11:04:35 ~]# rpm -q pacemaker
>   pacemaker-1.1.21-4.el7.x86_64

Set cluster-recheck-interval to 2 minutes (120 seconds) and verify.

>   [root@virt-134 11:04:21 ~]# pcs property set cluster-recheck-interval=120
>   [root@virt-134 11:04:38 ~]# pcs property list

>   Cluster Properties:
>    cluster-infrastructure: corosync
>    cluster-name: STSRHTS26768
>    cluster-recheck-interval: 120
>    dc-version: 1.1.21-4.el7-f14e36fd43
>    have-watchdog: false

Configuring an ocf:pacemaker:Dummy resource "dummy1" on node "virt-134" with allow-migrate=true
and the op_sleep option set to 15 seconds.

>   [root@virt-134 11:05:18 ~]# pcs resource create dummy1 ocf:pacemaker:Dummy op_sleep=15 op monitor interval=20 timeout=30  meta allow-migrate=true
>   [root@virt-134 11:16:28 ~]# pcs status

>   Cluster name: STSRHTS26768
>   Stack: corosync
>   Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Tue Jun 16 11:17:32 2020
>   Last change: Tue Jun 16 11:05:29 2020 by root via cibadmin on virt-134

>   3 nodes configured
>   5 resources configured

>   Online: [ virt-133 virt-134 virt-135 ]

>   Full list of resources:

>    fence-virt-133	(stonith:fence_xvm):	Started virt-133
>    fence-virt-134	(stonith:fence_xvm):	Started virt-134
>    fence-virt-135	(stonith:fence_xvm):	Started virt-135
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-133
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-134

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Adding a finite constraint preference for the resource "dummy1" on a node it's not running on ("virt-135"),
so that "dummy1" begins to live-migrate to node "virt-135". Before the migrate_to action can complete on node "virt-134",
where "dummy1" was originally started (i.e. within the configured op_sleep window), the SIGKILL signal is sent to
the resource agent process on node "virt-134" to fail the migration.

>   [root@virt-134 11:17:32 ~]# pcs constraint location dummy1 prefers virt-135=1; ps aux | grep Dummy

>   root      9082  0.0  0.0 115668  1776 ?        S    11:18   0:00 /bin/sh /usr/lib/ocf/resource.d/pacemaker/Dummy migrate_to
>   [...]
>   
>   [root@virt-134 11:18:12 ~]# kill -9 9082

The migration fails; resource "dummy1" is stopped and then started on node "virt-135".

>   [root@virt-134 11:18:26 ~]# pcs status

>   Cluster name: STSRHTS26768
>   Stack: corosync
>   Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Tue Jun 16 11:18:28 2020
>   Last change: Tue Jun 16 11:18:12 2020 by root via cibadmin on virt-134

>   3 nodes configured
>   5 resources configured

>   Online: [ virt-133 virt-134 virt-135 ]

>   Full list of resources:

>    fence-virt-133	(stonith:fence_xvm):	Started virt-133
>    fence-virt-134	(stonith:fence_xvm):	Started virt-134
>    fence-virt-135	(stonith:fence_xvm):	Started virt-135
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-133
>    dummy1	(ocf::pacemaker:Dummy):	FAILED[ virt-135 virt-134 ]

>   Failed Resource Actions:
>   * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
>       last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

>   [root@virt-134 11:18:36 ~]# pcs status

>   Cluster name: STSRHTS26768
>   Stack: corosync
>   Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Tue Jun 16 11:18:50 2020
>   Last change: Tue Jun 16 11:18:12 2020 by root via cibadmin on virt-134

>   3 nodes configured
>   5 resources configured

>   Online: [ virt-133 virt-134 virt-135 ]

>   Full list of resources:

>    fence-virt-133	(stonith:fence_xvm):	Started virt-133
>    fence-virt-134	(stonith:fence_xvm):	Started virt-134
>    fence-virt-135	(stonith:fence_xvm):	Started virt-135
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-133
>    dummy1	(ocf::pacemaker:Dummy):	Starting virt-135

>   Failed Resource Actions:
>   * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
>       last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Adding a stronger constraint preference for node "virt-134", so "dummy1" is live-migrated back to the original node "virt-134".

>   [root@virt-134 11:19:06 ~]# pcs constraint location dummy1 prefers virt-134=2; pcs status

>   Cluster name: STSRHTS26768
>   Stack: corosync
>   Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Tue Jun 16 11:20:45 2020
>   Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134

>   3 nodes configured
>   5 resources configured

>   Online: [ virt-133 virt-134 virt-135 ]

>   Full list of resources:

>    fence-virt-133	(stonith:fence_xvm):	Started virt-133
>    fence-virt-134	(stonith:fence_xvm):	Started virt-134
>    fence-virt-135	(stonith:fence_xvm):	Started virt-135
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-133
>    dummy1	(ocf::pacemaker:Dummy):	Migrating virt-135

>   Failed Resource Actions:
>   * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
>       last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled
       
>   [root@virt-134 11:21:10 ~]# pcs status

>   Cluster name: STSRHTS26768
>   Stack: corosync
>   Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Tue Jun 16 11:21:31 2020
>   Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134

>   3 nodes configured
>   5 resources configured

>   Online: [ virt-133 virt-134 virt-135 ]

>   Full list of resources:

>    fence-virt-133	(stonith:fence_xvm):	Started virt-133
>    fence-virt-134	(stonith:fence_xvm):	Started virt-134
>    fence-virt-135	(stonith:fence_xvm):	Started virt-135
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-133
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-134

>   Failed Resource Actions:
>   * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
>       last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

After 2 minutes, the cluster-recheck-interval elapses and "dummy1" is stopped.

>   [root@virt-134 11:23:49 ~]# pcs status

>   Cluster name: STSRHTS26768
>   Stack: corosync
>   Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Tue Jun 16 11:24:03 2020
>   Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134

>   3 nodes configured
>   5 resources configured

>   Online: [ virt-133 virt-134 virt-135 ]

>   Full list of resources:

>    fence-virt-133	(stonith:fence_xvm):	Started virt-133
>    fence-virt-134	(stonith:fence_xvm):	Started virt-134
>    fence-virt-135	(stonith:fence_xvm):	Started virt-135
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-133
>    dummy1	(ocf::pacemaker:Dummy):	Stopped

>   Failed Resource Actions:
>   * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
>       last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled
  
Resource "dummy1" is then restarted after another 2 minutes at the next 'cluster-recheck-interval'.

>   [root@virt-134 11:26:21 ~]# pcs status

>   Cluster name: STSRHTS26768
>   Stack: corosync
>   Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Tue Jun 16 11:26:33 2020
>   Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134

>   3 nodes configured
>   5 resources configured

>   Online: [ virt-133 virt-134 virt-135 ]

>   Full list of resources:

>    fence-virt-133	(stonith:fence_xvm):	Started virt-133
>    fence-virt-134	(stonith:fence_xvm):	Started virt-134
>    fence-virt-135	(stonith:fence_xvm):	Started virt-135
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-133
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-134 (Monitoring)

>   Failed Resource Actions:
>   * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
>       last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

At the next 'cluster-recheck-interval', "dummy1" is stopped again. This repeats indefinitely unless 'pcs resource cleanup' is run.
  
>   [root@virt-134 11:28:13 ~]# pcs status

>   Cluster name: STSRHTS26768
>   Stack: corosync
>   Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Tue Jun 16 11:31:25 2020
>   Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134

>   3 nodes configured
>   5 resources configured

>   Online: [ virt-133 virt-134 virt-135 ]

>   Full list of resources:

>    fence-virt-133	(stonith:fence_xvm):	Started virt-133
>    fence-virt-134	(stonith:fence_xvm):	Started virt-134
>    fence-virt-135	(stonith:fence_xvm):	Started virt-135
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-133
>    dummy1	(ocf::pacemaker:Dummy):	FAILED virt-134

>   Failed Resource Actions:
>   * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
>       last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


Log excerpts:

Changing location constraint and failing the migration:

>   Jun 16 11:18:12 [7684] virt-134.cluster-qe.lab.eng.brq.redhat.com        cib:     info: cib_perform_op:	++ /cib/configuration/constraints:  <rsc_location id="location-dummy1-virt-135-1" node="virt-135" rsc="dummy1" score="1"/>
>   Jun 16 11:18:12 [7685] virt-134.cluster-qe.lab.eng.brq.redhat.com stonith-ng:     info: update_cib_stonith_devices_v2:	Updating device list from the cib: create constraints
>   Jun 16 11:18:12 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy1	(ocf::pacemaker:Dummy):	Started virt-134
>   Jun 16 11:18:12 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: RecurringOp:	 Start recurring monitor (20s) for dummy1 on virt-135
>   Jun 16 11:18:12 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:   notice: LogAction:	 * Migrate    dummy1             ( virt-134 -> virt-135 )  
>   Jun 16 11:18:12 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating migrate_to operation dummy1_migrate_to_0 locally on virt-134 | action 17
>   Jun 16 11:18:12  Dummy(dummy1)[9082]:    INFO: Migrating dummy1 to virt-135.
>   [...]
>   Jun 16 11:18:26 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:  warning: operation_finished:	dummy1_migrate_to_0:9082 - terminated with signal 9
>   Jun 16 11:18:26 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy1 action:migrate_to call_id:49 pid:9082 exit-code:1 exec-time:13491ms queue-time:0ms
>   Jun 16 11:18:26 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:    error: process_lrm_event:	Result of migrate_to operation for dummy1 on virt-134: Error | call=49 key=dummy1_migrate_to_0 confirmed=true status=4 cib-update=104
>   Jun 16 11:18:26 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:  warning: status_from_rc:	Action 17 (dummy1_migrate_to_0) on virt-134 failed (target: 0 vs. rc: 1): Error
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-133 is online
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-134 is online
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-135 is online
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:  warning: unpack_rsc_op_failure:	Processing failed migrate_to of dummy1 on virt-134: unknown error | rc=1
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy1	(ocf::pacemaker:Dummy):	FAILED
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:		1 : virt-135
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:		2 : virt-134
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:    error: native_create_actions:	Resource dummy1 is active on 2 nodes (attempting recovery)
>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: RecurringOp:	 Start recurring monitor (20s) for dummy1 on virt-135

Resource "dummy1" is stopped and recovered on node "virt-135".

>   Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:   notice: LogAction:	 * Recover    dummy1             (             virt-135 )  
>   Jun 16 11:18:26 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy1_stop_0 on virt-135 | action 15
>   Jun 16 11:18:26 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy1_stop_0 locally on virt-134 | action 5
>   [...]
>   Jun 16 11:18:41 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_stop_0 (15) confirmed on virt-135 (rc=0)
>   Jun 16 11:18:41 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy1 action:stop call_id:52 pid:9100 exit-code:0 exec-time:15110ms queue-time:0ms
>   Jun 16 11:18:41 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: process_lrm_event:	Result of stop operation for dummy1 on virt-134: 0 (ok) | call=52 key=dummy1_stop_0 confirmed=true cib-update=107
>   Jun 16 11:18:41 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_stop_0 (5) confirmed on virt-134 (rc=0)
>   Jun 16 11:18:41 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating start operation dummy1_start_0 on virt-135 | action 16
>   [...]
>   Jun 16 11:18:56 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_start_0 (16) confirmed on virt-135 (rc=0)
>   Jun 16 11:18:56 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating monitor operation dummy1_monitor_20000 on virt-135 | action 17
>   [...]
>   Jun 16 11:19:11 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_monitor_20000 (17) confirmed on virt-135 (rc=0)

Adding a stronger constraint preference for node "virt-134"; "dummy1" is live-migrated back to the original node "virt-134".

>   Jun 16 11:20:44 [7684] virt-134.cluster-qe.lab.eng.brq.redhat.com        cib:     info: cib_perform_op:	++ /cib/configuration/constraints:  <rsc_location id="location-dummy1-virt-134-2" node="virt-134" rsc="dummy1" score="2"/>
>   Jun 16 11:20:44 [7685] virt-134.cluster-qe.lab.eng.brq.redhat.com stonith-ng:     info: update_cib_stonith_devices_v2:	Updating device list from the cib: create constraints
>   Jun 16 11:20:44 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-133 is online
>   Jun 16 11:20:44 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-134 is online
>   Jun 16 11:20:44 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-135 is online
>   Jun 16 11:20:44 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy1	(ocf::pacemaker:Dummy):	Started virt-135
>   [...]
>   Jun 16 11:20:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: RecurringOp:	 Start recurring monitor (20s) for dummy1 on virt-134
>   Jun 16 11:20:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:   notice: LogAction:	 * Migrate    dummy1             ( virt-135 -> virt-134 )
>   Jun 16 11:20:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating migrate_to operation dummy1_migrate_to_0 on virt-135 | action 17
>   [...]
>   Jun 16 11:21:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_migrate_to_0 (17) confirmed on virt-135 (rc=0)
>   Jun 16 11:21:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating migrate_from operation dummy1_migrate_from_0 locally on virt-134 | action 18
>   Jun 16 11:21:00  Dummy(dummy1)[9241]:    INFO: Migrating dummy1 from virt-135.
>   [...]
>   Jun 16 11:21:15 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy1 action:migrate_from call_id:54 pid:9241 exit-code:0 exec-time:15051ms queue-time:0ms
>   Jun 16 11:21:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: process_lrm_event:	Result of migrate_from operation for dummy1 on virt-134: 0 (ok) | call=54 key=dummy1_migrate_from_0 confirmed=true cib-update=110
>   Jun 16 11:21:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_migrate_from_0 (18) confirmed on virt-134 (rc=0)
>   Jun 16 11:21:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy1_stop_0 on virt-135 | action 14
>   [...]
>   Jun 16 11:21:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_stop_0 (14) confirmed on virt-135 (rc=0)
>   Jun 16 11:21:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating monitor operation dummy1_monitor_20000 locally on virt-134 | action 16
>   Jun 16 11:21:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: process_lrm_event:	Result of monitor operation for dummy1 on virt-134: 0 (ok) | call=56 key=dummy1_monitor_20000 confirmed=false cib-update=112
>   Jun 16 11:21:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_monitor_20000 (16) confirmed on virt-134 (rc=0)

After 2 minutes, the cluster-recheck-interval elapses and "dummy1" is stopped.

>   Jun 16 11:23:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: crm_timer_popped:	PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
>   Jun 16 11:23:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-133 is online
>   Jun 16 11:23:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-134 is online
>   Jun 16 11:23:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-135 is online
>   Jun 16 11:23:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy1	(ocf::pacemaker:Dummy):	Started virt-134
>   Jun 16 11:23:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy1_stop_0 locally on virt-134 | action 14
>   Jun 16 11:23:45 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: cancel_recurring_action:	Cancelling ocf operation dummy1_monitor_20000
>   Jun 16 11:23:45 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_execute:	executing - rsc:dummy1 action:stop call_id:59
>   Jun 16 11:23:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: process_lrm_event:	Result of monitor operation for dummy1 on virt-134: Cancelled | call=56 key=dummy1_monitor_20000 confirmed=true
>   [...]
>   Jun 16 11:24:00 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy1 action:stop call_id:59 pid:9362 exit-code:0 exec-time:15116ms queue-time:0ms
>   Jun 16 11:24:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: process_lrm_event:	Result of stop operation for dummy1 on virt-134: 0 (ok) | call=59 key=dummy1_stop_0 confirmed=true cib-update=115
>   Jun 16 11:24:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_stop_0 (14) confirmed on virt-134 (rc=0)

Resource "dummy1" is then restarted after another 2 minutes at the next 'cluster-recheck-interval'.

>   Jun 16 11:26:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: crm_timer_popped:	PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
>   Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-133 is online
>   Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-134 is online
>   Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-135 is online
>   Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy1	(ocf::pacemaker:Dummy):	Stopped
>   Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: RecurringOp:	 Start recurring monitor (20s) for dummy1 on virt-134
>   Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:   notice: LogAction:	 * Restart    dummy1             (             virt-134 )  
>   Jun 16 11:26:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy1_stop_0 locally on virt-134 | action 13
>   Jun 16 11:26:00 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_execute:	executing - rsc:dummy1 action:stop call_id:62
>   [...]
>   Jun 16 11:26:15 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy1 action:stop call_id:62 pid:9868 exit-code:0 exec-time:15043ms queue-time:0ms
>   Jun 16 11:26:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: process_lrm_event:	Result of stop operation for dummy1 on virt-134: 0 (ok) | call=62 key=dummy1_stop_0 confirmed=true cib-update=118
>   Jun 16 11:26:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_stop_0 (13) confirmed on virt-134 (rc=0)
>   Jun 16 11:26:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating start operation dummy1_start_0 locally on virt-134 | action 14
>   Jun 16 11:26:15 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_execute:	executing - rsc:dummy1 action:start call_id:64
>   [...]
>   Jun 16 11:26:30 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy1 action:start call_id:64 pid:9898 exit-code:0 exec-time:15104ms queue-time:0ms
>   Jun 16 11:26:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: process_lrm_event:	Result of start operation for dummy1 on virt-134: 0 (ok) | call=64 key=dummy1_start_0 confirmed=true cib-update=120
>   Jun 16 11:26:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_start_0 (14) confirmed on virt-134 (rc=0)
>   Jun 16 11:26:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating monitor operation dummy1_monitor_20000 locally on virt-134 | action 15
>   [...]
>   Jun 16 11:26:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: process_lrm_event:	Result of monitor operation for dummy1 on virt-134: 0 (ok) | call=66 key=dummy1_monitor_20000 confirmed=false cib-update=122
>   Jun 16 11:26:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_monitor_20000 (15) confirmed on virt-134 (rc=0)

At the next 'cluster-recheck-interval', "dummy1" is stopped again. This repeats indefinitely unless 'pcs resource cleanup' is run.

>   Jun 16 11:28:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: crm_timer_popped:	PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
>   Jun 16 11:28:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-133 is online
>   Jun 16 11:28:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-134 is online
>   Jun 16 11:28:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-135 is online
>   Jun 16 11:28:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy1	(ocf::pacemaker:Dummy):	Started virt-134
>   Jun 16 11:28:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy1_stop_0 locally on virt-134 | action 14
>   Jun 16 11:28:45 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: cancel_recurring_action:	Cancelling ocf operation dummy1_monitor_20000
>   Jun 16 11:28:45 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_execute:	executing - rsc:dummy1 action:stop call_id:69
>   Jun 16 11:28:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: process_lrm_event:	Result of monitor operation for dummy1 on virt-134: Cancelled | call=66 key=dummy1_monitor_20000 confirmed=true
>   [...]
>   Jun 16 11:29:01 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy1 action:stop call_id:69 pid:10194 exit-code:0 exec-time:15104ms queue-time:0ms
>   Jun 16 11:29:01 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: process_lrm_event:	Result of stop operation for dummy1 on virt-134: 0 (ok) | call=69 key=dummy1_stop_0 confirmed=true cib-update=125
>   Jun 16 11:29:01 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy1_stop_0 (14) confirmed on virt-134 (rc=0)



after fix
-----------

>   [root@virt-039 20:14:26 ~]# rpm -q pacemaker
>   pacemaker-1.1.23-1.el7.x86_64

Set cluster-recheck-interval to 2 minutes (120 seconds) and verify.

>   [root@virt-039 20:14:39 ~]# pcs property set cluster-recheck-interval=120; pcs property list

>   Cluster Properties:
>    cluster-infrastructure: corosync
>    cluster-name: STSRHTS21922
>    cluster-recheck-interval: 120
>    dc-version: 1.1.23-1.el7-9acf116022
>    have-watchdog: false

Configure an ocf:pacemaker:Dummy resource "dummy2" on node "virt-039" with allow-migrate=true 
and set its op_sleep option to 15 seconds, so that each agent operation (including migrate_to) takes 15 seconds to complete.

>   [root@virt-039 20:14:47 ~]# pcs resource create dummy2 ocf:pacemaker:Dummy op_sleep=15 op monitor interval=20 timeout=30  meta allow-migrate=true
>   [root@virt-039 20:15:05 ~]# pcs status

>   Cluster name: STSRHTS21922
>   Stack: corosync
>   Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
>   Last updated: Tue Jun 23 20:15:54 2020
>   Last change: Tue Jun 23 20:15:05 2020 by root via cibadmin on virt-039

>   3 nodes configured
>   5 resource instances configured

>   Online: [ virt-024 virt-039 virt-041 ]

>   Full list of resources:

>    fence-virt-024	(stonith:fence_xvm):	Started virt-024
>    fence-virt-039	(stonith:fence_xvm):	Started virt-039
>    fence-virt-041	(stonith:fence_xvm):	Started virt-041
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-024
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-039

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Add a weak location constraint (score=1) preferring node "virt-041", where "dummy2" is not currently running, 
so that "dummy2" begins to live-migrate to "virt-041". Before the migrate_to action can complete on the source 
node "virt-039" (i.e. within the 15-second op_sleep window), send SIGKILL to the resource agent process 
on node "virt-039" to make the migration fail.

>   [root@virt-039 20:15:54 ~]# pcs constraint location dummy2 prefers virt-041=1
>   [root@virt-039 20:16:27 ~]# ps aux | grep Dummy

>   root      2795  0.5  0.0 115672  1888 ?        S    20:16   0:00 /bin/sh /usr/lib/ocf/resource.d/pacemaker/Dummy migrate_to
>   [...]
>   
>   [root@virt-039 20:16:28 ~]# kill -9 2795
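For repeatability, the manual PID lookup and kill above can be wrapped in a small helper. This is only a sketch: the `find_migrate_pid` and `kill_migrate_to` names and the awk pattern are illustrative, not part of the original verification steps, and it assumes `ps aux`-style output with the PID in column 2.

```shell
# Extract the PID of the Dummy resource agent's in-flight migrate_to
# process from `ps aux`-style input (PID is column 2).
find_migrate_pid() {
  awk '/resource\.d\/pacemaker\/Dummy migrate_to/ { print $2; exit }'
}

# SIGKILL that process to inject the migration failure.
kill_migrate_to() {
  pid="$(ps aux | find_migrate_pid)"
  [ -n "$pid" ] && kill -9 "$pid"
}
```

Calling `kill_migrate_to` during the op_sleep window has the same effect as the manual `kill -9 2795` above.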
  
The migration fails; "dummy2" is stopped on both nodes and then started on node "virt-041" (a full stop/start is the expected recovery after a failed migration).
  
>   [root@virt-039 20:16:35 ~]# pcs status

>   Cluster name: STSRHTS21922
>   Stack: corosync
>   Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
>   Last updated: Tue Jun 23 20:17:03 2020
>   Last change: Tue Jun 23 20:16:27 2020 by root via cibadmin on virt-039

>   3 nodes configured
>   5 resource instances configured

>   Online: [ virt-024 virt-039 virt-041 ]

>   Full list of resources:

>    fence-virt-024	(stonith:fence_xvm):	Started virt-024
>    fence-virt-039	(stonith:fence_xvm):	Started virt-039
>    fence-virt-041	(stonith:fence_xvm):	Started virt-041
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-024
>    dummy2	(ocf::pacemaker:Dummy):	Starting virt-041

>   Failed Resource Actions:
>   * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
>       last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled
  
>   [root@virt-039 20:17:03 ~]# pcs status

>   Cluster name: STSRHTS21922
>   Stack: corosync
>   Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
>   Last updated: Tue Jun 23 20:17:41 2020
>   Last change: Tue Jun 23 20:16:27 2020 by root via cibadmin on virt-039

>   3 nodes configured
>   5 resource instances configured

>   Online: [ virt-024 virt-039 virt-041 ]

>   Full list of resources:

>    fence-virt-024	(stonith:fence_xvm):	Started virt-024
>    fence-virt-039	(stonith:fence_xvm):	Started virt-039
>    fence-virt-041	(stonith:fence_xvm):	Started virt-041
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-024
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-041

>   Failed Resource Actions:
>   * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
>       last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled
    
Add a stronger location constraint (score=2) preferring node "virt-039" so that "dummy2" is live-migrated back to the original node "virt-039".
    
>   [root@virt-039 20:17:42 ~]# pcs constraint location dummy2 prefers virt-039=2; pcs status

>   Cluster name: STSRHTS21922
>   Stack: corosync
>   Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
>   Last updated: Tue Jun 23 20:18:11 2020
>   Last change: Tue Jun 23 20:18:11 2020 by root via cibadmin on virt-039

>   3 nodes configured
>   5 resource instances configured

>   Online: [ virt-024 virt-039 virt-041 ]

>   Full list of resources:

>    fence-virt-024	(stonith:fence_xvm):	Started virt-024
>    fence-virt-039	(stonith:fence_xvm):	Started virt-039
>    fence-virt-041	(stonith:fence_xvm):	Started virt-041
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-024
>    dummy2	(ocf::pacemaker:Dummy):	Migrating virt-041

>   Failed Resource Actions:
>   * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
>       last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled
>     
>   [root@virt-039 20:18:11 ~]# pcs status
>
>   Cluster name: STSRHTS21922
>   Stack: corosync
>   Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
>   Last updated: Tue Jun 23 20:19:12 2020
>   Last change: Tue Jun 23 20:18:11 2020 by root via cibadmin on virt-039

>   3 nodes configured
>   5 resource instances configured

>   Online: [ virt-024 virt-039 virt-041 ]

>   Full list of resources:

>    fence-virt-024	(stonith:fence_xvm):	Started virt-024
>    fence-virt-039	(stonith:fence_xvm):	Started virt-039
>    fence-virt-041	(stonith:fence_xvm):	Started virt-041
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-024
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-039

>   Failed Resource Actions:
>   * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
>       last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled
    
After 2 minutes the cluster-recheck-interval elapses and "dummy2" is still running on the original node "virt-039"; the recheck no longer triggers the spurious stop/start cycle that this bug describes.
  
>   [root@virt-039 20:20:34 ~]# pcs status

>   Cluster name: STSRHTS21922
>   Stack: corosync
>   Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
>   Last updated: Tue Jun 23 20:21:02 2020
>   Last change: Tue Jun 23 20:18:11 2020 by root via cibadmin on virt-039

>   3 nodes configured
>   5 resource instances configured

>   Online: [ virt-024 virt-039 virt-041 ]

>   Full list of resources:

>    fence-virt-024	(stonith:fence_xvm):	Started virt-024
>    fence-virt-039	(stonith:fence_xvm):	Started virt-039
>    fence-virt-041	(stonith:fence_xvm):	Started virt-041
>    dummy1	(ocf::pacemaker:Dummy):	Started virt-024
>    dummy2	(ocf::pacemaker:Dummy):	Started virt-039

>   Failed Resource Actions:
>   * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
>       last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled
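To automate this check, one could poll the status output until the recheck interval has passed and assert the resource stays on the expected node. A minimal sketch (the `wait_for_started_on` helper and its argument convention are assumptions, not part of the original test):

```shell
# Poll a status command until its output shows dummy2 "Started" on the
# expected node, or fail after roughly `timeout` seconds.
# Illustrative sketch; on a real cluster the command would be 'pcs status'.
wait_for_started_on() {
  node="$1"; timeout="$2"; cmd="$3"
  while [ "$timeout" -gt 0 ]; do
    if eval "$cmd" | grep -q "dummy2.*Started $node"; then
      return 0
    fi
    sleep 5
    timeout=$((timeout - 5))
  done
  return 1
}
```

After the constraint change this would be invoked as, for example, `wait_for_started_on virt-039 150 'pcs status'` to cover the 120-second recheck interval.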
     
Log excerpts:     

Changing the location constraint and failing the migration:

>   Jun 23 20:16:27 [1624] virt-039.cluster-qe.lab.eng.brq.redhat.com        cib:     info: cib_perform_op:	++ /cib/configuration/constraints:  <rsc_location id="location-dummy2-virt-041-1" node="virt-041" rsc="dummy2" score="1"/>
>   Jun 23 20:16:27 [1625] virt-039.cluster-qe.lab.eng.brq.redhat.com stonith-ng:     info: update_cib_stonith_devices_v2:	Updating device list from the cib: create constraints
>   Jun 23 20:16:27 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy2	(ocf::pacemaker:Dummy):	Started virt-039
>   Jun 23 20:16:27 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: RecurringOp:	 Start recurring monitor (20s) for dummy2 on virt-041
>   Jun 23 20:16:27 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:   notice: LogAction:	 * Migrate    dummy2             ( virt-039 -> virt-041 )  
>   Jun 23 20:16:27 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating migrate_to operation dummy2_migrate_to_0 locally on virt-039 | action 17
>   Jun 23 20:16:27 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: cancel_recurring_action:	Cancelling ocf operation dummy2_monitor_20000
>   Jun 23 20:16:27 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_execute:	executing - rsc:dummy2 action:migrate_to call_id:54
>   Jun 23 20:16:27 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: process_lrm_event:	Result of monitor operation for dummy2 on virt-039: Cancelled | call=51 key=dummy2_monitor_20000 confirmed=true
>   [...]
>   Jun 23 20:16:35 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com       lrmd:  warning: operation_finished:	dummy2_migrate_to_0:2795 - terminated with signal 9
>   Jun 23 20:16:35 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy2 action:migrate_to call_id:54 pid:2795 exit-code:1 exec-time:8603ms queue-time:0ms
>   Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:    error: process_lrm_event:	Result of migrate_to operation for dummy2 on virt-039: Error | call=54 key=dummy2_migrate_to_0 confirmed=true status=4 cib-update=101
>   Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:  warning: status_from_rc:	Action 17 (dummy2_migrate_to_0) on virt-039 failed (target: 0 vs. rc: 1): Error
>   Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_migrate_to_0 (17) confirmed on virt-039 (rc=1)
>   Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: process_graph_event:	Detected action (15.17) dummy2_migrate_to_0.54=unknown error: failed
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-039 is online
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-024 is online
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-041 is online
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:  warning: unpack_rsc_op_failure:	Processing failed migrate_to of dummy2 on virt-039: unknown error | rc=1
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy2	(ocf::pacemaker:Dummy):	FAILED
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:		1 : virt-041
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:		2 : virt-039
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:    error: native_create_actions:	Resource dummy2 is active on 2 nodes (attempting recovery)
>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: RecurringOp:	 Start recurring monitor (20s) for dummy2 on virt-041

Resource "dummy2" is stopped on both nodes and recovered (started) on node "virt-041".

>   Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:   notice: LogAction:	 * Recover    dummy2             (             virt-041 )  
>   Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy2_stop_0 on virt-041 | action 15
>   Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy2_stop_0 locally on virt-039 | action 3
>   Jun 23 20:16:35 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_execute:	executing - rsc:dummy2 action:stop call_id:57
>   [...]
>   Jun 23 20:16:50 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_stop_0 (15) confirmed on virt-041 (rc=0)
>   Jun 23 20:16:50 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy2 action:stop call_id:57 pid:2813 exit-code:0 exec-time:15108ms queue-time:0ms
>   Jun 23 20:16:50 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: process_lrm_event:	Result of stop operation for dummy2 on virt-039: 0 (ok) | call=57 key=dummy2_stop_0 confirmed=true cib-update=104
>   Jun 23 20:16:50 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_stop_0 (3) confirmed on virt-039 (rc=0)
>   Jun 23 20:16:50 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating start operation dummy2_start_0 on virt-041 | action 16
>   [...]
>   Jun 23 20:17:05 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_start_0 (16) confirmed on virt-041 (rc=0)
>   Jun 23 20:17:05 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating monitor operation dummy2_monitor_20000 on virt-041 | action 17
>   [...]
>   Jun 23 20:17:21 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_monitor_20000 (17) confirmed on virt-041 (rc=0)


Add a stronger location constraint (score=2) preferring node "virt-039"; "dummy2" is live-migrated back to the original node "virt-039".

>   Jun 23 20:18:11 [1624] virt-039.cluster-qe.lab.eng.brq.redhat.com        cib:     info: cib_perform_op:	++ /cib/configuration/constraints:  <rsc_location id="location-dummy2-virt-039-2" node="virt-039" rsc="dummy2" score="2"/>
>   Jun 23 20:18:11 [1625] virt-039.cluster-qe.lab.eng.brq.redhat.com stonith-ng:     info: update_cib_stonith_devices_v2:	Updating device list from the cib: create constraints
>   Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-039 is online
>   Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-024 is online
>   Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-041 is online
>   Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:  warning: unpack_rsc_op_failure:	Processing failed migrate_to of dummy2 on virt-039: unknown error | rc=1
>   Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy2	(ocf::pacemaker:Dummy):	Started virt-041
>   Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: RecurringOp:	 Start recurring monitor (20s) for dummy2 on virt-039
>   Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:   notice: LogAction:	 * Migrate    dummy2             ( virt-041 -> virt-039 )  
>   Jun 23 20:18:11 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating migrate_to operation dummy2_migrate_to_0 on virt-041 | action 17
>   [...]
>   Jun 23 20:18:26 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_migrate_to_0 (17) confirmed on virt-041 (rc=0)
>   Jun 23 20:18:26 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating migrate_from operation dummy2_migrate_from_0 locally on virt-039 | action 18
>   Jun 23 20:18:26 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_execute:	executing - rsc:dummy2 action:migrate_from call_id:59
>   [...]
>   Jun 23 20:18:41 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com       lrmd:     info: log_finished:	finished - rsc:dummy2 action:migrate_from call_id:59 pid:2891 exit-code:0 exec-time:15051ms queue-time:0ms
>   Jun 23 20:18:41 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: process_lrm_event:	Result of migrate_from operation for dummy2 on virt-039: 0 (ok) | call=59 key=dummy2_migrate_from_0 confirmed=true cib-update=107
>   Jun 23 20:18:41 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_migrate_from_0 (18) confirmed on virt-039 (rc=0)
>   Jun 23 20:18:41 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating stop operation dummy2_stop_0 on virt-041 | action 14
>   [...]
>   Jun 23 20:18:56 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_stop_0 (14) confirmed on virt-041 (rc=0)
>   Jun 23 20:18:56 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:   notice: te_rsc_command:	Initiating monitor operation dummy2_monitor_20000 locally on virt-039 | action 16
>   [...]
>   Jun 23 20:19:11 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: process_lrm_event:	Result of monitor operation for dummy2 on virt-039: 0 (ok) | call=61 key=dummy2_monitor_20000 confirmed=false cib-update=109
>   Jun 23 20:19:11 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: match_graph_event:	Action dummy2_monitor_20000 (16) confirmed on virt-039 (rc=0)
 
After 2 minutes, the cluster-recheck-interval elapses and "dummy2" is still running on the original node "virt-039".

>   Jun 23 20:21:11 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: crm_timer_popped:	PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-039 is online
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-024 is online
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: determine_online_status:	Node virt-041 is online
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	fence-virt-024	(stonith:fence_xvm):	Started virt-024
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	fence-virt-039	(stonith:fence_xvm):	Started virt-039
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	fence-virt-041	(stonith:fence_xvm):	Started virt-041
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy1	(ocf::pacemaker:Dummy):	Started virt-024
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: common_print:	dummy2	(ocf::pacemaker:Dummy):	Started virt-039
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: LogActions:	Leave   fence-virt-024	(Started virt-024)
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: LogActions:	Leave   fence-virt-039	(Started virt-039)
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: LogActions:	Leave   fence-virt-041	(Started virt-041)
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: LogActions:	Leave   dummy1	(Started virt-024)
>   Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com    pengine:     info: LogActions:	Leave   dummy2	(Started virt-039)


Marking verified in pacemaker-1.1.23-1.el7.

Comment 9 errata-xmlrpc 2020-09-29 20:03:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3951