Bug 1757951
| Summary: | A VirtualDomain resource kept stopping and starting after a failed migration of resource | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Shane Bradley <sbradley> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.6 | CC: | cluster-maint, msmazova, phagara, sbradley |
| Target Milestone: | pre-dev-freeze | | |
| Target Release: | 7.9 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.22-1.el7 | Doc Type: | Bug Fix |
| Doc Text: | Cause: Pacemaker incorrectly considered a resource as being in a "dangling migration" on a node if a migration from that node failed, but a later migration back to, or start on, that node succeeded. Consequence: A resource in that situation would be unnecessarily stopped at one cluster-recheck-interval and started at the next, looping indefinitely. Fix: Pacemaker properly considers such a resource as running successfully on the node. Result: The resource is not restarted unnecessarily. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-29 20:03:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
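The faulty check described in the Doc Text can be sketched as follows. This is a simplified illustration in Python, not Pacemaker's actual C code; the event model and the `is_dangling` helper are assumptions made for the example. The idea is that a failed outgoing migration should only leave the resource "dangling" on the source node if no later successful start or migration back to that node has occurred.

```python
# Simplified sketch (NOT Pacemaker's real implementation) of the
# "dangling migration" decision from the Doc Text above.
def is_dangling(history):
    """history: oldest-to-newest list of (action, success) tuples
    recorded for the resource on a single node."""
    dangling = False
    for action, success in history:
        if action == "migrate_to" and not success:
            dangling = True   # failed outgoing migration from this node
        elif action in ("start", "migrate_from") and success:
            dangling = False  # resource proven active here again (the fix)
        elif action == "stop" and success:
            dangling = False  # a clean stop also clears the state
    return dangling

# Failed migration followed by a successful migration back:
events = [("migrate_to", False), ("migrate_from", True)]
print(is_dangling(events))  # → False (the buggy logic reported True)
```

The bug was effectively that the "proven active again" branch did not clear the dangling state, so the stale failed `migrate_to` record kept triggering a stop at every re-evaluation.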
Comment 1
Ken Gaillot
2019-10-02 20:53:44 UTC
In case anyone runs into this before a fix is available, the workaround is to clean up the failed migration (`pcs resource cleanup`).

before fix
--------------

> [root@virt-134 11:04:35 ~]# rpm -q pacemaker
> pacemaker-1.1.21-4.el7.x86_64

Set cluster-recheck-interval to 2 minutes (120 seconds) and verify.

> [root@virt-134 11:04:21 ~]# pcs property set cluster-recheck-interval=120
> [root@virt-134 11:04:38 ~]# pcs property list
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: STSRHTS26768
> cluster-recheck-interval: 120
> dc-version: 1.1.21-4.el7-f14e36fd43
> have-watchdog: false

Configure an ocf:pacemaker:Dummy resource "dummy1" on node "virt-134" with allow-migrate=true and the op_sleep option set to 15 seconds.

> [root@virt-134 11:05:18 ~]# pcs resource create dummy1 ocf:pacemaker:Dummy op_sleep=15 op monitor interval=20 timeout=30 meta allow-migrate=true
> [root@virt-134 11:16:28 ~]# pcs status
> Cluster name: STSRHTS26768
> Stack: corosync
> Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Tue Jun 16 11:17:32 2020
> Last change: Tue Jun 16 11:05:29 2020 by root via cibadmin on virt-134
> 3 nodes configured
> 5 resources configured
> Online: [ virt-133 virt-134 virt-135 ]
> Full list of resources:
> fence-virt-133 (stonith:fence_xvm): Started virt-133
> fence-virt-134 (stonith:fence_xvm): Started virt-134
> fence-virt-135 (stonith:fence_xvm): Started virt-135
> dummy2 (ocf::pacemaker:Dummy): Started virt-133
> dummy1 (ocf::pacemaker:Dummy): Started virt-134
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

Add a finite location constraint preference for the resource "dummy1" on a node it is not running on, "virt-135", so that "dummy1" begins to live-migrate to node "virt-135".
Before the migrate_to action can complete on node "virt-134", where "dummy1" originally started (i.e. within the configured op_sleep time limit), the SIGKILL signal is sent to the resource agent process on node "virt-134" to fail the migration.

> [root@virt-134 11:17:32 ~]# pcs constraint location dummy1 prefers virt-135=1; ps aux | grep Dummy
> root 9082 0.0 0.0 115668 1776 ? S 11:18 0:00 /bin/sh /usr/lib/ocf/resource.d/pacemaker/Dummy migrate_to
> [...]
>
> [root@virt-134 11:18:12 ~]# kill -9 9082

The migration fails; resource "dummy1" is stopped and then started on node "virt-135".

> [root@virt-134 11:18:26 ~]# pcs status
> Cluster name: STSRHTS26768
> Stack: corosync
> Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Tue Jun 16 11:18:28 2020
> Last change: Tue Jun 16 11:18:12 2020 by root via cibadmin on virt-134
> 3 nodes configured
> 5 resources configured
> Online: [ virt-133 virt-134 virt-135 ]
> Full list of resources:
> fence-virt-133 (stonith:fence_xvm): Started virt-133
> fence-virt-134 (stonith:fence_xvm): Started virt-134
> fence-virt-135 (stonith:fence_xvm): Started virt-135
> dummy2 (ocf::pacemaker:Dummy): Started virt-133
> dummy1 (ocf::pacemaker:Dummy): FAILED[ virt-135 virt-134 ]
> Failed Resource Actions:
> * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
> last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
> [root@virt-134 11:18:36 ~]# pcs status
> Cluster name: STSRHTS26768
> Stack: corosync
> Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Tue Jun 16 11:18:50 2020
> Last change: Tue Jun 16 11:18:12 2020 by root via cibadmin on virt-134
> 3 nodes configured
> 5 resources configured
> Online: [ virt-133 virt-134 virt-135 ]
> Full list of resources:
> fence-virt-133 (stonith:fence_xvm): Started virt-133
> fence-virt-134 (stonith:fence_xvm): Started virt-134
> fence-virt-135 (stonith:fence_xvm): Started virt-135
> dummy2 (ocf::pacemaker:Dummy): Started virt-133
> dummy1 (ocf::pacemaker:Dummy): Starting virt-135
> Failed Resource Actions:
> * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
> last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

Add a stronger constraint preference for node "virt-134" so that "dummy1" is live-migrated back to the original node "virt-134".

> [root@virt-134 11:19:06 ~]# pcs constraint location dummy1 prefers virt-134=2; pcs status
> Cluster name: STSRHTS26768
> Stack: corosync
> Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Tue Jun 16 11:20:45 2020
> Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134
> 3 nodes configured
> 5 resources configured
> Online: [ virt-133 virt-134 virt-135 ]
> Full list of resources:
> fence-virt-133 (stonith:fence_xvm): Started virt-133
> fence-virt-134 (stonith:fence_xvm): Started virt-134
> fence-virt-135 (stonith:fence_xvm): Started virt-135
> dummy2 (ocf::pacemaker:Dummy): Started virt-133
> dummy1 (ocf::pacemaker:Dummy): Migrating virt-135
> Failed Resource Actions:
> * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
> last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
> [root@virt-134 11:21:10 ~]# pcs status
> Cluster name: STSRHTS26768
> Stack: corosync
> Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Tue Jun 16 11:21:31 2020
> Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134
> 3 nodes configured
> 5 resources configured
> Online: [ virt-133 virt-134 virt-135 ]
> Full list of resources:
> fence-virt-133 (stonith:fence_xvm): Started virt-133
> fence-virt-134 (stonith:fence_xvm): Started virt-134
> fence-virt-135 (stonith:fence_xvm): Started virt-135
> dummy2 (ocf::pacemaker:Dummy): Started virt-133
> dummy1 (ocf::pacemaker:Dummy): Started virt-134
> Failed Resource Actions:
> * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
> last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

After 2 minutes, the cluster-recheck-interval lapses and "dummy1" is stopped.

> [root@virt-134 11:23:49 ~]# pcs status
> Cluster name: STSRHTS26768
> Stack: corosync
> Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Tue Jun 16 11:24:03 2020
> Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134
> 3 nodes configured
> 5 resources configured
> Online: [ virt-133 virt-134 virt-135 ]
> Full list of resources:
> fence-virt-133 (stonith:fence_xvm): Started virt-133
> fence-virt-134 (stonith:fence_xvm): Started virt-134
> fence-virt-135 (stonith:fence_xvm): Started virt-135
> dummy2 (ocf::pacemaker:Dummy): Started virt-133
> dummy1 (ocf::pacemaker:Dummy): Stopped
> Failed Resource Actions:
> * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
> last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

Resource "dummy1" is then restarted after another 2 minutes, at the next cluster-recheck-interval.
> [root@virt-134 11:26:21 ~]# pcs status
> Cluster name: STSRHTS26768
> Stack: corosync
> Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Tue Jun 16 11:26:33 2020
> Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134
> 3 nodes configured
> 5 resources configured
> Online: [ virt-133 virt-134 virt-135 ]
> Full list of resources:
> fence-virt-133 (stonith:fence_xvm): Started virt-133
> fence-virt-134 (stonith:fence_xvm): Started virt-134
> fence-virt-135 (stonith:fence_xvm): Started virt-135
> dummy2 (ocf::pacemaker:Dummy): Started virt-133
> dummy1 (ocf::pacemaker:Dummy): Started virt-134 (Monitoring)
> Failed Resource Actions:
> * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
> last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

On another cluster-recheck-interval, "dummy1" is stopped again. This repeats indefinitely unless `pcs resource cleanup` is used.
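The indefinite stop/start loop described above can be modeled with a toy timeline. This is an assumption-laden simplification, not Pacemaker's scheduler: each iteration stands for one 120-second cluster-recheck-interval, and the stale failed-migration record flips the resource between Stopped and Started until it is cleaned up.

```python
# Toy model of the pre-fix behavior: while the failed migrate_to record
# remains in the CIB, every recheck flips the resource's state.
def next_state(state, failure_cleaned):
    if failure_cleaned:
        return "Started"  # after `pcs resource cleanup`, state is stable
    return "Stopped" if state == "Started" else "Started"

state = "Started"
timeline = []
for recheck in range(4):  # four consecutive 120 s recheck intervals
    state = next_state(state, failure_cleaned=False)
    timeline.append(state)
print(timeline)  # → ['Stopped', 'Started', 'Stopped', 'Started']
```

With `failure_cleaned=True` (the workaround), the state stays "Started" at every recheck, which matches the "after fix" behavior verified below.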
> [root@virt-134 11:28:13 ~]# pcs status
> Cluster name: STSRHTS26768
> Stack: corosync
> Current DC: virt-134 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Tue Jun 16 11:31:25 2020
> Last change: Tue Jun 16 11:20:44 2020 by root via cibadmin on virt-134
> 3 nodes configured
> 5 resources configured
> Online: [ virt-133 virt-134 virt-135 ]
> Full list of resources:
> fence-virt-133 (stonith:fence_xvm): Started virt-133
> fence-virt-134 (stonith:fence_xvm): Started virt-134
> fence-virt-135 (stonith:fence_xvm): Started virt-135
> dummy2 (ocf::pacemaker:Dummy): Started virt-133
> dummy1 (ocf::pacemaker:Dummy): FAILED virt-134
> Failed Resource Actions:
> * dummy1_migrate_to_0 on virt-134 'unknown error' (1): call=49, status=Error, exitreason='',
> last-rc-change='Tue Jun 16 11:18:12 2020', queued=0ms, exec=13491ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

Log excerpts. Changing the location constraint and failing the migration:

> Jun 16 11:18:12 [7684] virt-134.cluster-qe.lab.eng.brq.redhat.com cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="location-dummy1-virt-135-1" node="virt-135" rsc="dummy1" score="1"/>
> Jun 16 11:18:12 [7685] virt-134.cluster-qe.lab.eng.brq.redhat.com stonith-ng: info: update_cib_stonith_devices_v2: Updating device list from the cib: create constraints
> Jun 16 11:18:12 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy1 (ocf::pacemaker:Dummy): Started virt-134
> Jun 16 11:18:12 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: RecurringOp: Start recurring monitor (20s) for dummy1 on virt-135
> Jun 16 11:18:12 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: notice: LogAction: * Migrate dummy1 ( virt-134 -> virt-135 )
> Jun 16 11:18:12 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating migrate_to operation dummy1_migrate_to_0 locally on virt-134 | action 17
> Jun 16 11:18:12 Dummy(dummy1)[9082]: INFO: Migrating dummy1 to virt-135.
> [...]
> Jun 16 11:18:26 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: warning: operation_finished: dummy1_migrate_to_0:9082 - terminated with signal 9
> Jun 16 11:18:26 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy1 action:migrate_to call_id:49 pid:9082 exit-code:1 exec-time:13491ms queue-time:0ms
> Jun 16 11:18:26 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: error: process_lrm_event: Result of migrate_to operation for dummy1 on virt-134: Error | call=49 key=dummy1_migrate_to_0 confirmed=true status=4 cib-update=104
> Jun 16 11:18:26 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: warning: status_from_rc: Action 17 (dummy1_migrate_to_0) on virt-134 failed (target: 0 vs. rc: 1): Error
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-133 is online
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-134 is online
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-135 is online
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: warning: unpack_rsc_op_failure: Processing failed migrate_to of dummy1 on virt-134: unknown error | rc=1
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy1 (ocf::pacemaker:Dummy): FAILED
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: 1 : virt-135
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: 2 : virt-134
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: error: native_create_actions: Resource dummy1 is active on 2 nodes (attempting recovery)
> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: RecurringOp: Start recurring monitor (20s) for dummy1 on virt-135

Resource "dummy1" is stopped and recovered on node "virt-135".

> Jun 16 11:18:26 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: notice: LogAction: * Recover dummy1 ( virt-135 )
> Jun 16 11:18:26 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy1_stop_0 on virt-135 | action 15
> Jun 16 11:18:26 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy1_stop_0 locally on virt-134 | action 5
> [...]
> Jun 16 11:18:41 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_stop_0 (15) confirmed on virt-135 (rc=0)
> Jun 16 11:18:41 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy1 action:stop call_id:52 pid:9100 exit-code:0 exec-time:15110ms queue-time:0ms
> Jun 16 11:18:41 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: process_lrm_event: Result of stop operation for dummy1 on virt-134: 0 (ok) | call=52 key=dummy1_stop_0 confirmed=true cib-update=107
> Jun 16 11:18:41 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_stop_0 (5) confirmed on virt-134 (rc=0)
> Jun 16 11:18:41 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating start operation dummy1_start_0 on virt-135 | action 16
> [...]
> Jun 16 11:18:56 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_start_0 (16) confirmed on virt-135 (rc=0)
> Jun 16 11:18:56 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating monitor operation dummy1_monitor_20000 on virt-135 | action 17
> [...]
> Jun 16 11:19:11 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_monitor_20000 (17) confirmed on virt-135 (rc=0)

Adding a stronger constraint preference for node "virt-134", "dummy1" is live-migrated back to the original node "virt-134".

> Jun 16 11:20:44 [7684] virt-134.cluster-qe.lab.eng.brq.redhat.com cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="location-dummy1-virt-134-2" node="virt-134" rsc="dummy1" score="2"/>
> Jun 16 11:20:44 [7685] virt-134.cluster-qe.lab.eng.brq.redhat.com stonith-ng: info: update_cib_stonith_devices_v2: Updating device list from the cib: create constraints
> Jun 16 11:20:44 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-133 is online
> Jun 16 11:20:44 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-134 is online
> Jun 16 11:20:44 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-135 is online
> Jun 16 11:20:44 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy1 (ocf::pacemaker:Dummy): Started virt-135
> [...]
> Jun 16 11:20:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: RecurringOp: Start recurring monitor (20s) for dummy1 on virt-134
> Jun 16 11:20:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: notice: LogAction: * Migrate dummy1 ( virt-135 -> virt-134 )
> Jun 16 11:20:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating migrate_to operation dummy1_migrate_to_0 on virt-135 | action 17
> [...]
> Jun 16 11:21:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_migrate_to_0 (17) confirmed on virt-135 (rc=0)
> Jun 16 11:21:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating migrate_from operation dummy1_migrate_from_0 locally on virt-134 | action 18
> Jun 16 11:21:00 Dummy(dummy1)[9241]: INFO: Migrating dummy1 from virt-135.
> [...]
> Jun 16 11:21:15 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy1 action:migrate_from call_id:54 pid:9241 exit-code:0 exec-time:15051ms queue-time:0ms
> Jun 16 11:21:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: process_lrm_event: Result of migrate_from operation for dummy1 on virt-134: 0 (ok) | call=54 key=dummy1_migrate_from_0 confirmed=true cib-update=110
> Jun 16 11:21:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_migrate_from_0 (18) confirmed on virt-134 (rc=0)
> Jun 16 11:21:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy1_stop_0 on virt-135 | action 14
> [...]
> Jun 16 11:21:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_stop_0 (14) confirmed on virt-135 (rc=0)
> Jun 16 11:21:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating monitor operation dummy1_monitor_20000 locally on virt-134 | action 16
> Jun 16 11:21:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: process_lrm_event: Result of monitor operation for dummy1 on virt-134: 0 (ok) | call=56 key=dummy1_monitor_20000 confirmed=false cib-update=112
> Jun 16 11:21:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_monitor_20000 (16) confirmed on virt-134 (rc=0)

After 2 minutes, the cluster-recheck-interval lapses and "dummy1" is stopped.
> Jun 16 11:23:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
> Jun 16 11:23:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-133 is online
> Jun 16 11:23:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-134 is online
> Jun 16 11:23:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-135 is online
> Jun 16 11:23:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy1 (ocf::pacemaker:Dummy): Started virt-134
> Jun 16 11:23:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy1_stop_0 locally on virt-134 | action 14
> Jun 16 11:23:45 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: cancel_recurring_action: Cancelling ocf operation dummy1_monitor_20000
> Jun 16 11:23:45 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_execute: executing - rsc:dummy1 action:stop call_id:59
> Jun 16 11:23:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: process_lrm_event: Result of monitor operation for dummy1 on virt-134: Cancelled | call=56 key=dummy1_monitor_20000 confirmed=true
> [...]
> Jun 16 11:24:00 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy1 action:stop call_id:59 pid:9362 exit-code:0 exec-time:15116ms queue-time:0ms
> Jun 16 11:24:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: process_lrm_event: Result of stop operation for dummy1 on virt-134: 0 (ok) | call=59 key=dummy1_stop_0 confirmed=true cib-update=115
> Jun 16 11:24:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_stop_0 (14) confirmed on virt-134 (rc=0)

Resource "dummy1" is then restarted after another 2 minutes, at the next cluster-recheck-interval.

> Jun 16 11:26:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
> Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-133 is online
> Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-134 is online
> Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-135 is online
> Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy1 (ocf::pacemaker:Dummy): Stopped
> Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: RecurringOp: Start recurring monitor (20s) for dummy1 on virt-134
> Jun 16 11:26:00 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: notice: LogAction: * Restart dummy1 ( virt-134 )
> Jun 16 11:26:00 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy1_stop_0 locally on virt-134 | action 13
> Jun 16 11:26:00 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_execute: executing - rsc:dummy1 action:stop call_id:62
> [...]
> Jun 16 11:26:15 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy1 action:stop call_id:62 pid:9868 exit-code:0 exec-time:15043ms queue-time:0ms
> Jun 16 11:26:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: process_lrm_event: Result of stop operation for dummy1 on virt-134: 0 (ok) | call=62 key=dummy1_stop_0 confirmed=true cib-update=118
> Jun 16 11:26:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_stop_0 (13) confirmed on virt-134 (rc=0)
> Jun 16 11:26:15 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating start operation dummy1_start_0 locally on virt-134 | action 14
> Jun 16 11:26:15 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_execute: executing - rsc:dummy1 action:start call_id:64
> [...]
> Jun 16 11:26:30 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy1 action:start call_id:64 pid:9898 exit-code:0 exec-time:15104ms queue-time:0ms
> Jun 16 11:26:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: process_lrm_event: Result of start operation for dummy1 on virt-134: 0 (ok) | call=64 key=dummy1_start_0 confirmed=true cib-update=120
> Jun 16 11:26:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_start_0 (14) confirmed on virt-134 (rc=0)
> Jun 16 11:26:30 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating monitor operation dummy1_monitor_20000 locally on virt-134 | action 15
> [...]
> Jun 16 11:26:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: process_lrm_event: Result of monitor operation for dummy1 on virt-134: 0 (ok) | call=66 key=dummy1_monitor_20000 confirmed=false cib-update=122
> Jun 16 11:26:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_monitor_20000 (15) confirmed on virt-134 (rc=0)

On another cluster-recheck-interval, "dummy1" is stopped again. This repeats indefinitely unless `pcs resource cleanup` is used.

> Jun 16 11:28:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
> Jun 16 11:28:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-133 is online
> Jun 16 11:28:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-134 is online
> Jun 16 11:28:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-135 is online
> Jun 16 11:28:45 [7688] virt-134.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy1 (ocf::pacemaker:Dummy): Started virt-134
> Jun 16 11:28:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy1_stop_0 locally on virt-134 | action 14
> Jun 16 11:28:45 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: cancel_recurring_action: Cancelling ocf operation dummy1_monitor_20000
> Jun 16 11:28:45 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_execute: executing - rsc:dummy1 action:stop call_id:69
> Jun 16 11:28:45 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: process_lrm_event: Result of monitor operation for dummy1 on virt-134: Cancelled | call=66 key=dummy1_monitor_20000 confirmed=true
> [...]
> Jun 16 11:29:01 [7686] virt-134.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy1 action:stop call_id:69 pid:10194 exit-code:0 exec-time:15104ms queue-time:0ms
> Jun 16 11:29:01 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: notice: process_lrm_event: Result of stop operation for dummy1 on virt-134: 0 (ok) | call=69 key=dummy1_stop_0 confirmed=true cib-update=125
> Jun 16 11:29:01 [7689] virt-134.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy1_stop_0 (14) confirmed on virt-134 (rc=0)

after fix
-----------

> [root@virt-039 20:14:26 ~]# rpm -q pacemaker
> pacemaker-1.1.23-1.el7.x86_64

Set cluster-recheck-interval to 2 minutes (120 seconds) and verify.

> [root@virt-039 20:14:39 ~]# pcs property set cluster-recheck-interval=120; pcs property list
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: STSRHTS21922
> cluster-recheck-interval: 120
> dc-version: 1.1.23-1.el7-9acf116022
> have-watchdog: false

Configure an ocf:pacemaker:Dummy resource "dummy2" on node "virt-039" with allow-migrate=true and the op_sleep option set to 15 seconds.
> [root@virt-039 20:14:47 ~]# pcs resource create dummy2 ocf:pacemaker:Dummy op_sleep=15 op monitor interval=20 timeout=30 meta allow-migrate=true
> [root@virt-039 20:15:05 ~]# pcs status
> Cluster name: STSRHTS21922
> Stack: corosync
> Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
> Last updated: Tue Jun 23 20:15:54 2020
> Last change: Tue Jun 23 20:15:05 2020 by root via cibadmin on virt-039
> 3 nodes configured
> 5 resource instances configured
> Online: [ virt-024 virt-039 virt-041 ]
> Full list of resources:
> fence-virt-024 (stonith:fence_xvm): Started virt-024
> fence-virt-039 (stonith:fence_xvm): Started virt-039
> fence-virt-041 (stonith:fence_xvm): Started virt-041
> dummy1 (ocf::pacemaker:Dummy): Started virt-024
> dummy2 (ocf::pacemaker:Dummy): Started virt-039
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

Add a finite location constraint preference for the resource "dummy2" on a node it is not running on, "virt-041", so that "dummy2" begins to live-migrate to node "virt-041". Before the migrate_to action can complete on node "virt-039", where "dummy2" originally started (i.e. within the configured op_sleep time limit), the SIGKILL signal is sent to the resource agent process on node "virt-039" to fail the migration.

> [root@virt-039 20:15:54 ~]# pcs constraint location dummy2 prefers virt-041=1
> [root@virt-039 20:16:27 ~]# ps aux | grep Dummy
> root 2795 0.5 0.0 115672 1888 ? S 20:16 0:00 /bin/sh /usr/lib/ocf/resource.d/pacemaker/Dummy migrate_to
> [...]
>
> [root@virt-039 20:16:28 ~]# kill -9 2795

The migration fails; resource "dummy2" is stopped and then started on node "virt-041".
> [root@virt-039 20:16:35 ~]# pcs status
> Cluster name: STSRHTS21922
> Stack: corosync
> Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
> Last updated: Tue Jun 23 20:17:03 2020
> Last change: Tue Jun 23 20:16:27 2020 by root via cibadmin on virt-039
> 3 nodes configured
> 5 resource instances configured
> Online: [ virt-024 virt-039 virt-041 ]
> Full list of resources:
> fence-virt-024 (stonith:fence_xvm): Started virt-024
> fence-virt-039 (stonith:fence_xvm): Started virt-039
> fence-virt-041 (stonith:fence_xvm): Started virt-041
> dummy1 (ocf::pacemaker:Dummy): Started virt-024
> dummy2 (ocf::pacemaker:Dummy): Starting virt-041
> Failed Resource Actions:
> * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
> last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
> [root@virt-039 20:17:03 ~]# pcs status
> Cluster name: STSRHTS21922
> Stack: corosync
> Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
> Last updated: Tue Jun 23 20:17:41 2020
> Last change: Tue Jun 23 20:16:27 2020 by root via cibadmin on virt-039
> 3 nodes configured
> 5 resource instances configured
> Online: [ virt-024 virt-039 virt-041 ]
> Full list of resources:
> fence-virt-024 (stonith:fence_xvm): Started virt-024
> fence-virt-039 (stonith:fence_xvm): Started virt-039
> fence-virt-041 (stonith:fence_xvm): Started virt-041
> dummy1 (ocf::pacemaker:Dummy): Started virt-024
> dummy2 (ocf::pacemaker:Dummy): Started virt-041
> Failed Resource Actions:
> * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
> last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

Add a stronger constraint preference for node "virt-039" so that "dummy2" is live-migrated back to the original node "virt-039".

> [root@virt-039 20:17:42 ~]# pcs constraint location dummy2 prefers virt-039=2; pcs status
> Cluster name: STSRHTS21922
> Stack: corosync
> Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
> Last updated: Tue Jun 23 20:18:11 2020
> Last change: Tue Jun 23 20:18:11 2020 by root via cibadmin on virt-039
> 3 nodes configured
> 5 resource instances configured
> Online: [ virt-024 virt-039 virt-041 ]
> Full list of resources:
> fence-virt-024 (stonith:fence_xvm): Started virt-024
> fence-virt-039 (stonith:fence_xvm): Started virt-039
> fence-virt-041 (stonith:fence_xvm): Started virt-041
> dummy1 (ocf::pacemaker:Dummy): Started virt-024
> dummy2 (ocf::pacemaker:Dummy): Migrating virt-041
> Failed Resource Actions:
> * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
> last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> [root@virt-039 20:18:11 ~]# pcs status
> Cluster name: STSRHTS21922
> Stack: corosync
> Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
> Last updated: Tue Jun 23 20:19:12 2020
> Last change: Tue Jun 23 20:18:11 2020 by root via cibadmin on virt-039
> 3 nodes configured
> 5 resource instances configured
> Online: [ virt-024 virt-039 virt-041 ]
> Full list of resources:
> fence-virt-024 (stonith:fence_xvm): Started virt-024
> fence-virt-039 (stonith:fence_xvm): Started virt-039
> fence-virt-041 (stonith:fence_xvm): Started virt-041
> dummy1 (ocf::pacemaker:Dummy): Started virt-024
> dummy2 (ocf::pacemaker:Dummy): Started virt-039
> Failed Resource Actions:
> * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
> last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

After 2 minutes, the cluster-recheck-interval lapses and "dummy2" is still running on the original node "virt-039".

> [root@virt-039 20:20:34 ~]# pcs status
> Cluster name: STSRHTS21922
> Stack: corosync
> Current DC: virt-039 (version 1.1.23-1.el7-9acf116022) - partition with quorum
> Last updated: Tue Jun 23 20:21:02 2020
> Last change: Tue Jun 23 20:18:11 2020 by root via cibadmin on virt-039
> 3 nodes configured
> 5 resource instances configured
> Online: [ virt-024 virt-039 virt-041 ]
> Full list of resources:
> fence-virt-024 (stonith:fence_xvm): Started virt-024
> fence-virt-039 (stonith:fence_xvm): Started virt-039
> fence-virt-041 (stonith:fence_xvm): Started virt-041
> dummy1 (ocf::pacemaker:Dummy): Started virt-024
> dummy2 (ocf::pacemaker:Dummy): Started virt-039
> Failed Resource Actions:
> * dummy2_migrate_to_0 on virt-039 'unknown error' (1): call=54, status=Error, exitreason='',
> last-rc-change='Tue Jun 23 20:16:27 2020', queued=0ms, exec=8603ms
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

Log excerpts. Changing the location constraint and failing the migration:

> Jun 23 20:16:27 [1624] virt-039.cluster-qe.lab.eng.brq.redhat.com cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="location-dummy2-virt-041-1" node="virt-041" rsc="dummy2" score="1"/>
> Jun 23 20:16:27 [1625] virt-039.cluster-qe.lab.eng.brq.redhat.com stonith-ng: info: update_cib_stonith_devices_v2: Updating device list from the cib: create constraints
> Jun 23 20:16:27 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy2 (ocf::pacemaker:Dummy): Started virt-039
> Jun 23 20:16:27 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: RecurringOp: Start recurring monitor (20s) for dummy2 on virt-041
> Jun 23 20:16:27 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: notice: LogAction: * Migrate dummy2 ( virt-039 -> virt-041 )
> Jun 23 20:16:27 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating migrate_to operation dummy2_migrate_to_0 locally on virt-039 | action 17
> Jun 23 20:16:27 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com lrmd: info: cancel_recurring_action: Cancelling ocf operation dummy2_monitor_20000
> Jun 23 20:16:27 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_execute: executing - rsc:dummy2 action:migrate_to call_id:54
> Jun 23 20:16:27 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: process_lrm_event: Result of monitor operation for dummy2 on virt-039: Cancelled | call=51 key=dummy2_monitor_20000 confirmed=true
> [...]
> Jun 23 20:16:35 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com lrmd: warning: operation_finished: dummy2_migrate_to_0:2795 - terminated with signal 9
> Jun 23 20:16:35 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy2 action:migrate_to call_id:54 pid:2795 exit-code:1 exec-time:8603ms queue-time:0ms
> Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: error: process_lrm_event: Result of migrate_to operation for dummy2 on virt-039: Error | call=54 key=dummy2_migrate_to_0 confirmed=true status=4 cib-update=101
> Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: warning: status_from_rc: Action 17 (dummy2_migrate_to_0) on virt-039 failed (target: 0 vs.
rc: 1): Error > Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_migrate_to_0 (17) confirmed on virt-039 (rc=1) > Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: process_graph_event: Detected action (15.17) dummy2_migrate_to_0.54=unknown error: failed > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-039 is online > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-024 is online > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-041 is online > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: warning: unpack_rsc_op_failure: Processing failed migrate_to of dummy2 on virt-039: unknown error | rc=1 > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy2 (ocf::pacemaker:Dummy): FAILED > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: 1 : virt-041 > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: 2 : virt-039 > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: error: native_create_actions: Resource dummy2 is active on 2 nodes (attempting recovery) > Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: RecurringOp: Start recurring monitor (20s) for dummy2 on virt-041 Resource "dummy2" is stopped and recovered on node "virt-041". 
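The "active on 2 nodes" error engages Pacemaker's default multiple-active recovery policy (the `multiple-active="stop_start"` resource meta-attribute): stop every reported copy, then start one copy on the chosen node. A minimal model of that behavior, purely illustrative and not Pacemaker's actual scheduler code:

```python
# Illustrative sketch (NOT Pacemaker's implementation) of the default
# multiple-active="stop_start" recovery policy: stop the resource on
# every node where it is reported active, then start exactly one copy
# on the node the scheduler chose -- the stop/stop/start sequence
# visible in the log excerpt for dummy2.
def recover_multiple_active(active_nodes, chosen_node):
    actions = [("stop", node) for node in active_nodes]
    actions.append(("start", chosen_node))
    return actions

print(recover_multiple_active(["virt-041", "virt-039"], "virt-041"))
# [('stop', 'virt-041'), ('stop', 'virt-039'), ('start', 'virt-041')]
```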
> Jun 23 20:16:35 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: notice: LogAction: * Recover dummy2 ( virt-041 )
> Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy2_stop_0 on virt-041 | action 15
> Jun 23 20:16:35 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy2_stop_0 locally on virt-039 | action 3
> Jun 23 20:16:35 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_execute: executing - rsc:dummy2 action:stop call_id:57
> [...]
> Jun 23 20:16:50 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_stop_0 (15) confirmed on virt-041 (rc=0)
> Jun 23 20:16:50 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy2 action:stop call_id:57 pid:2813 exit-code:0 exec-time:15108ms queue-time:0ms
> Jun 23 20:16:50 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: process_lrm_event: Result of stop operation for dummy2 on virt-039: 0 (ok) | call=57 key=dummy2_stop_0 confirmed=true cib-update=104
> Jun 23 20:16:50 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_stop_0 (3) confirmed on virt-039 (rc=0)
> Jun 23 20:16:50 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating start operation dummy2_start_0 on virt-041 | action 16
> [...]
> Jun 23 20:17:05 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_start_0 (16) confirmed on virt-041 (rc=0)
> Jun 23 20:17:05 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating monitor operation dummy2_monitor_20000 on virt-041 | action 17
> [...]
> Jun 23 20:17:21 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_monitor_20000 (17) confirmed on virt-041 (rc=0)

Adding a stronger constraint preference for node "virt-039"; "dummy2" is then live-migrated back to the original node "virt-039".

> Jun 23 20:18:11 [1624] virt-039.cluster-qe.lab.eng.brq.redhat.com cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="location-dummy2-virt-039-2" node="virt-039" rsc="dummy2" score="2"/>
> Jun 23 20:18:11 [1625] virt-039.cluster-qe.lab.eng.brq.redhat.com stonith-ng: info: update_cib_stonith_devices_v2: Updating device list from the cib: create constraints
> Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-039 is online
> Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-024 is online
> Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-041 is online
> Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: warning: unpack_rsc_op_failure: Processing failed migrate_to of dummy2 on virt-039: unknown error | rc=1
> Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy2 (ocf::pacemaker:Dummy): Started virt-041
> Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: RecurringOp: Start recurring monitor (20s) for dummy2 on virt-039
> Jun 23 20:18:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: notice: LogAction: * Migrate dummy2 ( virt-041 -> virt-039 )
> Jun 23 20:18:11 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating migrate_to operation dummy2_migrate_to_0 on virt-041 | action 17
> [...]
> Jun 23 20:18:26 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_migrate_to_0 (17) confirmed on virt-041 (rc=0)
> Jun 23 20:18:26 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating migrate_from operation dummy2_migrate_from_0 locally on virt-039 | action 18
> Jun 23 20:18:26 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_execute: executing - rsc:dummy2 action:migrate_from call_id:59
> [...]
> Jun 23 20:18:41 [1626] virt-039.cluster-qe.lab.eng.brq.redhat.com lrmd: info: log_finished: finished - rsc:dummy2 action:migrate_from call_id:59 pid:2891 exit-code:0 exec-time:15051ms queue-time:0ms
> Jun 23 20:18:41 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: process_lrm_event: Result of migrate_from operation for dummy2 on virt-039: 0 (ok) | call=59 key=dummy2_migrate_from_0 confirmed=true cib-update=107
> Jun 23 20:18:41 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_migrate_from_0 (18) confirmed on virt-039 (rc=0)
> Jun 23 20:18:41 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating stop operation dummy2_stop_0 on virt-041 | action 14
> [...]
> Jun 23 20:18:56 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_stop_0 (14) confirmed on virt-041 (rc=0)
> Jun 23 20:18:56 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: notice: te_rsc_command: Initiating monitor operation dummy2_monitor_20000 locally on virt-039 | action 16
> [...]
> Jun 23 20:19:11 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: process_lrm_event: Result of monitor operation for dummy2 on virt-039: 0 (ok) | call=61 key=dummy2_monitor_20000 confirmed=false cib-update=109
> Jun 23 20:19:11 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: match_graph_event: Action dummy2_monitor_20000 (16) confirmed on virt-039 (rc=0)

After 2 minutes, the cluster-recheck-interval elapses and "dummy2" is still running on the original node "virt-039".

> Jun 23 20:21:11 [1629] virt-039.cluster-qe.lab.eng.brq.redhat.com crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-039 is online
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-024 is online
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: determine_online_status: Node virt-041 is online
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: fence-virt-024 (stonith:fence_xvm): Started virt-024
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: fence-virt-039 (stonith:fence_xvm): Started virt-039
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: fence-virt-041 (stonith:fence_xvm): Started virt-041
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy1 (ocf::pacemaker:Dummy): Started virt-024
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: common_print: dummy2 (ocf::pacemaker:Dummy): Started virt-039
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: LogActions: Leave fence-virt-024 (Started virt-024)
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: LogActions: Leave fence-virt-039 (Started virt-039)
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: LogActions: Leave fence-virt-041 (Started virt-041)
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: LogActions: Leave dummy1 (Started virt-024)
> Jun 23 20:21:11 [1628] virt-039.cluster-qe.lab.eng.brq.redhat.com pengine: info: LogActions: Leave dummy2 (Started virt-039)

marking verified in pacemaker-1.1.23-1.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3951
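The behavior verified above matches the Doc Text: a failed migration away from a node no longer marks the resource as a "dangling migration" there once a later start or migration back succeeds. A simplified model of the corrected check, sketched from the Doc Text description rather than Pacemaker's actual scheduler code:

```python
# Simplified model of the fixed "dangling migration" check described in
# the Doc Text. Illustrative only -- NOT Pacemaker's implementation.
# op_history: list of (operation, rc) pairs for one resource on one
# node, oldest first; rc 0 means success (OCF_SUCCESS).
def is_dangling_migration(op_history):
    dangling = False
    for op, rc in op_history:
        if op == "migrate_to" and rc != 0:
            # A failed migration away from this node may leave the
            # resource dangling here...
            dangling = True
        elif op in ("start", "migrate_from", "stop") and rc == 0:
            # ...but a later successful start, migration back, or stop
            # supersedes it: the resource is not dangling (the fix).
            dangling = False
    return dangling

# Before the fix, the failed migrate_to alone drove the decision, so
# this history caused a stop/start loop at every cluster-recheck-interval.
print(is_dangling_migration([("migrate_to", 1), ("migrate_from", 0)]))
# False: the resource is simply running here
```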