Bug 1684306

Summary: Interrupted live migration will get full start rather than completed migration

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 8 |
| Component | pacemaker |
| Version | 8.0 |
| Reporter | Ken Gaillot <kgaillot> |
| Assignee | Ken Gaillot <kgaillot> |
| QA Contact | cluster-qe <cluster-qe> |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Keywords | Regression, ZStream |
| CC | abeekhof, aherr, cfeist, cluster-maint, kgaillot, phagara, toneata |
| Target Milestone | rc |
| Target Release | 8.1 |
| Hardware | All |
| OS | All |
| Fixed In Version | pacemaker-2.0.1-5.el8 |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2019-11-05 20:57:32 UTC |
| Bug Depends On | 1682116 |
| Bug Blocks | 1695247 |

Doc Text:

Cause: The Pacemaker packages in RHEL 8.0.0 record partial live migrations as stops. This is a regression from the packages in RHEL 7.

Consequence: If the transition containing the live migration is aborted before the migration completes, the cluster schedules a full start on the destination node rather than completing the migration. This is likely to cause the resource to fail.

Fix: Pacemaker correctly records partial live migrations.

Result: Live migrations complete successfully even if the transition they start in is interrupted.
Description
Ken Gaillot
2019-02-28 22:55:15 UTC
Fixed upstream by commit 0dca216

qa-ack+, reproducer in bug description

before fix (2.0.1-4.el8):
=========================

> [root@virt-182 ~]# rpm -q pacemaker
> pacemaker-2.0.1-4.el8.x86_64
> [root@virt-182 ~]# pcs status
> Cluster name: STSRHTS16307
> Stack: corosync
> Current DC: virt-182 (version 2.0.1-4.el8-0eb7991564) - partition with quorum
> Last updated: Mon Aug 5 17:24:59 2019
> Last change: Mon Aug 5 17:15:30 2019 by root via cibadmin on virt-182
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-182 virt-183 virt-184 ]
>
> Full list of resources:
>
>  fence-virt-182 (stonith:fence_xvm): Started virt-182
>  fence-virt-183 (stonith:fence_xvm): Started virt-183
>  fence-virt-184 (stonith:fence_xvm): Started virt-184
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> [root@virt-182 ~]# pcs resource create bztest ocf:pacemaker:Dummy op_sleep=10 meta allow-migrate=true
> [root@virt-182 ~]# ANY_NODE_NAME=virt-182
> [root@virt-182 ~]# pcs resource move bztest
> Warning: Creating location constraint cli-ban-bztest-on-virt-182 with a score of -INFINITY for resource bztest on node virt-182.
> This will prevent bztest from running on virt-182 until the constraint is removed. This will be the case even if virt-182 is the last node in the cluster.
> [root@virt-182 ~]# sleep 3
> [root@virt-182 ~]# pcs node attribute $ANY_NODE_NAME foo="$(date)"
> [root@virt-182 ~]# crm_mon -o1
> Stack: corosync
> Current DC: virt-182 (version 2.0.1-4.el8-0eb7991564) - partition with quorum
> Last updated: Mon Aug 5 17:30:08 2019
> Last change: Mon Aug 5 17:25:54 2019 by root via crm_attribute on virt-182
>
> 3 nodes configured
> 4 resources configured
>
> Online: [ virt-182 virt-183 virt-184 ]
>
> Active resources:
>
>  fence-virt-182 (stonith:fence_xvm): Started virt-182
>  fence-virt-183 (stonith:fence_xvm): Started virt-183
>  fence-virt-184 (stonith:fence_xvm): Started virt-184
>  bztest (ocf::pacemaker:Dummy): Started virt-183
>
> Operations:
> * Node virt-182:
>    fence-virt-182: migration-threshold=1000000
>     + (6) start: rc=0 (ok)
>     + (7) monitor: interval=60000ms rc=0 (ok)
>    bztest: migration-threshold=1000000
>     + (21) monitor: interval=10000ms rc=0 (ok)
>     + (23) stop: rc=0 (ok)
> * Node virt-183:
>    fence-virt-183: migration-threshold=1000000
>     + (10) start: rc=0 (ok)
>     + (11) monitor: interval=60000ms rc=0 (ok)
>    bztest: migration-threshold=1000000
>     + (20) start: rc=0 (ok)
>     + (21) monitor: interval=10000ms rc=0 (ok)
> * Node virt-184:
>    fence-virt-184: migration-threshold=1000000
>     + (14) start: rc=0 (ok)
>     + (15) monitor: interval=60000ms rc=0 (ok)
> [root@virt-182 ~]# grep -E "bztest_(start|migrate)_" /var/log/pacemaker/pacemaker.log
> Aug 05 17:25:21 virt-182 pacemaker-controld [27253] (te_rsc_command) notice: Initiating start operation bztest_start_0 locally on virt-182 | action 13
> Aug 05 17:25:21 virt-182 pacemaker-controld [27253] (do_lrm_rsc_op) info: Performing key=13:4:0:702d9252-6484-4130-aecf-0c80ff803507 op=bztest_start_0
> Aug 05 17:25:21 virt-182 pacemaker-based [27248] (cib_perform_op) info: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']: @operation_key=bztest_start_0, @operation=start, @transition-key=13:4:0:702d9252-6484-4130-aecf-0c80ff803507, @transition-magic=-1:193;13:4:0:702d9252-6484-4130-aecf-0c80ff803507, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565018721, @last-rc-change=1565018721, @exec
> Aug 05 17:25:31 virt-182 pacemaker-controld [27253] (process_lrm_event) notice: Result of start operation for bztest on virt-182: 0 (ok) | call=20 key=bztest_start_0 confirmed=true cib-update=58
> Aug 05 17:25:31 virt-182 pacemaker-controld [27253] (match_graph_event) info: Action bztest_start_0 (13) confirmed on virt-182 (rc=0)
> Aug 05 17:25:50 virt-182 pacemaker-controld [27253] (te_rsc_command) notice: Initiating migrate_to operation bztest_migrate_to_0 locally on virt-182 | action 14
> Aug 05 17:25:50 virt-182 pacemaker-controld [27253] (do_lrm_rsc_op) info: Performing key=14:5:0:702d9252-6484-4130-aecf-0c80ff803507 op=bztest_migrate_to_0
> Aug 05 17:26:00 virt-182 pacemaker-controld [27253] (process_lrm_event) notice: Result of migrate_to operation for bztest on virt-182: 0 (ok) | call=23 key=bztest_migrate_to_0 confirmed=true cib-update=63
> Aug 05 17:26:00 virt-182 pacemaker-controld [27253] (te_rsc_command) notice: Initiating start operation bztest_start_0 on virt-183 | action 10
> Aug 05 17:26:00 virt-182 pacemaker-based [27248] (cib_perform_op) info: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']: @operation_key=bztest_start_0, @operation=start, @transition-key=10:6:0:702d9252-6484-4130-aecf-0c80ff803507, @transition-magic=-1:193;10:6:0:702d9252-6484-4130-aecf-0c80ff803507, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565018760, @last-rc-change=1565018760, @exec
> Aug 05 17:26:10 virt-182 pacemaker-controld [27253] (match_graph_event) info: Action bztest_start_0 (10) confirmed on virt-183 (rc=0)

migration transition was interrupted by node attribute change and then rescheduled as a start operation, which is incorrect

after fix (2.0.2-2.el8):
========================

> [root@virt-008 ~]# rpm -q pacemaker
> pacemaker-2.0.2-2.el8.x86_64
> [root@virt-008 ~]# pcs status
> Cluster name: STSRHTS22238
> Stack: corosync
> Current DC: virt-017 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Mon Aug 5 16:22:55 2019
> Last change: Mon Aug 5 12:47:32 2019 by root via cibadmin on virt-008
>
> 3 nodes configured
> 3 resources configured
>
> Online: [ virt-008 virt-009 virt-017 ]
>
> Full list of resources:
>
>  fence-virt-008 (stonith:fence_xvm): Started virt-008
>  fence-virt-009 (stonith:fence_xvm): Started virt-009
>  fence-virt-017 (stonith:fence_xvm): Started virt-017
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> [root@virt-008 ~]# pcs resource create bztest ocf:pacemaker:Dummy op_sleep=10 meta allow-migrate=true
> [root@virt-008 ~]# ANY_NODE_NAME=virt-017
> [root@virt-008 ~]# pcs resource move bztest
> Warning: Creating location constraint 'cli-ban-bztest-on-virt-008' with a score of -INFINITY for resource bztest on virt-008.
> This will prevent bztest from running on virt-008 until the constraint is removed
> This will be the case even if virt-008 is the last node in the cluster
> [root@virt-008 ~]# sleep 3
> [root@virt-008 ~]# pcs node attribute $ANY_NODE_NAME foo="$(date)"

wait for cluster to settle...

> [root@virt-008 ~]# crm_mon -o1
> Stack: corosync
> Current DC: virt-017 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Mon Aug 5 16:31:32 2019
> Last change: Mon Aug 5 16:24:17 2019 by root via crm_attribute on virt-008
>
> 3 nodes configured
> 4 resources configured
>
> Online: [ virt-008 virt-009 virt-017 ]
>
> Active resources:
>
>  fence-virt-008 (stonith:fence_xvm): Started virt-008
>  fence-virt-009 (stonith:fence_xvm): Started virt-009
>  fence-virt-017 (stonith:fence_xvm): Started virt-017
>  bztest (ocf::pacemaker:Dummy): Started virt-009
>
> Operations:
> * Node virt-017:
>    fence-virt-017: migration-threshold=1000000
>     + (14) start: rc=0 (ok)
>     + (15) monitor: interval=60000ms rc=0 (ok)
> * Node virt-008:
>    fence-virt-008: migration-threshold=1000000
>     + (6) start: rc=0 (ok)
>     + (7) monitor: interval=60000ms rc=0 (ok)
>    bztest: migration-threshold=1000000
>     + (21) monitor: interval=10000ms rc=0 (ok)
>     + (24) stop: rc=0 (ok)
> * Node virt-009:
>    fence-virt-009: migration-threshold=1000000
>     + (10) start: rc=0 (ok)
>     + (11) monitor: interval=60000ms rc=0 (ok)
>    bztest: migration-threshold=1000000
>     + (20) migrate_from: rc=0 (ok)
>     + (21) monitor: interval=10000ms rc=0 (ok)
> [root@virt-008 ~]# ssh virt-017 'grep -E "bztest_(start|migrate)_" /var/log/pacemaker/pacemaker.log'
> Aug 05 16:23:25 virt-017 pacemaker-controld [12494] (te_rsc_command) notice: Initiating start operation bztest_start_0 on virt-008 | action 13
> Aug 05 16:23:25 virt-017 pacemaker-based [12489] (cib_perform_op) info: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']: @operation_key=bztest_start_0, @operation=start, @transition-key=13:18:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @transition-magic=-1:193;13:18:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565015005, @last-rc-change=1565015005, @ex
> Aug 05 16:23:35 virt-017 pacemaker-controld [12494] (match_graph_event) info: Action bztest_start_0 (13) confirmed on virt-008 (rc=0)
> Aug 05 16:24:13 virt-017 pacemaker-controld [12494] (te_rsc_command) notice: Initiating migrate_to operation bztest_migrate_to_0 on virt-008 | action 14
> Aug 05 16:24:13 virt-017 pacemaker-based [12489] (cib_perform_op) info: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']: @operation_key=bztest_migrate_to_0, @operation=migrate_to, @transition-key=14:19:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @transition-magic=-1:193;14:19:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565015053, @last-rc-change=15650
> Aug 05 16:24:23 virt-017 pacemaker-controld [12494] (match_graph_event) info: Action bztest_migrate_to_0 (14) confirmed on virt-008 (rc=0)
> Aug 05 16:24:23 virt-017 pacemaker-controld [12494] (te_rsc_command) notice: Initiating migrate_from operation bztest_migrate_from_0 on virt-009 | action 14
> Aug 05 16:24:23 virt-017 pacemaker-based [12489] (cib_perform_op) info: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']: @operation_key=bztest_migrate_from_0, @operation=migrate_from, @transition-key=14:20:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @transition-magic=-1:193;14:20:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565015063, @last-rc-change=1
> Aug 05 16:24:33 virt-017 pacemaker-controld [12494] (match_graph_event) info: Action bztest_migrate_from_0 (14) confirmed on virt-009 (rc=0)

migration was successful, even after being interrupted by a node attribute change

marking VERIFIED in pacemaker-2.0.2-2.el8

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3385
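For convenience, the verification steps quoted above condense to the following shell sketch. It assumes a healthy pcs-managed Pacemaker/corosync cluster with at least two online nodes; the resource name bztest and the node assigned to ANY_NODE_NAME simply follow the transcript and can be anything.

```shell
# Condensed reproducer sketch, following the transcript above.

# Create a Dummy resource whose operations take 10 seconds, with live migration enabled.
pcs resource create bztest ocf:pacemaker:Dummy op_sleep=10 meta allow-migrate=true

# Moving the resource schedules a live migration to another node.
pcs resource move bztest

# While migrate_to is still running, change any node attribute to abort the transition.
sleep 3
ANY_NODE_NAME=virt-182   # any node name in the cluster (assumed value)
pcs node attribute "$ANY_NODE_NAME" foo="$(date)"

# Broken behavior (2.0.1-4.el8): the log shows a second bztest_start_0 on the destination node.
# Fixed behavior (2.0.2-2.el8): the log shows bztest_migrate_from_0 completing the migration.
grep -E "bztest_(start|migrate)_" /var/log/pacemaker/pacemaker.log
```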