Bug 1684306 - Interrupted live migration will get full start rather than completed migration
Summary: Interrupted live migration will get full start rather than completed migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.0
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 8.1
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1682116
Blocks: 1695247
 
Reported: 2019-02-28 22:55 UTC by Ken Gaillot
Modified: 2020-11-14 12:13 UTC
CC List: 7 users

Fixed In Version: pacemaker-2.0.1-5.el8
Doc Type: Bug Fix
Doc Text:
Cause: The Pacemaker packages in RHEL 8.0.0 record partial live migrations as stops. This is a regression from the packages in RHEL 7.
Consequence: If the transition containing the live migration is aborted before the migration completes, the cluster schedules a full start on the destination node rather than completing the migration. This is likely to cause the resource to fail.
Fix: Pacemaker now correctly records partial live migrations.
Result: Live migrations complete successfully even if the transition they start in is interrupted.
Clone Of:
Clones: 1695247
Environment:
Last Closed: 2019-11-05 20:57:32 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2019:3385  0        None      None    None     2019-11-05 20:58:28 UTC

Description Ken Gaillot 2019-02-28 22:55:15 UTC
Description of problem: The Pacemaker packages in RHEL 8.0.0 have a bug that causes them to record partial live migrations as stops. If the transition containing the migration is aborted before the migration completes, the cluster will schedule a full start on the destination node rather than completing the migration. This is likely to cause the resource to fail.


Version-Release number of selected component (if applicable): pacemaker-2.0.1-4.el8

How reproducible: Reliably


Steps to Reproduce:
1. Configure and start a cluster with at least 2 nodes.

2. Create a migratable resource, preferably one whose migrate_to action takes a nontrivial amount of time to complete, for example:

pcs resource create bztest ocf:pacemaker:Dummy op_sleep=10 meta allow-migrate=true

3. Initiate a live migration of the resource, and before the migrate_to action completes, make a change that will abort the transition, for example:

pcs resource move bztest
sleep 3
pcs node attribute $ANY_NODE_NAME foo="$(date)"


4. After the cluster settles, check the logs on the DC to see what action was scheduled on the destination node (if desired, you can confirm there was a "Transition ... aborted" message between the migrate_to action and that action). You can also run "crm_mon -o1" to see whether migrate_from occurs in the resource history. Example commands are given below.
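
For example, on the DC (a minimal sketch, assuming the bztest resource name from step 2 and the default log location /var/log/pacemaker/pacemaker.log):

# confirm the transition was aborted after the migrate_to started
grep "Transition.*aborted" /var/log/pacemaker/pacemaker.log

# see whether a start or a migrate_from was scheduled on the destination node
grep -E "Initiating (start|migrate_from) operation bztest_" /var/log/pacemaker/pacemaker.log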

Actual results: The DC logs and crm_mon output will show that a start was scheduled on the destination node.

notice: Initiating migrate_to operation bztest_migrate_to_0 on node1 
...
notice: Initiating start operation bztest_start_0 on node2


Expected results: The DC logs and crm_mon output will show that a migrate_from was scheduled on the destination node.

notice: Initiating migrate_to operation bztest_migrate_to_0 on node1
...
notice: Initiating migrate_from operation bztest_migrate_from_0 on node2

Additional info: This is a regression from the version in RHEL 7.

Comment 1 Ken Gaillot 2019-02-28 23:04:16 UTC
Fixed upstream by commit 0dca216

Comment 2 Patrik Hagara 2019-03-01 08:22:04 UTC
qa-ack+, reproducer in bug description

Comment 7 Patrik Hagara 2019-08-05 15:40:01 UTC
before fix (2.0.1-4.el8):
=========================
 
> [root@virt-182 ~]# rpm -q pacemaker
> pacemaker-2.0.1-4.el8.x86_64
> [root@virt-182 ~]# pcs status
> Cluster name: STSRHTS16307
> Stack: corosync
> Current DC: virt-182 (version 2.0.1-4.el8-0eb7991564) - partition with quorum
> Last updated: Mon Aug  5 17:24:59 2019
> Last change: Mon Aug  5 17:15:30 2019 by root via cibadmin on virt-182
> 
> 3 nodes configured
> 3 resources configured
> 
> Online: [ virt-182 virt-183 virt-184 ]
> 
> Full list of resources:
> 
>  fence-virt-182	(stonith:fence_xvm):	Started virt-182
>  fence-virt-183	(stonith:fence_xvm):	Started virt-183
>  fence-virt-184	(stonith:fence_xvm):	Started virt-184
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> [root@virt-182 ~]# pcs resource create bztest ocf:pacemaker:Dummy op_sleep=10 meta allow-migrate=true
> [root@virt-182 ~]# ANY_NODE_NAME=virt-182
> [root@virt-182 ~]# pcs resource move bztest
> Warning: Creating location constraint cli-ban-bztest-on-virt-182 with a score of -INFINITY for resource bztest on node virt-182.
> This will prevent bztest from running on virt-182 until the constraint is removed. This will be the case even if virt-182 is the last node in the cluster.
> [root@virt-182 ~]# sleep 3
> [root@virt-182 ~]# pcs node attribute $ANY_NODE_NAME foo="$(date)"
> [root@virt-182 ~]# crm_mon -o1
> Stack: corosync
> Current DC: virt-182 (version 2.0.1-4.el8-0eb7991564) - partition with quorum
> Last updated: Mon Aug  5 17:30:08 2019
> Last change: Mon Aug  5 17:25:54 2019 by root via crm_attribute on virt-182
> 
> 3 nodes configured
> 4 resources configured
> 
> Online: [ virt-182 virt-183 virt-184 ]
> 
> Active resources:
> 
>  fence-virt-182	(stonith:fence_xvm):	Started virt-182
>  fence-virt-183	(stonith:fence_xvm):	Started virt-183
>  fence-virt-184	(stonith:fence_xvm):	Started virt-184
>  bztest	(ocf::pacemaker:Dummy):	Started virt-183
> 
> Operations:
> * Node virt-182:
>    fence-virt-182: migration-threshold=1000000
>     + (6) start: rc=0 (ok)
>     + (7) monitor: interval=60000ms rc=0 (ok)
>    bztest: migration-threshold=1000000
>     + (21) monitor: interval=10000ms rc=0 (ok)
>     + (23) stop: rc=0 (ok)
> * Node virt-183:
>    fence-virt-183: migration-threshold=1000000
>     + (10) start: rc=0 (ok)
>     + (11) monitor: interval=60000ms rc=0 (ok)
>    bztest: migration-threshold=1000000
>     + (20) start: rc=0 (ok)
>     + (21) monitor: interval=10000ms rc=0 (ok)
> * Node virt-184:
>    fence-virt-184: migration-threshold=1000000
>     + (14) start: rc=0 (ok)
>     + (15) monitor: interval=60000ms rc=0 (ok)
> [root@virt-182 ~]# grep -E "bztest_(start|migrate)_" /var/log/pacemaker/pacemaker.log
> Aug 05 17:25:21 virt-182 pacemaker-controld  [27253] (te_rsc_command) 	notice: Initiating start operation bztest_start_0 locally on virt-182 | action 13
> Aug 05 17:25:21 virt-182 pacemaker-controld  [27253] (do_lrm_rsc_op) 	info: Performing key=13:4:0:702d9252-6484-4130-aecf-0c80ff803507 op=bztest_start_0
> Aug 05 17:25:21 virt-182 pacemaker-based     [27248] (cib_perform_op) 	info: +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']:  @operation_key=bztest_start_0, @operation=start, @transition-key=13:4:0:702d9252-6484-4130-aecf-0c80ff803507, @transition-magic=-1:193;13:4:0:702d9252-6484-4130-aecf-0c80ff803507, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565018721, @last-rc-change=1565018721, @exec
> Aug 05 17:25:31 virt-182 pacemaker-controld  [27253] (process_lrm_event) 	notice: Result of start operation for bztest on virt-182: 0 (ok) | call=20 key=bztest_start_0 confirmed=true cib-update=58
> Aug 05 17:25:31 virt-182 pacemaker-controld  [27253] (match_graph_event) 	info: Action bztest_start_0 (13) confirmed on virt-182 (rc=0)
> Aug 05 17:25:50 virt-182 pacemaker-controld  [27253] (te_rsc_command) 	notice: Initiating migrate_to operation bztest_migrate_to_0 locally on virt-182 | action 14
> Aug 05 17:25:50 virt-182 pacemaker-controld  [27253] (do_lrm_rsc_op) 	info: Performing key=14:5:0:702d9252-6484-4130-aecf-0c80ff803507 op=bztest_migrate_to_0
> Aug 05 17:26:00 virt-182 pacemaker-controld  [27253] (process_lrm_event) 	notice: Result of migrate_to operation for bztest on virt-182: 0 (ok) | call=23 key=bztest_migrate_to_0 confirmed=true cib-update=63
> Aug 05 17:26:00 virt-182 pacemaker-controld  [27253] (te_rsc_command) 	notice: Initiating start operation bztest_start_0 on virt-183 | action 10
> Aug 05 17:26:00 virt-182 pacemaker-based     [27248] (cib_perform_op) 	info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']:  @operation_key=bztest_start_0, @operation=start, @transition-key=10:6:0:702d9252-6484-4130-aecf-0c80ff803507, @transition-magic=-1:193;10:6:0:702d9252-6484-4130-aecf-0c80ff803507, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565018760, @last-rc-change=1565018760, @exec
> Aug 05 17:26:10 virt-182 pacemaker-controld  [27253] (match_graph_event) 	info: Action bztest_start_0 (10) confirmed on virt-183 (rc=0)

The migration transition was interrupted by the node attribute change, and the cluster then incorrectly rescheduled a full start on the destination node instead of the migrate_from.


after fix (2.0.2-2.el8):
========================

> [root@virt-008 ~]# rpm -q pacemaker
> pacemaker-2.0.2-2.el8.x86_64
> [root@virt-008 ~]# pcs status
> Cluster name: STSRHTS22238
> Stack: corosync
> Current DC: virt-017 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Mon Aug  5 16:22:55 2019
> Last change: Mon Aug  5 12:47:32 2019 by root via cibadmin on virt-008
> 
> 3 nodes configured
> 3 resources configured
> 
> Online: [ virt-008 virt-009 virt-017 ]
> 
> Full list of resources:
> 
>  fence-virt-008 (stonith:fence_xvm):    Started virt-008
>  fence-virt-009 (stonith:fence_xvm):    Started virt-009
>  fence-virt-017 (stonith:fence_xvm):    Started virt-017
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> [root@virt-008 ~]# pcs resource create bztest ocf:pacemaker:Dummy op_sleep=10 meta allow-migrate=true
> [root@virt-008 ~]# ANY_NODE_NAME=virt-017
> [root@virt-008 ~]# pcs resource move bztest
> Warning: Creating location constraint 'cli-ban-bztest-on-virt-008' with a score of -INFINITY for resource bztest on virt-008.
>         This will prevent bztest from running on virt-008 until the constraint is removed
>         This will be the case even if virt-008 is the last node in the cluster
> [root@virt-008 ~]# sleep 3
> [root@virt-008 ~]# pcs node attribute $ANY_NODE_NAME foo="$(date)"

wait for cluster to settle...
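(a minimal way to do this, assuming the standard pacemaker command-line tools, is "crm_resource --wait", which blocks until no further cluster actions are pending)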

> [root@virt-008 ~]# crm_mon -o1
> Stack: corosync
> Current DC: virt-017 (version 2.0.2-2.el8-744a30d655) - partition with quorum
> Last updated: Mon Aug  5 16:31:32 2019
> Last change: Mon Aug  5 16:24:17 2019 by root via crm_attribute on virt-008
> 
> 3 nodes configured
> 4 resources configured
> 
> Online: [ virt-008 virt-009 virt-017 ]
> 
> Active resources:
> 
>  fence-virt-008 (stonith:fence_xvm):    Started virt-008
>  fence-virt-009 (stonith:fence_xvm):    Started virt-009
>  fence-virt-017 (stonith:fence_xvm):    Started virt-017
>  bztest (ocf::pacemaker:Dummy): Started virt-009
> 
> Operations:
> * Node virt-017:
>    fence-virt-017: migration-threshold=1000000
>     + (14) start: rc=0 (ok)
>     + (15) monitor: interval=60000ms rc=0 (ok)
> * Node virt-008:
>    fence-virt-008: migration-threshold=1000000
>     + (6) start: rc=0 (ok)
>     + (7) monitor: interval=60000ms rc=0 (ok)
>    bztest: migration-threshold=1000000
>     + (21) monitor: interval=10000ms rc=0 (ok)
>     + (24) stop: rc=0 (ok)
> * Node virt-009:
>    fence-virt-009: migration-threshold=1000000
>     + (10) start: rc=0 (ok)
>     + (11) monitor: interval=60000ms rc=0 (ok)
>    bztest: migration-threshold=1000000
>     + (20) migrate_from: rc=0 (ok)
>     + (21) monitor: interval=10000ms rc=0 (ok)
> [root@virt-008 ~]# ssh virt-017 'grep -E "bztest_(start|migrate)_" /var/log/pacemaker/pacemaker.log'
> Aug 05 16:23:25 virt-017 pacemaker-controld  [12494] (te_rsc_command)         notice: Initiating start operation bztest_start_0 on virt-008 | action 13
> Aug 05 16:23:25 virt-017 pacemaker-based     [12489] (cib_perform_op)         info: +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']:  @operation_key=bztest_start_0, @operation=start, @transition-key=13:18:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @transition-magic=-1:193;13:18:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565015005, @last-rc-change=1565015005, @ex
> Aug 05 16:23:35 virt-017 pacemaker-controld  [12494] (match_graph_event)      info: Action bztest_start_0 (13) confirmed on virt-008 (rc=0)
> Aug 05 16:24:13 virt-017 pacemaker-controld  [12494] (te_rsc_command)         notice: Initiating migrate_to operation bztest_migrate_to_0 on virt-008 | action 14
> Aug 05 16:24:13 virt-017 pacemaker-based     [12489] (cib_perform_op)         info: +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']:  @operation_key=bztest_migrate_to_0, @operation=migrate_to, @transition-key=14:19:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @transition-magic=-1:193;14:19:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565015053, @last-rc-change=15650
> Aug 05 16:24:23 virt-017 pacemaker-controld  [12494] (match_graph_event)      info: Action bztest_migrate_to_0 (14) confirmed on virt-008 (rc=0)
> Aug 05 16:24:23 virt-017 pacemaker-controld  [12494] (te_rsc_command)         notice: Initiating migrate_from operation bztest_migrate_from_0 on virt-009 | action 14
> Aug 05 16:24:23 virt-017 pacemaker-based     [12489] (cib_perform_op)         info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bztest']/lrm_rsc_op[@id='bztest_last_0']:  @operation_key=bztest_migrate_from_0, @operation=migrate_from, @transition-key=14:20:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @transition-magic=-1:193;14:20:0:1e2fc259-cc63-47b7-bb52-8ee2728836c6, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1565015063, @last-rc-change=1
> Aug 05 16:24:33 virt-017 pacemaker-controld  [12494] (match_graph_event)      info: Action bztest_migrate_from_0 (14) confirmed on virt-009 (rc=0)

The migration completed successfully, even though the transition it started in was interrupted by a node attribute change.


marking VERIFIED in pacemaker-2.0.2-2.el8

Comment 9 errata-xmlrpc 2019-11-05 20:57:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3385

