Created attachment 1828623 [details] migration log at source node

Description of problem:

Sometimes pacemaker decides to run a VirtualDomain live migration twice. The
second migration fails and causes the resource to be stopped.

Please check the migration-source.log file logged at node-1. Resource kvm-16
starts migrate_to at "09:15:32" and succeeds at "09:15:33", but later at
"09:15:33" pacemaker wants to migrate again. That attempt fails because the
VM is 'Not active locally'.

Check the migration-destination.log logged at node-2. The first migration
seems to be ignored since the node is busy doing other things.

Maybe the situation is like bug 2000595?

Version-Release number of selected component (if applicable):

pacemaker-2.1.0-8.el8.x86_64

How reproducible:

sometimes

Steps to Reproduce:
1. Create VMs and migrate them between different nodes concurrently for
   testing (a sketch of such a loop is under "Additional info" below).
2. Sometimes a migration fails.

Actual results:

Migration sometimes fails.

Expected results:

Migration always succeeds.

Additional info:
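For reference, the concurrent test in the steps above can be approximated with
a loop like the following on each node. This is only a sketch: the resource
name (kvm-16) and node name (love-2-test.example.com) are taken from the
attached logs, and the timings are arbitrary.

    # Run on node-1 while a mirror-image loop runs on node-2, so that live
    # migrations are triggered from both sides concurrently.
    while true; do
        pcs resource move kvm-16 love-2-test.example.com  # triggers migrate_to
        sleep 10
        pcs resource clear kvm-16                         # drop the move constraint
        sleep 10
    done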
Created attachment 1828624 [details] migration log at destination node
*** Bug 2010086 has been marked as a duplicate of this bug. ***
(In reply to tbsky from comment #0)
> Please check the migration-source.log file logged at node-1. Resource kvm-16
> starts migrate_to at "09:15:32" and succeeds at "09:15:33", but later at
> "09:15:33" pacemaker wants to migrate again. That attempt fails because the
> VM is 'Not active locally'.
>
> Check the migration-destination.log logged at node-2. The first migration
> seems to be ignored since the node is busy doing other things.
>
> Maybe the situation is like bug 2000595?

The situation has the same trigger: the DC election is being repeated
frequently, which causes each node to re-record its resource history in the
CIB status section. It's a timing issue that goes like this:

Oct 02 09:15:32 love-1-test.example.com pacemaker-controld [3325405] (te_rsc_command) notice: Initiating migrate_to operation kvm-16_migrate_to_0 on love-2-test.example.com | action 41

^ The first migration is started

Oct 02 09:15:32 love-1-test.example.com pacemaker-based [3325400] (cib_process_request) info: Completed cib_replace operation for section configuration: OK (rc=0, origin=love-1-test.example.com/cibadmin/2, version=0.29071.0)
...
Oct 02 09:15:32 love-1-test.example.com pacemaker-controld [3325405] (do_state_transition) info: State transition S_TRANSITION_ENGINE -> S_ELECTION | input=I_ELECTION cause=C_FSA_INTERNAL origin=do_cib_replaced

^ Something modifies the configuration with a full replacement (likely a pcs
command, I'm guessing in a script or cron somewhere), which triggers a new DC
election

Oct 02 09:15:33 love-1-test.example.com pacemaker-controld [3325405] (process_graph_event) info: Transition 14127 action 41 (kvm-16_migrate_to_0 on love-2-test.example.com) confirmed: ok | rc=0 call-id=19369

^ The migration completes successfully and is recorded in the resource history

Oct 02 09:15:33 love-1-test.example.com pacemaker-controld [3325405] (controld_delete_node_state) info: Deleting resource history for node love-2-test.example.com (via CIB call 142828) | xpath=//node_state[@uname='love-2-test.example.com']/lrm

^ However, the node's resource history is immediately erased due to the
election

Oct 02 09:15:33 love-1-test.example.com pacemaker-based [3325400] (cib_perform_op) info: ++ <lrm_resource id="kvm-16" type="VirtualDomain" class="ocf" provider="heartbeat">
Oct 02 09:15:33 love-1-test.example.com pacemaker-based [3325400] (cib_perform_op) info: ++ <lrm_rsc_op id="kvm-16_last_failure_0" operation_key="kvm-16_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.11.0" transition-key="33:14062:7:4a7f8335-f28c-4644-9e9d-fa408252ce19" transition-magic="0:0;33:14062:7:4a7f8335-f28c-4644-9e9d-fa408252ce19" exit-reason="" on_node="love-2-test.example.com" call-id="19236" rc-code="0" op-status=
Oct 02 09:15:33 love-1-test.example.com pacemaker-based [3325400] (cib_perform_op) info: ++ <lrm_rsc_op id="kvm-16_last_0" operation_key="kvm-16_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.11.0" transition-key="36:14123:0:4a7f8335-f28c-4644-9e9d-fa408252ce19" transition-magic="0:0;36:14123:0:4a7f8335-f28c-4644-9e9d-fa408252ce19" exit-reason="" on_node="love-2-test.example.com" call-id="19364" rc-code="0" op-status="0" interval
Oct 02 09:15:33 love-1-test.example.com pacemaker-based [3325400] (cib_perform_op) info: ++ </lrm_resource>

^ When the node updates its resource history, it doesn't include the migration
result, so the cluster thinks the migration still needs to be done

I suspect this is a timing issue: the election-triggered history refresh and
the migration action happen at the same time. The sequence appears to be:

1. The migration command completes, and a new DC election is initiated around
   the same time.
2. For the election, the node grabs its resource history and sends it to the
   DC (before the migration result has been recorded).
3. For the migration, the node receives the result and records it in the
   resource history.
4. The DC receives the originally sent resource history and overwrites the
   (newer) resource history in the CIB with it.

The fix for bz2000595 (erasing the old history and overwriting it with the new
in one operation) won't help here, since the problem is a result arriving
after the new history has already been grabbed.

I'm not sure at this point what the fix should be. Perhaps the newly elected
DC should re-request the resource history if the most recent action result
arrived after the timestamp in the history.
The best workaround in the meantime will be to identify what's repeatedly
replacing the CIB configuration, and see if there's a way to reduce that
frequency.
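One way to spot the replacements (a sketch; the log path is the pacemaker 2.x
default on RHEL 8 and may differ on your systems) is to grep for the
cib_replace operations visible in the logs above; the origin= field shows
which node and client issued each replacement:

    # Full-CIB replacements appear as cib_replace operations; the origin=
    # field identifies the node and client (e.g. cibadmin) that sent them.
    grep 'Completed cib_replace operation' /var/log/pacemaker/pacemaker.log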
thanks again for the detailed analysis. I run scripts to create/migrate/delete
VMs at two nodes for testing cluster stability. the scripts lock each other,
so at any time only one "pcs" command can run in the whole cluster.
unfortunately the result needs time to spread, so locking didn't prevent the
race. to work around bz2000595 I run "crm_mon -1" multiple times until I get
the correct output. but this kind of race seems hard to work around.
One possibility would be to use the --wait option with the pcs commands (the ones that support it). That will wait until the cluster settles before returning. Another way of doing the same thing would be to run "crm_resource --wait".
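For example (a sketch only; the resource and node names are taken from the
logs above, and the 60-second timeout is arbitrary):

    # Move the resource and block until the cluster settles; pcs reports an
    # error if the cluster has not settled within the timeout.
    pcs resource move kvm-16 love-2-test.example.com --wait=60

    # Alternatively, after any pcs command, wait for all pending actions:
    crm_resource --wait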
indeed, pcs has many commands with a "--wait" option. I will try to test with
it. thanks a lot for the hint!
Fixed upstream by commit 39a497e
thanks a lot for the fix! the story in the commit message is a good read. it
is great that we no longer need to keep this race in mind when doing jobs.
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
Why does it keep doing this...
Verified as SanityOnly in pacemaker-2.1.6-1.el8.