Bug 2228933

Summary:	Race condition when DC and attribute writer are both shutting down
Product:	Red Hat Enterprise Linux 9	Reporter:	Ken Gaillot <kgaillot>
Component:	pacemaker	Assignee:	Ken Gaillot <kgaillot>
Status:	CLOSED ERRATA	QA Contact:	cluster-qe <cluster-qe>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	9.2	CC:	cfeist, cluster-maint, jmarcian, jrehova, msmazova
Target Milestone:	rc	Keywords:	Triaged, ZStream
Target Release:	9.3	Flags:	pm-rhel: mirror+
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:	pacemaker-2.1.6-9.el9	Doc Type:	Bug Fix
Doc Text:	Cause: A node's attribute manager writes all its transient node attributes from memory to the CIB after winning the election for attribute writer, even if its node has requested shutdown. Consequence: If a node is DC, requests shutdown, and wins the attribute writer election after its controller has left the cluster but before its attribute manager has left, it can write out its shutdown attribute to the CIB. The next time it rejoins the cluster, it will be immediately shut down. Fix: A node's attribute manager should not write out its attributes after winning an election if shutdown has been requested for its node. Result: A leaving DC node does not have an unexpected shutdown the next time it rejoins.	Story Points:	---
Clone Of:
Clones:	2228955 2229014		Environment:
Last Closed:	2023-11-07 08:23:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:	2.1.7
Embargoed:
Bug Depends On:
Bug Blocks:	2228955, 2229014

Description Ken Gaillot 2023-08-03 16:57:54 UTC

Description of problem:

Pacemaker consists of multiple daemons, including the controller and the attribute manager, which both elect one node to have a special role (the Designated Controller a.k.a. DC and the attribute writer).

When a node needs to be shut down, a "shutdown" transient node attribute is created for it.

Transient node attributes are stored both in the CIB and in attribute manager memory. When the DC leaves the cluster, all other nodes remove its transient node attributes from the CIB, including "shutdown". When any node's attribute manager leaves the cluster, its transient node attributes are removed from memory by all other nodes' attribute managers.

When a node wins the attribute writer election, it writes out all its transient node attributes to the CIB.

This creates a race condition when different nodes are the DC and the writer, and both nodes are shutting down while other nodes remain up.

When the DC's controller exits, the remaining nodes erase its transient attributes from the CIB. However, the DC's attribute manager may still be up at this point. If the current attribute writer leaves at this time, the former DC's attribute manager may win the election for a new attribute writer and write its in-memory attributes, including "shutdown", back to the CIB.

Since the shutdown attribute is written back out, the next time the node joins the cluster, it will immediately be shut down.
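
The sequence above can be sketched as a toy model. This is not Pacemaker code: the class and function names (Cluster, AttrManager, simulate) and the skip_write_on_shutdown flag are hypothetical, and the model collapses the real messaging between daemons into direct dictionary updates. It only illustrates why writing out in-memory attributes after winning the election restores "shutdown", and how skipping the write when shutdown has been requested avoids that.

```python
"""Toy model of the race. All names here are hypothetical, not Pacemaker APIs."""


class Cluster:
    def __init__(self):
        # node name -> transient attributes currently stored in the CIB
        self.cib = {}


class AttrManager:
    def __init__(self, node, cluster, skip_write_on_shutdown):
        self.node = node
        self.cluster = cluster
        self.memory = {}  # this node's transient attributes held in memory
        self.skip_write_on_shutdown = skip_write_on_shutdown

    def set_attr(self, name, value):
        # Simplification: the attribute lands in memory and in the CIB at once.
        self.memory[name] = value
        self.cluster.cib.setdefault(self.node, {})[name] = value

    def on_election_won(self):
        # Guard modeled on the fix described in this report: do not write
        # attributes out if shutdown has been requested for this node.
        if self.skip_write_on_shutdown and "shutdown" in self.memory:
            return
        self.cluster.cib.setdefault(self.node, {}).update(self.memory)


def simulate(skip_write_on_shutdown):
    """Return True if a stale "shutdown" attribute ends up in the CIB."""
    cluster = Cluster()
    dc_attrd = AttrManager("node1", cluster, skip_write_on_shutdown)

    dc_attrd.set_attr("shutdown", "1")  # shutdown is requested for the DC
    cluster.cib.pop("node1", None)      # DC controller exits; peers erase its attributes
    dc_attrd.on_election_won()          # old writer left; DC's attribute manager wins

    return "shutdown" in cluster.cib.get("node1", {})


if __name__ == "__main__":
    print("without guard, stale shutdown in CIB:", simulate(False))
    print("with guard, stale shutdown in CIB:", simulate(True))
```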

Version-Release number of selected component (if applicable):

How reproducible: Difficult

Steps to Reproduce:

1. Configure a cluster of at least 5 nodes (so that quorum can be retained after shutting down 2).

2. Ensure that different nodes are DC and attribute writer. The DC can be determined with "crmadmin -D". The attribute writer can be determined by searching /var/log/pacemaker/pacemaker.log on all nodes for the most recent "Recorded local node as attribute writer" message. If one node holds both roles, restart the existing winner to force a new election, and repeat until the roles are on different nodes.

3. Shut down the DC and attribute writer at the same time.
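
For step 2, a small helper can pull the most recent writer message out of a node's log. This is a minimal sketch: the function names are hypothetical, only the message text and the log path come from this report, and no assumption is made about the rest of the log line format. It has to be run on every node, and the results compared by hand to see which node recorded the message last.

```python
WRITER_MESSAGE = "Recorded local node as attribute writer"


def last_writer_message(log_lines):
    """Return the last line containing the writer message, or None if absent."""
    last = None
    for line in log_lines:
        if WRITER_MESSAGE in line:
            last = line.rstrip("\n")
    return last


def last_writer_message_in_file(path="/var/log/pacemaker/pacemaker.log"):
    """Scan one node's log file; the default path is the one named in step 2."""
    with open(path, errors="replace") as log:
        return last_writer_message(log)


if __name__ == "__main__":
    print(last_writer_message_in_file())
```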

Actual results: Sometimes, the CIB will still have a "shutdown" node attribute for the former DC. This can be checked with "pcs cluster cib" and looking under "transient_attributes" in the "node_state" section for the node.

Expected results: The "shutdown" node attribute for the former DC is never present after it leaves the cluster.
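
The check described under "Actual results" can be scripted against saved "pcs cluster cib" output. The sketch below is an illustration only: the function name is hypothetical, and it assumes that each "node_state" element carries a "uname" attribute and that transient attributes appear as "nvpair" elements somewhere beneath "transient_attributes".

```python
import sys
import xml.etree.ElementTree as ET


def nodes_with_shutdown_attr(cib_xml):
    """Return the uname of each node_state that has a "shutdown" transient attribute."""
    root = ET.fromstring(cib_xml)
    found = []
    for state in root.iter("node_state"):
        # Look at every nvpair at any depth below transient_attributes.
        for nvpair in state.iterfind("transient_attributes//nvpair"):
            if nvpair.get("name") == "shutdown":
                found.append(state.get("uname"))
                break
    return found


if __name__ == "__main__":
    # Expects the CIB XML on standard input.
    print(nodes_with_shutdown_attr(sys.stdin.read()))
```

An empty list for the former DC after it has left the cluster matches the expected result; its name appearing in the list matches the failure.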

Additional info: If this can't be reproduced, it can be sanity-checked only.

Comment 3 Ken Gaillot 2023-08-03 22:36:40 UTC

Fixed upstream as of commit f5263c94

Comment 5 Ken Gaillot 2023-08-09 14:40:15 UTC

*** Bug 2230133 has been marked as a duplicate of this bug. ***

Comment 10 Ken Gaillot 2023-08-28 15:20:49 UTC

The original fix was found to be incomplete. The completed fix has been merged in upstream main branch as of commit 58400e27. New build coming soon.

Comment 12 jrehova 2023-09-07 14:26:17 UTC

Marking Verified as SanityOnly in version pacemaker-2.1.6-9.el9.x86_64.

Comment 14 errata-xmlrpc 2023-11-07 08:23:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:6314