Bug 2228955

Summary: Race condition when DC and attribute writer are both shutting down
Product: Red Hat Enterprise Linux 8
Reporter: Ken Gaillot <kgaillot>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 8.4
CC: cfeist, cluster-maint, jrehova, msmazova
Target Milestone: rc
Keywords: Triaged, ZStream
Target Release: 8.9
Flags: pm-rhel: mirror+
Hardware: All
OS: All
Whiteboard:
Fixed In Version: pacemaker-2.1.6-8.el8
Doc Type: Bug Fix
Doc Text:
Cause: A node's attribute manager writes all its transient node attributes from memory to the CIB after winning the election for attribute writer, even if its node has requested shutdown.
Consequence: If a node is DC, requests shutdown, and wins the attribute writer election after its controller has left the cluster but before its attribute manager has left, it can write out its shutdown attribute to the CIB. The next time it rejoins the cluster, it will be immediately shut down.
Fix: A node's attribute manager should not write out its attributes after winning an election if shutdown has been requested for its node.
Result: A leaving DC node does not have an unexpected shutdown the next time it rejoins.
Story Points: ---
Clone Of: 2228933
: 2229013
Environment:
Last Closed: 2023-11-14 15:32:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version: 2.1.7
Embargoed:
Bug Depends On: 2228933
Bug Blocks: 2229013

Description Ken Gaillot 2023-08-03 18:06:09 UTC
+++ This bug was initially created as a clone of Bug #2228933 +++

Description of problem:

Pacemaker consists of multiple daemons, including the controller and the attribute manager. Each of these elects one node to a special role: the controller elects the Designated Controller (a.k.a. DC), and the attribute manager elects the attribute writer.

When a node needs to be shut down, a "shutdown" transient node attribute is created for it.

Transient node attributes are stored both in the CIB and in attribute manager memory. When the DC leaves the cluster, all other nodes remove its transient node attributes from the CIB, including "shutdown". When any node's attribute manager leaves the cluster, its transient node attributes are removed from memory by all other nodes' attribute managers.

When a node wins the attribute writer election, it writes out all its transient node attributes to the CIB.

This creates a race condition when different nodes are the DC and the writer, and both nodes are shutting down while other nodes remain up.

When the DC's controller exits, the remaining nodes erase the DC's transient attributes from the CIB. However, the DC node's attribute manager may still be up at this point. If the current attribute writer leaves at this time, the former DC's attribute manager may win the election for the new attribute writer and write its attributes, including "shutdown", back to the CIB.

Since the shutdown attribute is written back out, the next time the node joins the cluster, it will immediately be shut down.
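
The sequence above can be sketched as a toy model. This is only an illustration of the ordering problem and of the behavior described in the Doc Text fix; it is not Pacemaker code, and the class name AttrManager, the function simulate, the node name "dc-node", and the dict used as a stand-in for the CIB are all hypothetical.

```python
class AttrManager:
    """Toy stand-in for one node's attribute manager (not Pacemaker code)."""

    def __init__(self, node, attrs, shutdown_requested=False):
        self.node = node
        self.attrs = dict(attrs)          # transient attributes held in memory
        self.shutdown_requested = shutdown_requested

    def on_election_won(self, cib, skip_if_shutting_down):
        """Write in-memory attributes to the CIB after winning the election."""
        if skip_if_shutting_down and self.shutdown_requested:
            return                        # fixed behavior: a leaving node writes nothing
        cib.setdefault(self.node, {}).update(self.attrs)


def simulate(skip_if_shutting_down):
    """Replay the race; return the CIB contents (node -> transient attributes)."""
    shutdown = {"shutdown": "1692793538"}
    cib = {"dc-node": dict(shutdown)}
    dc_attrd = AttrManager("dc-node", shutdown, shutdown_requested=True)

    # 1. The DC's controller exits, so the other nodes erase its
    #    transient attributes from the CIB.
    cib.pop("dc-node", None)

    # 2. The old attribute writer leaves; the DC's attribute manager is
    #    still running and wins the new election.
    dc_attrd.on_election_won(cib, skip_if_shutting_down)
    return cib


print(simulate(skip_if_shutting_down=False))  # stale "shutdown" is written back
print(simulate(skip_if_shutting_down=True))   # CIB stays clean
```

With the check disabled, the erased "shutdown" entry reappears in the CIB, which is what causes the node to be shut down as soon as it rejoins.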


Version-Release number of selected component (if applicable):


How reproducible: Difficult


Steps to Reproduce:

1. Configure a cluster of at least 5 nodes (so that quorum can be retained after shutting down 2).

2. Ensure that different nodes are DC and attribute writer. The DC can be determined with "crmadmin -D". The attribute writer can be determined by searching /var/log/pacemaker/pacemaker.log on all nodes for the most recent "Recorded local node as attribute writer" message. If the same node holds both roles, restart the existing winner to force a new election, and repeat until the roles are on different nodes.

3. Shut down the DC and attribute writer at the same time.
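
For step 2, picking the node with the most recent "Recorded local node as attribute writer" message can be scripted. The following is a minimal sketch, assuming the last matching log line from each node has already been collected (for example with grep and tail) and that each line begins with a timestamp of the form "Aug 23 14:19:13". The function name latest_writer and the year parameter are hypothetical; the year is needed only because the log timestamp does not include one.

```python
from datetime import datetime

MARKER = "Recorded local node as attribute writer"


def latest_writer(last_lines, year=2023):
    """Return the node whose last marker line has the newest timestamp.

    last_lines maps node name -> that node's last matching log line
    (or an empty string if the node never logged the message).
    """
    best_node, best_time = None, None
    for node, line in last_lines.items():
        if MARKER not in line:
            continue
        # The first three fields are the timestamp, e.g. "Aug 23 14:19:13".
        stamp = " ".join(line.split()[:3])
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if best_time is None or when > best_time:
            best_node, best_time = node, when
    return best_node


lines = {
    "node-a": "Aug 23 14:19:13 node-a pacemaker-attrd [100] (attrd_declare_winner) "
              "notice: Recorded local node as attribute writer (was unset)",
    "node-b": "Aug 23 14:19:21 node-b pacemaker-attrd [200] (attrd_declare_winner) "
              "notice: Recorded local node as attribute writer (was unset)",
}
print(latest_writer(lines))
```

If two nodes logged the message in the same second, this sketch returns whichever it saw first, so such a result should be confirmed by hand.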

Actual results: Sometimes, the CIB will still have a "shutdown" node attribute for the former DC. This can be checked by running "pcs cluster cib" and looking under "transient_attributes" in the "node_state" section for the node.


Expected results: The "shutdown" node attribute for the former DC is never present after it leaves the cluster.
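
The check for a leftover "shutdown" attribute can also be automated. The following is a minimal sketch, assuming the complete CIB XML (as printed by "pcs cluster cib") is available as a string and that each "transient_attributes" element carries the node ID in its "id" attribute. The function name nodes_with_shutdown is hypothetical, and the sample XML is a cut-down illustration rather than a real CIB.

```python
import xml.etree.ElementTree as ET


def nodes_with_shutdown(cib_xml):
    """Map node ID -> "shutdown" value for every node that still has one."""
    root = ET.fromstring(cib_xml)
    found = {}
    for attrs in root.iter("transient_attributes"):
        for nvpair in attrs.iter("nvpair"):
            if nvpair.get("name") == "shutdown":
                found[attrs.get("id")] = nvpair.get("value")
    return found


# Cut-down example in the shape of a CIB status section (illustrative only).
SAMPLE_CIB = """
<cib>
  <status>
    <node_state id="1">
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-.feature-set" name="#feature-set" value="3.17.4"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
    <node_state id="2">
      <transient_attributes id="2">
        <instance_attributes id="status-2">
          <nvpair id="status-2-shutdown" name="shutdown" value="1692793538"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>
"""

print(nodes_with_shutdown(SAMPLE_CIB))
```

An empty result after the former DC has left the cluster corresponds to the expected behavior; any entry for that node indicates the bug.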


Additional info: If this can't be reproduced, it can be sanity-checked only.

Comment 3 Ken Gaillot 2023-08-03 22:34:25 UTC
Fixed upstream as of commit f5263c94

Comment 8 jrehova 2023-08-25 08:39:41 UTC
Version of pacemaker:
> [root@virt-543:~]# rpm -q pacemaker
> pacemaker-2.1.6-7.el8.x86_64

Determining the DC node:
> [root@virt-543:~]# crmadmin -D
> Designated Controller is: virt-544

Determining the attribute writer node --> virt-546:
> [root@virt-543:~]# for n in 543 544 545 546 547; do echo $n; qarsh -l root virt-$n "grep 'Recorded local node as attribute writer' /var/log/pacemaker/pacemaker.log | tail -1"; done
> 543
> Aug 23 14:19:13 virt-543 pacemaker-attrd     [65920] (attrd_declare_winner) 	notice: Recorded local node as attribute writer (was unset)
> 544
> Aug 23 14:19:13 virt-544 pacemaker-attrd     [65845] (attrd_declare_winner) 	notice: Recorded local node as attribute writer (was unset)
> 545
> Aug 23 14:19:13 virt-545 pacemaker-attrd     [65698] (attrd_declare_winner) 	notice: Recorded local node as attribute writer (was unset)
> 546
> Aug 23 14:19:21 virt-546 pacemaker-attrd     [65700] (attrd_declare_winner) 	notice: Recorded local node as attribute writer (was unset)
> 547
> Aug 23 14:19:13 virt-547 pacemaker-attrd     [65497] (attrd_declare_winner) 	notice: Recorded local node as attribute writer (was unset)

Rebooting both DC and attribute writer nodes at the same time:
> [root@virt-544 ~]# reboot
> [root@virt-546 ~]# reboot

Result: "shutdown" attribute is present in the CIB.

> [root@virt-543:~]# pcs cluster cib | xmllint --xpath '//node_state/transient_attributes' -
> <transient_attributes id="1">
>         <instance_attributes id="status-1">
>           <nvpair id="status-1-.feature-set" name="#feature-set" value="3.17.4"/>
>         </instance_attributes>
>       </transient_attributes><transient_attributes id="3">
>         <instance_attributes id="status-3">
>           <nvpair id="status-3-.feature-set" name="#feature-set" value="3.17.4"/>
>         </instance_attributes>
>       </transient_attributes><transient_attributes id="2">
>         <instance_attributes id="status-2">
>           <nvpair id="status-2-.feature-set" name="#feature-set" value="3.17.4"/>
>           <nvpair id="status-2-shutdown" name="shutdown" value="1692793538"/>
>         </instance_attributes>
>       </transient_attributes><transient_attributes id="5">
>         <instance_attributes id="status-5">
>           <nvpair id="status-5-.feature-set" name="#feature-set" value="3.17.4"/>
>         </instance_attributes>
>       </transient_attributes>

Comment 10 Ken Gaillot 2023-08-28 15:19:28 UTC
The original fix was found to be incomplete. The completed fix has been merged in upstream main branch as of commit 58400e27.

Comment 12 jrehova 2023-09-07 13:17:17 UTC
Marking Verified in version pacemaker-2.1.6-8.el8.x86_64.

Comment 14 errata-xmlrpc 2023-11-14 15:32:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:6970