Bug 1338623

Summary: pacemaker does not flush the attrd cache fully after a crm_node -R node removal
Product: Red Hat Enterprise Linux 7 Reporter: Michele Baldessari <michele>
Component: pacemaker Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact: Steven J. Levine <slevine>
Priority: urgent    
Version: 7.2 CC: abeekhof, cfeist, cluster-maint, dbecker, dciabrin, dmacpher, hbrock, jkortus, kgaillot, mburns, mcornea, michele, mmuehlfe, morazi, phagara, rhel-osp-director-maint, sasha, srevivo
Target Milestone: rc Keywords: Documentation, ZStream
Target Release: 7.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7 Doc Type: Release Note
Doc Text:
Pacemaker now removes node attributes from its memory when purging a node that has been removed from the cluster

Previously, Pacemaker's node attribute manager removed attribute values from its memory but not the attributes themselves when purging a node that had been removed from the cluster. As a result, if a new node was later added to the cluster with the same node ID, attributes that existed on the original node could not be set for the new node. With this update, Pacemaker purges the attributes themselves when removing a node, and a new node with the same ID encounters no problems with setting attributes.
Story Points: ---
Clone Of: 1326507
: 1344223 Environment:
Last Closed: 2016-11-03 18:59:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1326507    
Bug Blocks: 1286302, 1344223    

Comment 2 Ken Gaillot 2016-05-25 00:20:13 UTC
Fixed upstream by commits 4c4d8c5 (reverting da17fd0) and c3c3d98 (properly implementing da17fd0's intended purpose).

QA: The issue reported here was caused by a bug introduced by the fix for BZ#1299348, so both this issue and that one need to be tested (to be sure the fix for this one doesn't affect the other).

For the issue here, a reproducer is:

1. Set up a cluster of at least three nodes, and have an extra machine available to use as a new node later.

2. Pick one node to remove. Make a note of its node ID in /etc/corosync/corosync.conf. Set a transient node attribute for this node, for example: crm_attribute -N $NODENAME -l reboot -n QA -v 1

3. Stop that node: pcs cluster stop $NODENAME

4. On each of the remaining nodes, remove it from the local corosync configuration: pcs cluster localnode remove $NODENAME && pcs cluster reload corosync

5. On any node, purge it from Pacemaker's caches: crm_node -R $NODENAME --force

6. Add the new node, then check /etc/corosync/corosync.conf to make sure it reuses the original node's ID: pcs cluster node add $NEWNODE

7. Start the cluster on the new node: pcs cluster start $NEWNODE

8. Try to set the same attribute as before on the new node: crm_attribute -N $NEWNODE -l reboot -n QA -v 2

9. Check the value of the attribute you just set: crm_attribute -N $NEWNODE -l reboot -n QA -G

Before the fix, the last command will not return a value. After the fix, it will return the proper value.
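
Steps 2 and 6 both rely on reading a node ID out of /etc/corosync/corosync.conf by hand. A minimal sketch of a helper for that lookup follows. The function name corosync_nodeid is hypothetical, and the sketch assumes each node { } block in the nodelist carries a ring0_addr line matching the node name and a nodeid line, which may not hold for every configuration.

```shell
#!/bin/sh
# Hypothetical helper (not part of pcs or Pacemaker): print the corosync
# nodeid recorded for a node name.
# Assumption: nodelist entries look like
#     node {
#         ring0_addr: NAME
#         nodeid: N
#     }
# Usage: corosync_nodeid NODENAME [CONF_FILE]
corosync_nodeid() {
    awk -v want="$1" '
        # Start of a node block: reset what we have collected so far.
        $1 == "node" && $2 == "{" { in_node = 1; addr = ""; id = ""; next }
        in_node && $1 == "ring0_addr:" { addr = $2 }
        in_node && $1 == "nodeid:"     { id = $2 }
        # End of a node block: report the ID if the address matched.
        in_node && $1 == "}" {
            if (addr == want && id != "") { print id; found = 1 }
            in_node = 0
        }
        # Exit non-zero when the node name was not found.
        END { exit (found ? 0 : 1) }
    ' "${2:-/etc/corosync/corosync.conf}"
}
```

Recording the output of corosync_nodeid $NODENAME before step 3 and comparing it with corosync_nodeid $NEWNODE after step 6 would confirm that the new node reuses the original ID.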

To test the other issue (BZ#1299348), configure a cluster with a Pacemaker Remote node, then run "systemctl stop pacemaker_remote" on the node while it is in the cluster. All resources should be moved off the node, and pacemaker_remoted should gracefully stop, without the node being fenced. Additionally, the node should rejoin the cluster after pacemaker_remote is started again (within reconnect_interval if set, or cluster-recheck-interval otherwise).

Comment 6 Patrik Hagara 2016-09-07 13:32:43 UTC
1. Set up a cluster of at least three nodes, and have an extra machine available to use as a new node later.

> [root@virt-166 ~]# pcs cluster setup --start --name bz1338623 virt-{166,167,168}

2. Pick one node to remove. Make a note of its node ID in /etc/corosync/corosync.conf. Set a transient node attribute for this node, for example: crm_attribute -N $NODENAME -l reboot -n QA -v 1

(corosync node ID for virt-167 is 2)
> [root@virt-166 ~]# crm_attribute -N virt-167 -l reboot -n QA -v 1

3. Stop that node: pcs cluster stop $NODENAME

> [root@virt-166 ~]# pcs cluster stop virt-167

4. On each of the remaining nodes, remove it from the local corosync configuration: pcs cluster localnode remove $NODENAME && pcs cluster reload corosync

> [root@virt-166 ~]# pcs cluster localnode remove virt-167 && pcs cluster reload corosync
> [root@virt-168 ~]# pcs cluster localnode remove virt-167 && pcs cluster reload corosync

5. On any node, purge it from Pacemaker's caches: crm_node -R $NODENAME --force

> [root@virt-166 ~]# crm_node -R virt-167 --force
(node ID 2 removed from corosync.conf)

6. Add the new node, then check /etc/corosync/corosync.conf to make sure it reuses the original node's ID: pcs cluster node add $NEWNODE

> [root@virt-166 ~]# pcs cluster node add virt-169
(virt-169 reuses corosync node ID 2)

7. Start the cluster on the new node: pcs cluster start $NEWNODE

> [root@virt-166 ~]# pcs cluster start virt-169

8. Try to set the same attribute as before on the new node: crm_attribute -N $NEWNODE -l reboot -n QA -v 2

(old attribute not present for new node with the same corosync ID)
> [root@virt-166 ~]# crm_attribute -N virt-169 -l reboot -n QA -G
> scope=status  name=QA value=(null)
> Error performing operation: No such device or address

> [root@virt-166 ~]# crm_attribute -N virt-169 -l reboot -n QA -v 2

9. Check the value of the attribute you just set: crm_attribute -N $NEWNODE -l reboot -n QA -G

> [root@virt-166 ~]# crm_attribute -N virt-169 -l reboot -n QA -G
> scope=status  name=QA value=2

Querying the transient attribute of a new node with a reused corosync node ID returns the expected value. Marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7.
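
For scripted verification, the query output shown above can be reduced to a pass/fail check. The following is a rough sketch only: attr_value is a hypothetical helper, and it assumes the "scope=... name=... value=..." line format seen in this comment, treating value=(null) as "not set".

```shell
#!/bin/sh
# Hypothetical helper: read "crm_attribute ... -G" style output on stdin
# and print the attribute value.
# Returns non-zero if no "value=" field is found or the value is "(null)".
attr_value() {
    # Keep only the text after " value=" on the first line that has it.
    _attr_value=$(sed -n 's/.* value=\(.*\)$/\1/p' | head -n 1)
    [ -n "$_attr_value" ] && [ "$_attr_value" != "(null)" ] || return 1
    printf '%s\n' "$_attr_value"
}
```

With the fix in place, piping the output of the step 9 query into attr_value would be expected to print 2 and return success, while the pre-set query from step 8 would return failure.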

Comment 8 errata-xmlrpc 2016-11-03 18:59:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html