Red Hat Bugzilla – Bug 1338623
pacemaker does not flush the attrd cache fully after a crm_node -R node removal
Last modified: 2017-04-14 05:52:32 EDT
Fixed upstream by commits 4c4d8c5 (reverting da17fd0) and c3c3d98 (properly implementing da17fd0's intended purpose).

QA: The issue reported here was caused by a bug introduced by the fix for BZ#1299348, so both this issue and that one need to be tested (to be sure the fix for this one does not break the other).

For the issue here, a reproducer is:

1. Set up a cluster of at least three nodes, and have an extra node available to add later.

2. Pick one node to remove. Make a note of its node ID in /etc/corosync/corosync.conf (a check sketch follows after this list). Set a transient node attribute for this node, for example:
   crm_attribute -N $NODENAME -l reboot -n QA -v 1

3. Stop that node:
   pcs cluster stop $NODENAME

4. On each of the remaining nodes, remove it from the local corosync configuration:
   pcs cluster localnode remove $NODENAME && pcs cluster reload corosync

5. On any node, purge it from Pacemaker's caches:
   crm_node -R $NODENAME --force

6. Add the new node, then check /etc/corosync/corosync.conf to make sure it reuses the original node's ID:
   pcs cluster node add $NEWNODE

7. Start the cluster on the new node:
   pcs cluster start $NEWNODE

8. Try to set the same attribute as before on the new node:
   crm_attribute -N $NEWNODE -l reboot -n QA -v 2

9. Check the value of the attribute you just set:
   crm_attribute -N $NEWNODE -l reboot -n QA -G

Before the fix, the last command will not return a value. After the fix, it will return the proper value.

To test the other issue (BZ#1299348), configure a cluster with a Pacemaker Remote node, then run "systemctl stop pacemaker_remote" on the node while it is in the cluster. All resources should be moved off the node, and pacemaker_remoted should stop gracefully, without the node being fenced. Additionally, the node should rejoin the cluster after pacemaker_remote is started again (within reconnect_interval if set, or cluster-recheck-interval otherwise); a command sketch follows below.
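For checking the node ID in steps 2 and 6, one option is to inspect the nodelist directly. This is a sketch; the exact corosync.conf layout and cmap key names can vary between corosync versions:

   # Show node name/ID pairs from the local corosync configuration
   grep -E 'ring0_addr|nodeid' /etc/corosync/corosync.conf

   # Or query the running corosync (corosync 2.x)
   corosync-cmapctl | grep '^nodelist\.node\.'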
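For the BZ#1299348 check, a minimal command sequence could look like the following sketch ($REMOTENODE is a placeholder for the Pacemaker Remote node's name):

   # On $REMOTENODE: stop the remote daemon while it is an active cluster node
   systemctl stop pacemaker_remote

   # On any cluster node: resources should move off $REMOTENODE and the node
   # must not be fenced; check cluster status with
   crm_mon -1

   # Start the daemon again; the node should rejoin within reconnect_interval
   # (if set) or cluster-recheck-interval otherwise
   systemctl start pacemaker_remote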
1. Set up a cluster of at least three nodes, and have an extra node available to add later.

> [root@virt-166 ~]# pcs cluster setup --start --name bz1338623 virt-{166,167,168}

2. Pick one node to remove. Make a note of its node ID in /etc/corosync/corosync.conf. Set a transient node attribute for this node, for example: crm_attribute -N $NODENAME -l reboot -n QA -v 1

(corosync node ID for virt-167 is 2)

> [root@virt-166 ~]# crm_attribute -N virt-167 -l reboot -n QA -v 1

3. Stop that node: pcs cluster stop $NODENAME

> [root@virt-166 ~]# pcs cluster stop virt-167

4. On each of the remaining nodes, remove it from the local corosync configuration: pcs cluster localnode remove $NODENAME && pcs cluster reload corosync

> [root@virt-166 ~]# pcs cluster localnode remove virt-167 && pcs cluster reload corosync
> [root@virt-168 ~]# pcs cluster localnode remove virt-167 && pcs cluster reload corosync

5. On any node, purge it from Pacemaker's caches: crm_node -R $NODENAME --force

> [root@virt-166 ~]# crm_node -R virt-167 --force

(node ID 2 removed from corosync.conf)

6. Add the new node, then check /etc/corosync/corosync.conf to make sure it reuses the original node's ID: pcs cluster node add $NEWNODE

> [root@virt-166 ~]# pcs cluster node add virt-169

(virt-169 reuses corosync node ID 2)

7. Start the cluster on the new node: pcs cluster start $NEWNODE

> [root@virt-166 ~]# pcs cluster start virt-169

8. Try to set the same attribute as before on the new node: crm_attribute -N $NEWNODE -l reboot -n QA -v 2

(old attribute not present for the new node with the same corosync ID)

> [root@virt-166 ~]# crm_attribute -N virt-169 -l reboot -n QA -G
> scope=status name=QA value=(null)
> Error performing operation: No such device or address
> [root@virt-166 ~]# crm_attribute -N virt-169 -l reboot -n QA -v 2

9. Check the value of the attribute you just set: crm_attribute -N $NEWNODE -l reboot -n QA -G

> [root@virt-166 ~]# crm_attribute -N virt-169 -l reboot -n QA -G
> scope=status name=QA value=2

Querying the transient attribute of a new node with a reused corosync node ID returns the expected value.

Marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7
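As an additional cross-check (not part of the verification above), the attribute can also be queried directly from attrd with attrd_updater; the exact output format may differ between pacemaker versions:

   # Query the transient attribute straight from attrd for the new node
   attrd_updater -Q -n QA -N virt-169
   # Expected (approximately): name="QA" host="virt-169" value="2"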
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html