Bug 1193499

Summary: member weirdness when adding/removing nodes
Product: Red Hat Enterprise Linux 6 Reporter: Radek Steiger <rsteiger>
Component: pacemaker Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: high Docs Contact:
Priority: high    
Version: 6.7 CC: abeekhof, cfeist, cluster-maint, cluster-qe, fdinitto, jkortus, kgaillot, kwenning
Target Milestone: rc   
Target Release: 6.8   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pacemaker-1.1.14-1.1.el6 Doc Type: Bug Fix
Doc Text:
Cause: Removed nodes were not consistently purged from all Pacemaker components' peer caches.
Consequence: Removing and adding nodes can result in a node ID being recycled, which should be OK but caused daemon crashes due to conflicting information from the former node not being purged from the peer cache.
Fix: Peer cache management has been overhauled so that the libcluster library handles node reaping itself rather than relying on the individual components to do it correctly.
Result: Recycling node IDs should not cause any problems.
Story Points: ---
Clone Of: 1162727 Environment:
Last Closed: 2016-05-10 23:51:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1162727    
Bug Blocks:    

Description Radek Steiger 2015-02-17 13:25:44 UTC
+++ This bug was initially created as a clone of Bug #1162727 +++


> Description of problem:

Adding and removing nodes may cause an ID collision in RHEL6 pacemaker, just like in its RHEL7 counterpart. When a newly added node is assigned an ID that was previously used by a different node, a collision occurs somewhere in the pacemaker caches. I can see messages like these in the logs:

Feb 17 13:40:40 virt-091 stonith-ng[6415]:  warning: crm_find_peer: Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3
Feb 17 13:40:42 virt-091 attrd[6417]:  warning: crm_find_peer: Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3
Feb 17 13:40:42 virt-091 cib[6414]:  warning: crm_find_peer: Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3
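
The condition behind these warnings can be illustrated in miniature. The following is a minimal Python sketch, not Pacemaker's actual code: PeerCache, add_peer, and purge are hypothetical names, and the cache is assumed to be a plain map from node name to node ID. It shows how a stale entry left behind after a removal collides with a new node that reuses the ID, and how purging the stale entry first avoids the conflict.

```python
class PeerCache:
    """Toy peer cache mapping node name -> cluster node ID (hypothetical)."""

    def __init__(self):
        self.peers = {}

    def add_peer(self, uname, nodeid):
        """Add a peer; return names of other cached peers holding the same ID."""
        conflicts = sorted(
            name for name, cached_id in self.peers.items()
            if cached_id == nodeid and name != uname
        )
        self.peers[uname] = nodeid
        return conflicts

    def purge(self, uname):
        """Drop a removed node so its ID can be reused safely."""
        self.peers.pop(uname, None)


# Without purging: virt-093 has left the cluster but is still cached.
stale = PeerCache()
stale.add_peer("virt-093", 3)
conflicts_without_purge = stale.add_peer("virt-096", 3)

# With purging: the removed node is reaped before its ID is recycled.
clean = PeerCache()
clean.add_peer("virt-093", 3)
clean.purge("virt-093")
conflicts_with_purge = clean.add_peer("virt-096", 3)

print(conflicts_without_purge)  # ['virt-093']
print(conflicts_with_purge)     # []
```

In this toy model the first case corresponds to the "share the same cluster nodeid" situation, while the second corresponds to a cache from which the removed node has been purged.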


Chronology of reproducer steps:

1) Tue Feb 17 13:39:34 CET 2015
pcs cluster node remove virt-096 && cman_tool version -r -S

2) Tue Feb 17 13:39:57 CET 2015
pcs cluster node remove virt-093 && cman_tool version -r -S

3) Tue Feb 17 13:40:28 CET 2015
pcs cluster node add virt-096 --start


The initial node ID distribution:

1    virt-091
2    virt-092
3    virt-093
4    virt-094
5    virt-095
6    virt-096

The final ID distribution after all additions/removals:

1    virt-091
2    virt-092
3    virt-096
4    virt-094
5    virt-095
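
The ID recycling can be replayed with a short Python sketch. It assumes that a newly added node receives the lowest unused node ID; that rule matches the outcome reported here but is not confirmed by this report as the actual allocation logic, and lowest_free_id is a hypothetical helper.

```python
def lowest_free_id(members):
    """Return the smallest positive ID not used by any member (assumed rule)."""
    used = set(members.values())
    candidate = 1
    while candidate in used:
        candidate += 1
    return candidate


# Initial distribution from the report: virt-091..virt-096 hold IDs 1..6.
members = {f"virt-09{i}": i for i in range(1, 7)}

# Steps 1 and 2: remove virt-096, then virt-093.
del members["virt-096"]
del members["virt-093"]

# Step 3: re-add virt-096; under the assumed rule it receives ID 3,
# which previously belonged to virt-093.
members["virt-096"] = lowest_free_id(members)

for name, nodeid in sorted(members.items(), key=lambda item: item[1]):
    print(nodeid, name)
```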

Comment 1 Radek Steiger 2015-02-17 13:26:30 UTC
> Version-Release number of selected component (if applicable):

pacemaker-1.1.12-4.el6.x86_64
pcs-0.9.138-1.el6.x86_64
corosync-1.4.7-1.el6.x86_64
cman-3.0.12.1-68.el6.x86_64

Comment 4 Andrew Beekhof 2015-03-31 01:29:41 UTC
Bug #1162727 has the list of patches required here.

Comment 5 Andrew Beekhof 2015-04-10 01:22:07 UTC
Patches:

0eb41da: Fix: attrd: Remove offline nodes from node cache for "peer-remove" requests 
ba8d3cd: Fix: membership: Prevent use-after-free in reap_crm_member() 
68d5738: Fix: cluster: Remove unknown offline nodes with conflicting unames from node cache 
c97575b: Fix: crmd: Remove state of unknown nodes with conflicting unames from CIB 
50ffa21: Fix: crmd: Remove unknown nodes with conflicting unames from CIB 
ddccf97: Fix: Membership: Detect and resolve nodes that change their ID 
371e79c: Fix: attrd: Clean out the node cache when requested by the admin 
b658b2b: Fix: attrd: Simplify how node deletions happen 
bf15d36: Fix: cib: Avoid nodeid conflicts we don't care about 
30a1ba9: Fix: fencing: Allow nodes to be purged from the member cache 
c8b413f: Fix: crm_node: Correctly remove nodes from the CIB by nodeid 
0b98ef1: Fix: stonith-ng: Correctly track node state 
72b3a9a: Fix: stonith-ng: No reply is needed for CRM_OP_RM_NODE_CACHE 
e48a7a0: Fix: cib: Correctly track node state 
f51c05d: Fix: cluster: Invoke crm_remove_conflicting_peer() only when the new node's uname is being assigned in the node cache 

and the lib/cluster portion of:

8727a4f: Feature: Allow fail-counts to be removed en-mass when the new attrd is in operation

Comment 8 Ken Gaillot 2015-06-02 19:10:10 UTC
A fix for upstream is pending testing but will not make it in time for 6.7.

Comment 10 Ken Gaillot 2015-07-29 20:40:14 UTC
Fixed upstream as of commit 49fd91f.

Comment 18 errata-xmlrpc 2016-05-10 23:51:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0856.html