Bug 1193499

Summary:	member weirdness when adding/removing nodes
Product:	Red Hat Enterprise Linux 6	Reporter:	Radek Steiger <rsteiger>
Component:	pacemaker	Assignee:	Andrew Beekhof <abeekhof>
Status:	CLOSED ERRATA	QA Contact:	cluster-qe <cluster-qe>
Severity:	high	Docs Contact:
Priority:	high
Version:	6.7	CC:	abeekhof, cfeist, cluster-maint, cluster-qe, fdinitto, jkortus, kgaillot, kwenning
Target Milestone:	rc
Target Release:	6.8
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	pacemaker-1.1.14-1.1.el6	Doc Type:	Bug Fix
Doc Text:	Cause: Removed nodes were not consistently purged from all Pacemaker components' peer caches. Consequence: Removing and adding nodes can result in a node ID being recycled, which should be OK but caused daemon crashes due to conflicting information from the former node not being purged from the peer cache. Fix: Peer cache management has been overhauled so that the libcluster library handles node reaping itself rather than relying on the individual components to do it correctly. Result: Recycling node IDs should not cause any problems.	Story Points:	---
Clone Of:	1162727	Environment:
Last Closed:	2016-05-10 23:51:13 UTC	Type:	Bug
Bug Depends On:	1162727
Bug Blocks:

Description Radek Steiger 2015-02-17 13:25:44 UTC

+++ This bug was initially created as a clone of Bug #1162727 +++


> Description of problem:

Adding and removing nodes may cause an ID collision in RHEL6 pacemaker, just like in its RHEL7 counterpart. When a newly added node is assigned an ID that was previously used by a different node, a collision occurs somewhere in the pacemaker caches. I can see messages like these in the logs:

Feb 17 13:40:40 virt-091 stonith-ng[6415]:  warning: crm_find_peer: Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3
Feb 17 13:40:42 virt-091 attrd[6417]:  warning: crm_find_peer: Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3
Feb 17 13:40:42 virt-091 cib[6414]:  warning: crm_find_peer: Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3
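
For anyone triaging a similar report, the warnings above can be pulled out of a log with a short script. This is only a sketch: the regular expression is derived from the three lines quoted here, and the function name find_nodeid_conflicts is hypothetical, not part of any pacemaker tool.

```python
import re

# Pattern modelled on the crm_find_peer warnings quoted in this report.
# It is an assumption that other pacemaker versions use the same wording.
CONFLICT_RE = re.compile(
    r"(?P<daemon>[\w-]+)\[\d+\]:\s+warning: crm_find_peer: "
    r"Node '(?P<first>[^']+)' and '(?P<second>[^']+)' "
    r"share the same cluster nodeid: (?P<nodeid>\d+)"
)


def find_nodeid_conflicts(lines):
    """Return (daemon, first_node, second_node, nodeid) for each matching line."""
    conflicts = []
    for line in lines:
        match = CONFLICT_RE.search(line)
        if match:
            conflicts.append(
                (
                    match["daemon"],
                    match["first"],
                    match["second"],
                    int(match["nodeid"]),
                )
            )
    return conflicts


sample_log = [
    "Feb 17 13:40:40 virt-091 stonith-ng[6415]:  warning: crm_find_peer: "
    "Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3",
    "Feb 17 13:40:42 virt-091 attrd[6417]:  warning: crm_find_peer: "
    "Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3",
    "Feb 17 13:40:42 virt-091 cib[6414]:  warning: crm_find_peer: "
    "Node 'virt-093' and 'virt-096' share the same cluster nodeid: 3",
]

for conflict in find_nodeid_conflicts(sample_log):
    print(conflict)
```

Grouping the results by daemon shows which components still hold the stale entry; in the log above that is stonith-ng, attrd and cib.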


Chronology of reproducer steps:

1) Tue Feb 17 13:39:34 CET 2015
pcs cluster node remove virt-096 && cman_tool version -r -S

2) Tue Feb 17 13:39:57 CET 2015
pcs cluster node remove virt-093 && cman_tool version -r -S

3) Tue Feb 17 13:40:28 CET 2015
pcs cluster node add virt-096 --start


The initial node ID distribution:

1    virt-091
2    virt-092
3    virt-093
4    virt-094
5    virt-095
6    virt-096

The final ID distribution after all additions/removals:

1    virt-091
2    virt-092
3    virt-096
4    virt-094
5    virt-095
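
The two ID tables above can be replayed in a small model to show why the warning appears. This is a rough sketch, not pacemaker code: the PeerCache class and the lowest_free_id helper are hypothetical, and the rule that a new node receives the lowest unused ID is an assumption that merely matches the distribution observed here.

```python
def lowest_free_id(assigned):
    """Assumption: a new node is given the lowest unused positive ID."""
    nodeid = 1
    while nodeid in assigned:
        nodeid += 1
    return nodeid


class PeerCache:
    """Toy stand-in for one daemon's peer cache, keyed by node ID."""

    def __init__(self, purge_on_remove):
        self.purge_on_remove = purge_on_remove
        self.by_id = {}

    def learn(self, nodeid, uname):
        """Record a peer; return a warning if the ID is cached under another name."""
        known = self.by_id.get(nodeid)
        warning = None
        if known is not None and known != uname:
            warning = (
                f"Node '{known}' and '{uname}' "
                f"share the same cluster nodeid: {nodeid}"
            )
        self.by_id[nodeid] = uname
        return warning

    def node_removed(self, nodeid):
        """Drop the entry only if this cache purges removed nodes."""
        if self.purge_on_remove:
            self.by_id.pop(nodeid, None)


def replay(purge_on_remove):
    """Replay the reproducer: remove virt-096 and virt-093, then re-add virt-096."""
    membership = {i: f"virt-09{i}" for i in range(1, 7)}
    cache = PeerCache(purge_on_remove)
    for nodeid, uname in membership.items():
        cache.learn(nodeid, uname)

    for uname in ("virt-096", "virt-093"):
        nodeid = next(k for k, v in membership.items() if v == uname)
        del membership[nodeid]
        cache.node_removed(nodeid)

    new_id = lowest_free_id(membership)
    membership[new_id] = "virt-096"
    return membership, cache.learn(new_id, "virt-096")


print(replay(purge_on_remove=False))
print(replay(purge_on_remove=True))
```

In this model a cache that keeps removed nodes still maps ID 3 to virt-093 when virt-096 comes back with that ID, whereas a cache that purges on removal has nothing left to conflict with.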

Comment 1 Radek Steiger 2015-02-17 13:26:30 UTC

> Version-Release number of selected component (if applicable):

pacemaker-1.1.12-4.el6.x86_64
pcs-0.9.138-1.el6.x86_64
corosync-1.4.7-1.el6.x86_64
cman-3.0.12.1-68.el6.x86_64

Comment 4 Andrew Beekhof 2015-03-31 01:29:41 UTC

Bug #1162727 has the list of patches required here.

Comment 5 Andrew Beekhof 2015-04-10 01:22:07 UTC

Patches:

0eb41da: Fix: attrd: Remove offline nodes from node cache for "peer-remove" requests 
ba8d3cd: Fix: membership: Prevent use-after-free in reap_crm_member() 
68d5738: Fix: cluster: Remove unknown offline nodes with conflicting unames from node cache 
c97575b: Fix: crmd: Remove state of unknown nodes with conflicting unames from CIB 
50ffa21: Fix: crmd: Remove unknown nodes with conflicting unames from CIB 
ddccf97: Fix: Membership: Detect and resolve nodes that change their ID 
371e79c: Fix: attrd: Clean out the node cache when requested by the admin 
b658b2b: Fix: attrd: Simplify how node deletions happen 
bf15d36: Fix: cib: Avoid nodeid conflicts we don't care about 
30a1ba9: Fix: fencing: Allow nodes to be purged from the member cache 
c8b413f: Fix: crm_node: Correctly remove nodes from the CIB by nodeid 
0b98ef1: Fix: stonith-ng: Correctly track node state 
72b3a9a: Fix: stonith-ng: No reply is needed for CRM_OP_RM_NODE_CACHE 
e48a7a0: Fix: cib: Correctly track node state 
f51c05d: Fix: cluster: Invoke crm_remove_conflicting_peer() only when the new node's uname is being assigned in the node cache 

and the lib/cluster portion of:

8727a4f: Feature: Allow fail-counts to be removed en-mass when the new attrd is in operation

Comment 8 Ken Gaillot 2015-06-02 19:10:10 UTC

A fix for upstream is pending testing but will not make it in time for 6.7.

Comment 10 Ken Gaillot 2015-07-29 20:40:14 UTC

Fixed upstream as of commit 49fd91f.

Comment 18 errata-xmlrpc 2016-05-10 23:51:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0856.html