
Bug 430127

Summary: clustat has inconsistent view of cluster membership
Product: Red Hat Enterprise Linux 5
Reporter: Nick Strugnell <nstrug>
Component: cman
Assignee: Lon Hohberger <lhh>
Status: CLOSED NOTABUG
QA Contact:
Severity: low
Docs Contact:
Priority: low
Version: 5.1
CC: cluster-maint
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-01-25 17:17:37 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nick Strugnell 2008-01-24 17:43:18 UTC
Description of problem:

clustat across a 3-node cluster is reporting an inconsistent view of node status.

We have 3 nodes:

node1
node2
node3

node1 and node3 report node2 as being offline.

node2 correctly reports all three nodes as up, with rgmanager running.

On all nodes, 'cman_tool nodes' correctly reports all nodes up.
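
For reference, the two views can be compared on each node with the stock
commands (nothing beyond the defaults is assumed here):

  # membership as seen by rgmanager/clustat
  clustat

  # membership as seen by cman itself
  cman_tool nodes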

clusvcadm will happily migrate services to node2, even though clustat reports it
as being down.
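
As a rough illustration of the kind of migration that still succeeds (the
service name below is an example; node2 is the node clustat reports as
offline):

  # migrate a virtual machine service to the node clustat claims is down
  clusvcadm -M vm:guest1 -m node2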


Version-Release number of selected component (if applicable):
rgmanager-2.0.31-1.el5

How reproducible:
We have only seen this once, after the steps outlined below.



Steps to Reproduce:
1. Using luci, remove a node from the cluster. clustat will mark the node as
estranged.
2. Using luci, add the node back into the cluster.
3. Luci incorrectly configures the re-added node as a cluster of one, so we
manually copied the correct cluster.conf over from the two remaining nodes and
restarted the cluster services (a sketch follows this list).
4. From then on, the node that was removed and re-added is shown as offline
when clustat is run on the other two nodes, even though it is actually a full
cluster member.
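
A rough sketch of what step 3 amounts to, assuming the standard RHEL 5 paths
and init scripts; the exact restart order shown is an assumption:

  # on one of the two healthy nodes: push the good config to the re-added node
  scp /etc/cluster/cluster.conf node2:/etc/cluster/cluster.conf

  # on the re-added node: restart the cluster stack to pick it up
  service rgmanager stop
  service cman restart
  service rgmanager start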
  
Actual results:
See above

Expected results:
Removing and re-adding a node into the cluster should work, both from the luci
point of view, and with clustat having a correct view of the cluster.

Additional info:
It appears that to clear this problem we are going to have to bring down the
entire cluster.

Comment 1 Lon Hohberger 2008-01-24 20:36:15 UTC
Did you try restarting rgmanager ?

Comment 2 Nick Strugnell 2008-01-25 16:15:11 UTC
Yes, we tried restarting rgmanager, and even cman - same problem. The
inconsistency was only cleared by restarting cman on all nodes.

Comment 3 Lon Hohberger 2008-01-25 17:04:34 UTC
Ok - this is caused by an inconsistency in what CMAN sees.

Nicholas reproduced this, and we were able to establish that:
* clustat output was inconsistent - two nodes thought another node was
offline; this time it was node1 that was reported as offline
* 'cman_tool nodes' output was likewise inconsistent between nodes
* the CLM (openais) logs were consistent - all 3 nodes were present in the
most recent CLM configuration (see the syslog sketch after this list)
* clustat and cman_tool agreed with each other on any given node, which is
expected since both report CMAN's local view of membership
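
A minimal sketch of the cross-check against the openais CLM logs, assuming
openais is logging to syslog as on a default RHEL 5 install (the grep pattern
is approximate, not an exact message format):

  # recent CLM membership/configuration messages on each node
  grep -i "CLM" /var/log/messages | tail -n 40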

This also caused rgmanager to try to migrate a VM to the same node it was
already running on, which obviously cannot work.

Currently, the theory is that because 'uname -n' did not match any IP address
bound on the system, CMAN was effectively picking an interface more or less at
random, and it only happened to work some of the time.  We are trying to pass
an explicit node name via 'cman_tool join -n' by editing /etc/init.d/cman.
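
A sketch of the check and the workaround being tried, assuming a default
RHEL 5 setup; the fully-qualified node name below is only an example:

  # does the node name actually resolve to an address bound on this host?
  uname -n
  getent hosts "$(uname -n)"
  ip addr show

  # workaround under test: have cman join with an explicit node name instead
  # of deriving one from 'uname -n' (added here by editing /etc/init.d/cman)
  cman_tool join -n node2.cluster.example.com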

Comment 4 Lon Hohberger 2008-01-25 17:06:50 UTC
Channel bonding is in use, FWIW.

Comment 5 Lon Hohberger 2008-01-25 17:17:37 UTC
Assigning the node name explicitly with 'cman_tool join -n' gave consistent
results.