Bug 430127 - clustat has inconsistent view of cluster membership
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assigned To: Lon Hohberger
Reported: 2008-01-24 12:43 EST by Nick Strugnell
Modified: 2009-04-16 18:56 EDT
CC List: 1 user

Doc Type: Bug Fix
Last Closed: 2008-01-25 12:17:37 EST

Attachments: None
Description Nick Strugnell 2008-01-24 12:43:18 EST
Description of problem:

clustat across a 3-node cluster is reporting an inconsistent view of node status.

We have three nodes: node1, node2 and node3.

node1 and node3 report node2 as being offline.

node2 correctly reports all three nodes as up, with rgmanager running.

On all nodes, 'cman_tool nodes' correctly reports all nodes up.

clusvcadm will happily migrate services to node2, even though clustat reports it
as being down.
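
For reference, this was observed with plain runs of the two status commands
on each node, and the migration that succeeded anyway looked roughly like
the following ('vm:guest1' stands in for the real service name):

  clustat                          # rgmanager's view of membership
  cman_tool nodes                  # CMAN's view of membership
  clusvcadm -M vm:guest1 -m node2  # migrate a VM service to the "offline" node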

Version-Release number of selected component (if applicable):

How reproducible:
We have only seen this once, after the steps outlined below.

Steps to Reproduce:
1. Using luci, remove a node from the cluster. clustat will mark the node as
Estranged.
2. Using luci, add the node back into the cluster. 
3. luci will incorrectly configure the node as a cluster of one, so we manually
copied the correct cluster.conf over from the two remaining nodes and restarted
(see the sketch after this list).
4. From now on, the node that was removed and re-added is seen as offline when
clustat is run on the two other nodes, even though it is actually a full
cluster member.
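
A sketch of the manual fix in step 3, assuming the stock
/etc/cluster/cluster.conf path and node1 as one of the surviving members:

  scp root@node1:/etc/cluster/cluster.conf /etc/cluster/cluster.conf
  service cman restart
  service rgmanager restart
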
Actual results:
See above

Expected results:
Removing and re-adding a node into the cluster should work, both from the luci
point of view, and with clustat having a correct view of the cluster.

Additional info:
It appears that to clear this problem we are going to have to bring down the
entire cluster.
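
For the record, a full restart of the stack would be roughly the following,
run on every node (rgmanager comes down before cman, and cman comes back up
before rgmanager):

  service rgmanager stop
  service cman stop
  # ...once all nodes are down:
  service cman start
  service rgmanager start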
Comment 1 Lon Hohberger 2008-01-24 15:36:15 EST
Did you try restarting rgmanager?
Comment 2 Nick Strugnell 2008-01-25 11:15:11 EST
Yes - we tried restarting rgmanager, and even cman; same problem. It only
cleared after restarting cman on all nodes.
Comment 3 Lon Hohberger 2008-01-25 12:04:34 EST
Ok - this is caused by an inconsistency in what CMAN sees.

Nicholas reproduced this, and we were able to figure out that:
* clustat output was inconsistent - two nodes thought another was offline; this
time it was node 1 that was offline
* 'cman_tool nodes' output was inconsistent across nodes
* logs from CLM (openais) were consistent - all 3 nodes were in the most recent
new CLM configuration
* clustat and cman_tool were consistent with each other on a given node (which
is expected)

This also caused rgmanager to try to migrate a VM to the same node it was
running on, which can't actually happen (and doesn't work, obviously).

Currently, the theory is that because 'uname -n' didn't match any IP bound on
the system, CMAN was effectively just 'picking' an interface, and it only
happened to work sometimes. We're trying to set the node name explicitly with
'cman_tool join -n'.
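
A quick way to check for that mismatch on each node, using standard tools:

  uname -n                   # the node name CMAN derives by default
  getent hosts $(uname -n)   # what that name resolves to
  ip addr show               # compare against the addresses actually bound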
Comment 4 Lon Hohberger 2008-01-25 12:06:50 EST
Channel bonding is in use, FWIW.
Comment 5 Lon Hohberger 2008-01-25 12:17:37 EST
Assigning the node name explicitly with 'cman_tool join -n' gave consistent
results.
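
For anyone hitting the same thing, the workaround looked roughly like this,
run per node with cman stopped ('node2' is a placeholder; the name must match
the node name in cluster.conf):

  service cman stop        # leave the cluster first
  cman_tool join -n node2  # rejoin with the node name given explicitly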
