Bug 430127 - clustat has inconsistent view of cluster membership
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assigned To: Lon Hohberger
Reported: 2008-01-24 12:43 EST by Nick Strugnell
Modified: 2009-04-16 18:56 EDT
CC List: 1 user

Doc Type: Bug Fix
Last Closed: 2008-01-25 12:17:37 EST

Attachments: None
Description Nick Strugnell 2008-01-24 12:43:18 EST
Description of problem:

clustat across a 3-node cluster is reporting an inconsistent view of node status.

We have three nodes: node1, node2 and node3.

node1 and node3 report node2 as being offline.

node2 correctly reports all three nodes as up, with rgmanager running.

On all nodes, 'cman_tool nodes' correctly reports all nodes up.

clusvcadm will happily migrate services to node2, even though clustat reports it
as being down.
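
For reference, this was observed with plain runs of the two status commands
on each node, and the migration that succeeded anyway looked roughly like
the following ('vm:guest1' stands in for the real service name):

  clustat                          # rgmanager's view of membership
  cman_tool nodes                  # CMAN's view of membership
  clusvcadm -M vm:guest1 -m node2  # migrate a VM service to the "offline" node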

Version-Release number of selected component (if applicable):

How reproducible:
We have only seen this once, after the steps outlined below.

Steps to Reproduce:
1. Using luci, remove a node from the cluster. clustat will mark the node as
Estranged.
2. Using luci, add the node back into the cluster. 
3. luci will incorrectly configure the node as a cluster of one, so we manually
copied the correct cluster.conf over from the two remaining nodes and restarted
(see the sketch after this list).
4. From now on, the node that was removed and re-added is seen as offline when
clustat is run on the two other nodes, even though it is actually a full
cluster member.
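
A sketch of the manual fix in step 3, assuming the stock
/etc/cluster/cluster.conf path and node1 as one of the surviving members:

  scp root@node1:/etc/cluster/cluster.conf /etc/cluster/cluster.conf
  service cman restart
  service rgmanager restart
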
Actual results:
See above

Expected results:
Removing and re-adding a node into the cluster should work, both from the luci
point of view, and with clustat having a correct view of the cluster.

Additional info:
It appears that to clear this problem we are going to have to bring down the
entire cluster.
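
For the record, a full restart of the stack would be roughly the following,
run on every node (rgmanager comes down before cman, and cman comes back up
before rgmanager):

  service rgmanager stop
  service cman stop
  # ...once all nodes are down:
  service cman start
  service rgmanager start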
Comment 1 Lon Hohberger 2008-01-24 15:36:15 EST
Did you try restarting rgmanager?
Comment 2 Nick Strugnell 2008-01-25 11:15:11 EST
Yes - we tried restarting rgmanager, and even cman; same problem. It only
cleared after restarting cman on all nodes.
Comment 3 Lon Hohberger 2008-01-25 12:04:34 EST
Ok - this is caused by an inconsistency in what CMAN sees.

Nicholas reproduced this, and we were able to figure out that:
* clustat output was inconsistent - two nodes thought another was offline; this
time it was node 1 that was offline
* 'cman_tool nodes' output was inconsistent across nodes
* logs from CLM (openais) were consistent - all 3 nodes were in the most recent
new CLM configuration
* clustat and cman_tool were consistent with each other on a given node (which
is expected)

This also caused rgmanager to try to migrate a VM to the same node it was
running on, which can't actually happen (and doesn't work, obviously).

Currently, the theory is that because 'uname -n' didn't match any IP bound on
the system, CMAN was effectively just 'picking' an interface, and it only
happened to work sometimes. We're trying to set the node name explicitly with
'cman_tool join -n'.
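
A quick way to check for that mismatch on each node, using standard tools:

  uname -n                   # the node name CMAN derives by default
  getent hosts $(uname -n)   # what that name resolves to
  ip addr show               # compare against the addresses actually bound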
Comment 4 Lon Hohberger 2008-01-25 12:06:50 EST
Channel bonding is in use, FWIW.
Comment 5 Lon Hohberger 2008-01-25 12:17:37 EST
Assigning the node name explicitly with 'cman_tool join -n' gave consistent
results.
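
For anyone hitting the same thing, the workaround looked roughly like this,
run per node with cman stopped ('node2' is a placeholder; the name must match
the node name in cluster.conf):

  service cman stop        # leave the cluster first
  cman_tool join -n node2  # rejoin with the node name given explicitly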
