Red Hat Bugzilla – Bug 430127
clustat has inconsistent view of cluster membership
Last modified: 2009-04-16 18:56:13 EDT
Description of problem:
clustat across a 3-node cluster is reporting an inconsistent view of node status.
We have 3 nodes:
node1 and node3 report node2 as being offline.
node2 correctly reports all three nodes as being up and with rgmanager running.
On all nodes, 'cman_tool nodes' correctly reports all nodes up.
clusvcadm will happily migrate services to node2, even though clustat reports it
as being down.
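The split view described above can be spotted mechanically by collecting 'cman_tool nodes' output from each member and diffing the per-node status column. A small illustrative sketch (the parser, helper names, and sample outputs below are hypothetical, not taken from this cluster):

```python
# Sketch: compare each node's view of cluster membership and flag
# any node whose status is not reported identically by all members.

def parse_cman_nodes(output):
    """Parse 'cman_tool nodes'-style output into {node_name: status}."""
    members = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        status, name = fields[1], fields[-1]
        members[name] = status  # 'M' = member, 'X' = dead
    return members

def find_disagreements(views):
    """views: {reporting_node: parsed_output}.  Return node names whose
    status differs depending on which member you ask."""
    all_names = set()
    for v in views.values():
        all_names |= set(v)
    return sorted(
        n for n in all_names
        if len({v.get(n) for v in views.values()}) > 1
    )

# Hypothetical outputs resembling the inconsistency in this report:
node1_view = """Node  Sts   Inc   Joined               Name
   1   M     4     2008-01-24 10:00:00  node1
   2   X     0                          node2
   3   M     8     2008-01-24 10:00:05  node3"""
node2_view = """Node  Sts   Inc   Joined               Name
   1   M     4     2008-01-24 10:00:00  node1
   2   M     6     2008-01-24 10:00:02  node2
   3   M     8     2008-01-24 10:00:05  node3"""

views = {"node1": parse_cman_nodes(node1_view),
         "node2": parse_cman_nodes(node2_view)}
print(find_disagreements(views))  # -> ['node2']
```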
Version-Release number of selected component (if applicable):
How reproducible:
We have only seen this once, after the steps outlined below.
Steps to Reproduce:
1. Using luci, remove a node from the cluster. clustat will mark the node as
offline.
2. Using luci, add the node back into the cluster.
3. Luci will incorrectly configure the node as a cluster of 1, so we manually
copied over the correct cluster.conf from the two remaining nodes and restarted
the cluster services on that node.
4. From now on, the node that was removed and then re-added to the cluster will
be seen as offline when clustat is run on the two other nodes, however it is
actually a full cluster member.
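For context, the hand-copied cluster.conf must list the full three-node membership, rather than the single-node configuration luci generated. A minimal sketch of the relevant section (the cluster name, config_version, and fencing details here are placeholders, not from this report):

```xml
<?xml version="1.0"?>
<cluster name="example" config_version="1">
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1"/>
    <clusternode name="node2" nodeid="2" votes="1"/>
    <clusternode name="node3" nodeid="3" votes="1"/>
  </clusternodes>
  <cman/>
  <fencedevices/>
  <rm/>
</cluster>
```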
Expected results:
Removing and re-adding a node into the cluster should work, both from the luci
point of view and with clustat having a correct view of the cluster.
It appears that to clear this problem we are going to have to bring down the
entire cluster.
Did you try restarting rgmanager?
Yes, tried restarting rgmanager, and even cman - same problem. Only cleared by
restarting cman on all nodes.
Ok - this is caused by an inconsistency in what CMAN sees.
Nicholas reproduced this, and we were able to figure out that
* clustat output was inconsistent - two nodes thought another was offline; this
time it was node 1 that was offline
* 'cman_tool nodes' output was inconsistent between nodes
* logs from CLM (openais) were consistent - all 3 nodes were in the most recent
new CLM configuration
* clustat and cman_tool were consistent with each other on a given node (which
is expected, since clustat takes its membership information from CMAN)
This also caused rgmanager to try to migrate a VM to the same node it was
running on, which can't actually happen (and doesn't work, obviously).
Currently, the theory is that because 'uname -n' didn't match any IP address
bound on the system, CMAN was kind of just 'picking' interfaces and it just
happened to work sometimes. We're trying 'cman_tool join -n' by editing the
init script to set the node name explicitly.
Channel bonding is in use, FWIW.
Assigning the node name explicitly with 'cman_tool join -n' gave consistent
results.
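The workaround amounts to passing the node name at join time instead of letting CMAN derive it from 'uname -n'. The invocation below is illustrative (node1 is taken from this report; run the equivalent on each node with its own name):

```
# Join the cluster with an explicit node name matching the
# <clusternode name="..."/> entry in cluster.conf:
cman_tool join -n node1
```

The node name given here should resolve to an address that is actually bound on the interface CMAN is meant to use (the bonded interface, in this setup), which is what was failing when the name was taken from 'uname -n'.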