Description of problem:
clustat across a 3-node cluster is reporting an inconsistent view of node status. We have 3 nodes: node1, node2, node3. node1 and node3 report node2 as being offline. node2 correctly reports all three nodes as being up and with rgmanager running. On all nodes, 'cman_tool nodes' correctly reports all nodes up. clusvcadm will happily migrate services to node2, even though clustat reports it as being down.

Version-Release number of selected component (if applicable):
rgmanager-2.0.31-1.el5

How reproducible:
We have only seen this once, after the steps outlined below.

Steps to Reproduce:
1. Using luci, remove a node from the cluster. clustat will mark the node as Estranged.
2. Using luci, add the node back into the cluster.
3. Luci will incorrectly configure the node as a cluster of 1, so we manually copied the correct cluster.conf over from the two remaining nodes and restarted services.
4. From now on, the node that was removed and re-added is seen as offline when clustat is run on the other two nodes, even though it is actually a full cluster member.

Actual results:
See above.

Expected results:
Removing and re-adding a node should work, both from the luci point of view and with clustat having a correct view of the cluster.

Additional info:
It appears that to clear this problem we are going to have to bring down the entire cluster.
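For anyone trying to confirm the inconsistency, the comparison boils down to running the following on each node and diffing the output (node names as above; this is just a sketch of the checks described, not a fix):

    # Run on node1, node2 and node3 and compare.
    clustat            # membership plus rgmanager status, as clustat sees it
    cman_tool nodes    # CMAN's node list, state and join time

    # In this report, clustat on node1/node3 shows node2 offline while
    # 'cman_tool nodes' shows all three nodes joined on every node.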
Did you try restarting rgmanager?
Yes, we tried restarting rgmanager, and even cman - same problem. It was only cleared by restarting cman on all nodes.
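For completeness, the restarts were of the usual init-script form below (a sketch, assuming the standard RHEL 5 service scripts; rgmanager has to be stopped before cman and started after it):

    # Restarting on a single node - did not clear the problem:
    service rgmanager stop
    service cman stop
    service cman start
    service rgmanager start

    # Only restarting cman on all nodes cleared the stale membership view.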
Ok - this is caused by an inconsistency in what CMAN sees. Nicholas reproduced this, and we were able to figure out that:

* clustat output was inconsistent - two nodes thought another node was offline; this time it was node1 that appeared offline
* 'cman_tool nodes' output was likewise inconsistent across nodes
* the CLM (openais) logs were consistent - all 3 nodes were in the most recent CLM configuration
* clustat and cman_tool were consistent with each other on any given node (which is expected)

This also caused rgmanager to try to migrate a VM to the node it was already running on, which can't actually happen (and doesn't work, obviously).

The current theory is that because 'uname -n' did not match any IP address bound on the system, CMAN was effectively just picking an interface, and it only happened to work sometimes. We're trying 'cman_tool join -n' by editing /etc/init.d/cman.
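A quick way to test the 'uname -n' theory on a node (standard RHEL 5 tools assumed; the bond interface name is an example):

    # The node name CMAN derives by default:
    uname -n

    # What that name resolves to via /etc/hosts or DNS:
    getent hosts "$(uname -n)"

    # Addresses actually bound on the system (including bond0); if the
    # resolved address is not bound here, CMAN has to guess an interface.
    ip -4 addr show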
Channel bonding is in use, FWIW.
Assigning the node name explicitly with 'cman_tool join -n' gave consistent results.
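The manual equivalent of the init-script change is roughly the following (node name is an example and must match a <clusternode> name in cluster.conf):

    # Join with an explicit node name instead of relying on 'uname -n'
    # and interface autodetection; in /etc/init.d/cman this amounts to
    # adding '-n <nodename>' to the existing 'cman_tool join' invocation.
    cman_tool join -n node1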