Description of problem:
cman doesn't resync its node information after you add a node by running ccs_tool update with a new version of the cluster.conf file. This causes a problem when starting the new node into the cluster. CCSD gets a version mismatch if I copy the new cluster.conf file to the new node, because it differs from the version the cluster is formed around. The new node then tries to form its own cluster around that version of the file, which causes its fencing agent to hang.

[root@kanderso-xen-01 cluster]# diff cluster.conf.1node cluster.conf
2c2
< <cluster name="ka-xen-cluster" config_version="32">
---
> <cluster name="ka-xen-cluster" config_version="31">
31,35d30
< <clusternode name="kanderso-xen-06.lab.msp.redhat.com" votes="1" nodeid="10">
< <fence>
< <method name="1"><device name="xvm" domain="kanderso-xen-06"/></method>
< </fence>
< </clusternode>

[root@kanderso-xen-01 cluster]# cman_tool status
Version: 6.0.1
Config Version: 31
Cluster Name: ka-xen-cluster
Cluster Id: 15028
Cluster Member: Yes
Cluster Generation: 24
Membership state: Cluster-Member
Nodes: 9
Expected votes: 9
Total votes: 9
Quorum: 5
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: kanderso-xen-01.lab.msp.redhat.com
Node ID: 1
Multicast addresses: 239.192.58.238
Node addresses: 10.15.85.21

[root@kanderso-xen-01 cluster]# ccs_tool update cluster.conf.1node
Config file updated from version 31 to 32
Update complete.
[root@kanderso-xen-01 cluster]# ccs_tool lsnode
Cluster name: ka-xen-cluster, config_version: 32

Nodename                           Votes Nodeid Fencetype
kanderso-xen-01.lab.msp.redhat.com     1      1 xvm
kanderso-xen-02.lab.msp.redhat.com     1      2 xvm
kanderso-xen-03.lab.msp.redhat.com     1      3 xvm
kanderso-xen-04.lab.msp.redhat.com     1      4 xvm
kanderso-xen-05.lab.msp.redhat.com     1      5 xvm
kanderso-xen-06.lab.msp.redhat.com     1     10 xvm
kanderso-xen-22.lab.msp.redhat.com     1      6 xvm
kanderso-xen-23.lab.msp.redhat.com     1      7 xvm
kanderso-xen-24.lab.msp.redhat.com     1      8 xvm
kanderso-xen-25.lab.msp.redhat.com     1      9 xvm

[root@kanderso-xen-01 cluster]# cman_tool status
Version: 6.0.1
Config Version: 31
Cluster Name: ka-xen-cluster
Cluster Id: 15028
Cluster Member: Yes
Cluster Generation: 24
Membership state: Cluster-Member
Nodes: 9
Expected votes: 9
Total votes: 9
Quorum: 5
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: kanderso-xen-01.lab.msp.redhat.com
Node ID: 1
Multicast addresses: 239.192.58.238
Node addresses: 10.15.85.21

[root@kanderso-xen-01 cluster]# cman_tool join
cman_tool: Node is already active

[root@kanderso-xen-01 cluster]# cman_tool status
Version: 6.0.1
Config Version: 31
Cluster Name: ka-xen-cluster
Cluster Id: 15028
Cluster Member: Yes
Cluster Generation: 24
Membership state: Cluster-Member
Nodes: 9
Expected votes: 9
Total votes: 9
Quorum: 5
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: kanderso-xen-01.lab.msp.redhat.com
Node ID: 1
Multicast addresses: 239.192.58.238
Node addresses: 10.15.85.21

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
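The mismatch visible in the transcript (ccs_tool lsnode reporting config_version 32 while cman_tool status still reports Config Version 31) can be checked mechanically. The following is a minimal sketch, not part of the cluster tools: the function names are hypothetical, and it assumes the two commands print the "Config Version:" and "Cluster name: ..., config_version:" lines in the form shown in this report.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text in the format shown in this report.

# Read `cman_tool status` output on stdin and print the Config Version value.
cman_config_version() {
    sed -n 's/^Config Version:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read `ccs_tool lsnode` output on stdin and print the config_version value.
ccs_config_version() {
    sed -n 's/^Cluster name:.*config_version:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Compare the two versions: $1 is the cman version, $2 is the ccs version.
# Returns non-zero when they differ.
report_versions() {
    if [ "$1" = "$2" ]; then
        echo "in sync: cman and ccs both at version $1"
    else
        echo "mismatch: cman has version $1, ccs has version $2"
        return 1
    fi
}

# Intended use on a cluster node (not executed here):
#   report_versions "$(cman_tool status | cman_config_version)" \
#                   "$(ccs_tool lsnode | ccs_config_version)"
```

With the outputs captured above, the comparison would be between 31 and 32, so the sketch would report a mismatch.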
I opened this to track it, even if it is just a configuration command that I am missing.
Doing a service cman restart on a node in the existing cluster also causes the errors to occur. The initscript hangs while trying to start the fence daemon.
However, after doing the service cman restart on one node, the remaining nodes in the cluster now show the new version of the configuration:

[root@kanderso-xen-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X     12                        kanderso-xen-01.lab.msp.redhat.com
   2   M      4   2006-11-27 17:10:57  kanderso-xen-02.lab.msp.redhat.com
   3   M     20   2006-11-27 17:10:58  kanderso-xen-03.lab.msp.redhat.com
   4   M     24   2006-11-27 17:10:59  kanderso-xen-04.lab.msp.redhat.com
   5   M     16   2006-11-27 17:10:58  kanderso-xen-05.lab.msp.redhat.com
   6   M     20   2006-11-27 17:10:58  kanderso-xen-22.lab.msp.redhat.com
   7   M     16   2006-11-27 17:10:58  kanderso-xen-23.lab.msp.redhat.com
   8   M     12   2006-11-27 17:10:57  kanderso-xen-24.lab.msp.redhat.com
   9   M     12   2006-11-27 17:10:57  kanderso-xen-25.lab.msp.redhat.com
  10   X      0                        kanderso-xen-06.lab.msp.redhat.com

[root@kanderso-xen-02 ~]# cman_tool status
Version: 6.0.1
Config Version: 32
Cluster Name: ka-xen-cluster
Cluster Id: 15028
Cluster Member: Yes
Cluster Generation: 28
Membership state: Cluster-Member
Nodes: 9
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: kanderso-xen-02.lab.msp.redhat.com
Node ID: 2
Multicast addresses: 239.192.58.238
Node addresses: 10.15.85.22

Then starting the cluster software on the new node caused this node to be reconfigured and put back in operation. Very strange, but everything is running correctly again with 10 nodes in the cluster.
The correct procedure for updating the config file is either:

1. ccs_tool update <file>; cman_tool version -r <version>

or

2. Simply start a new node with the later config file (the others will spot the new version and read it from ccs)

so you got half of 1, and 2 rescued you :-)

Number 1 is the faff - it should be a single-step process and you shouldn't have to remember the version number for the second command. It's (in RHEL5) trivial to make ccs_tool tell cman that the file has been changed so I'll do that.
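Until ccs_tool does this itself, the two-step procedure can be wrapped so the version number does not have to be remembered. This is only a sketch under stated assumptions: the helper names are hypothetical, the version is read from the config_version attribute of the <cluster> tag as it appears in the diff in this report, and the commands are printed rather than run so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helpers; nothing here is executed against a cluster.

# Print the config_version attribute of the <cluster> tag in the given file.
config_version_of() {
    sed -n 's/.*<cluster[^>]*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# Print the two commands of procedure 1 for the given config file.
print_update_commands() {
    conf=$1
    version=$(config_version_of "$conf")
    if [ -z "$version" ]; then
        echo "no config_version found in $conf" >&2
        return 1
    fi
    echo "ccs_tool update $conf"
    echo "cman_tool version -r $version"
}
```

For the file in this report, print_update_commands cluster.conf.1node would print "ccs_tool update cluster.conf.1node" followed by "cman_tool version -r 32".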
Checking in update.c;
/cvs/cluster/cluster/ccs/ccs_tool/update.c,v <-- update.c
new revision: 1.9; previous revision: 1.8
done

Checking in update.c;
/cvs/cluster/cluster/ccs/ccs_tool/update.c,v <-- update.c
new revision: 1.8.2.1; previous revision: 1.8
done
Ok, so operator error. Do we want to change the process for RHEL5, since it is currently the same as for RHEL4? What happens if someone runs the cman_tool version command after doing the ccs_tool update with the new changes? At this point, I am leaning towards not changing this for the release.
I think it's a nice change to have (though I've left it off the RHEL50 branch). The extra step seems a bit pointless, as we already know all the information. And if the user does do the extra step with the new code, it's harmless. The rest of the new cman has code to reduce the amount of manual intervention needed with ccs updates, and this is a logical step, I feel.
Add it to the RHEL50 branch as well, given that it will be harmless and is heading in the right direction.
Added to RHEL50:

Checking in update.c;
/cvs/cluster/cluster/ccs/ccs_tool/update.c,v <-- update.c
new revision: 1.8.4.1; previous revision: 1.8
done
A package has been built which should help the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.