Description of problem:

Conga - modifying a cluster node's cluster membership in a subnet with other clusters results in the wrong cluster.conf

Version-Release number of selected component (if applicable):
RHEL5-Server-20061027.0
luci-0.8-21.el5
ricci-0.8-21.el5

How reproducible:
100%

Steps to Reproduce:

1. Use nodes tng3-1.lab.msp.redhat.com through tng3-5.lab.msp.redhat.com - these nodes are on a subnet with other existing clusters. On the nodes that will comprise the new cluster, the cman service has these chkconfig settings before the cluster is created:

chkconfig --list cman
cman    0:off   1:off   2:off   3:off   4:off   5:off   6:off

2. Create a new cluster - for this example, I created a new cluster with one node (tng3-3.lab.msp.redhat.com). This results in the following entry in /var/lib/ricci/queue:

----------------------------------------
<?xml version="1.0"?>
<batch batch_id="2053224548" status="0">
  <module name="rpm" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="install">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
  <module name="reboot" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="reboot_now">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
  <module name="cluster" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="set_cluster.conf">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
  <module name="cluster" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="start_node">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
</batch>
----------------------------------------

3. Node tng3-3.lab.msp.redhat.com is automatically rebooted. After the reboot, the node is correctly functioning as a member of the new cluster. /etc/cluster/cluster.conf contains the following:

----------------------------------------
<?xml version="1.0"?>
<cluster alias="oct31_4" config_version="1" name="oct31_4">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="tng3-3.lab.msp.redhat.com" nodeid="1" votes="1"/>
  </clusternodes>
  <cman/>
  <fencedevices/>
  <rm/>
</cluster>
----------------------------------------

4. So far, so good - now for the problem. Delete the cluster definition via the luci web app. Bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213076 prevents the deletion of the /etc/cluster/cluster.conf file on node tng3-3.lab.msp.redhat.com, so rename the file and restart the ricci service. At this point, the cman service has these chkconfig settings:

chkconfig --list cman
cman    0:off   1:off   2:on    3:on    4:on    5:on    6:off

5. Create a new cluster containing node tng3-3.lab.msp.redhat.com.

6. After the new cluster is created, the node is automatically rebooted. After the reboot, the node's /etc/cluster/cluster.conf does not contain an entry for the just-created cluster. Instead, the file contains the definition of another cluster that exists on the subnet. See attachment #1.
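A quick way to confirm which cluster the node actually joined after the reboot (a verification sketch only; the exact wording of the cman_tool output may differ):

----------------------------------------
# Cluster name recorded in the local configuration file
grep "<cluster " /etc/cluster/cluster.conf

# Cluster the node has actually joined, according to cman
cman_tool status | grep -i "cluster name"
----------------------------------------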
The debug log shows what's happened - ccsd has retrieved one of the existing cluster definitions:

----------------------------------------
Oct 31 03:50:43 tng3-3 modclusterd: startup succeeded
Oct 31 03:50:43 tng3-3 clurgmgrd[1947]: <notice> Waiting for CMAN to start
Oct 31 03:50:43 tng3-3 oddjobd: oddjobd startup succeeded
Oct 31 03:50:44 tng3-3 saslauthd[2008]: detach_tty : master pid is: 2008
Oct 31 03:50:44 tng3-3 saslauthd[2008]: ipc_init : listening on socket: /var/run/saslauthd/mux
Oct 31 03:50:44 tng3-3 ricci: startup succeeded
Oct 31 03:50:45 tng3-3 ccsd[1590]: Remote copy of cluster.conf is from quorate node.
Oct 31 03:50:45 tng3-3 ccsd[1590]: Local version # : 4
Oct 31 03:50:45 tng3-3 ccsd[1590]: Remote version #: 4
----------------------------------------

As does the following entry in /var/lib/ricci/queue:

----------------------------------------
<?xml version="1.0"?>
<batch batch_id="1391419975" status="4">
  <module name="rpm" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="install">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
  <module name="reboot" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="reboot_now">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
  <module name="cluster" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="set_cluster.conf">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
  <module name="cluster" status="4">
    <response API_version="1.0" sequence="">
      <function_response function_name="start_node">
        <var mutable="false" name="success" type="boolean" value="false"/>
        <var mutable="false" name="error_code" type="int" value="-1"/>
        <var mutable="false" name="error_description" type="string" value="service cman start failed"/>
      </function_response>
    </response>
  </module>
</batch>
----------------------------------------

Actual results:
After the node reboots, it's in one of the existing clusters - not the newly created cluster.

Expected results:
The node should be in the newly created cluster.

Additional info:
I've been able to avoid this problem by setting the chkconfig values for cman to the following at the very start of step 4 above:

chkconfig --level 2345 cman off
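Pulling together the manual cleanup from step 4 and the workaround above, the sequence that avoids the problem on the affected node looks roughly like this (the name of the renamed copy is arbitrary):

----------------------------------------
# Manual cleanup on tng3-3.lab.msp.redhat.com before re-creating the cluster
mv /etc/cluster/cluster.conf /etc/cluster/cluster.conf.saved   # bug 213076 leaves the file behind
service ricci restart
chkconfig --level 2345 cman off   # keep cman from starting at boot and pulling a remote cluster.conf
----------------------------------------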
Created attachment 139864: cluster.conf - from one of the existing clusters
Note that node removal (step 4 above) wasn't completed: cman was left running and enabled to start at boot. That is the root of this bug. It is tempting to mark this as NOTABUG, but to make Conga more robust (a new node could go astray, as shown above), it needs to be addressed.
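For reference, a sketch of the cleanup a completed node removal would be expected to leave behind - the exact steps luci/ricci perform are not spelled out in this bug, so the list below is an assumption based on the state described above:

----------------------------------------
# Assumed post-removal state on the node
service cman stop                 # stop the running cluster manager
chkconfig --level 2345 cman off   # do not start it at boot
rm -f /etc/cluster/cluster.conf   # bug 213076 currently prevents this step
----------------------------------------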
The fix for this one is simple and concerns luci alone:

- Insert a "chkconfig cluster_services off" call (already present in the modservice module) between the "install" and "reboot" phases, both for cluster creation and for node addition.
- The "Cluster creation" and "add node" status pages need minor changes; the "disabling daemons" phase belongs under the "installation" phase, so the only change is in the backend function that retrieves status (it needs to group the results of the first two calls as one).
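As a rough sketch of what that new phase amounts to on each node - the actual service list handled by modservice isn't shown in this bug, so the two services below are an assumption drawn from the log (cman and rgmanager/clurgmgrd):

----------------------------------------
# Assumed effect of the "disabling daemons" step inserted between "install" and "reboot":
# the cluster daemons must not come up on their own after the reboot, otherwise ccsd can
# pull an existing cluster.conf from a quorate neighbor before luci pushes the new one.
for svc in cman rgmanager; do
    chkconfig "$svc" off
done
----------------------------------------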
fixed in -HEAD
Cannot recreate the problem with:

luci-0.8-30.el5
ricci-0.8-30.el5

Marking the bz as verified.