Bug 213266 - Conga - modifying a cluster node's cluster membership in a subnet with other clusters results in the wrong cluster.conf
Summary: Conga - modifying a cluster node's cluster membership in a subnet with other clusters results in the wrong cluster.conf
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: conga
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Ryan McCabe
QA Contact: Corey Marthaler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-10-31 15:30 UTC by Len DiMaggio
Modified: 2009-04-16 22:42 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-23 16:41:46 UTC
Target Upstream Version:
Embargoed:


Attachments
cluster.conf - from one of the existing clusters (3.41 KB, text/plain)
2006-10-31 15:30 UTC, Len DiMaggio

Description Len DiMaggio 2006-10-31 15:30:48 UTC
Description of problem:

Conga - modifying a cluster node's cluster membership in a subnet with other
clusters results in the wrong cluster.conf

Version-Release number of selected component (if applicable):
RHEL5-Server-20061027.0
luci-0.8-21.el5
ricci-0.8-21.el5

How reproducible:
100%

Steps to Reproduce:

1. Use nodes tng3-1.lab.msp.redhat.com through tng3-5.lab.msp.redhat.com; these
nodes are on a subnet with other existing clusters.

On the nodes that will comprise the new cluster, the cman service will have
these chkconfig settings before the cluster is created:

chkconfig --list cman
cman            0:off   1:off   2:off   3:off   4:off   5:off   6:off

2. Create a new cluster. For this example, I created a new cluster with one
node (tng3-3.lab.msp.redhat.com), which results in the following entry in
/var/lib/ricci/queue:

----------------------------------------
<?xml version="1.0"?>
<batch batch_id="2053224548" status="0">
        <module name="rpm" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="install">
                                <var mutable="false" name="success"
type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="reboot" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="reboot_now">
                                <var mutable="false" name="success"
type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="cluster" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="set_cluster.conf">
                                <var mutable="false" name="success"
type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="cluster" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="start_node">
                                <var mutable="false" name="success"
type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
</batch>
----------------------------------------
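
For reference, these batch jobs can also be inspected directly on the node
while they run; each queued job is an XML file in the ricci queue directory
(file names are assigned by ricci):

----------------------------------------
# List the queued batch jobs and dump their contents
ls /var/lib/ricci/queue/
cat /var/lib/ricci/queue/*
----------------------------------------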

3. Node tng3-3.lab.msp.redhat.com is automatically rebooted. After the reboot,
the node functions correctly as a member of the new cluster.

/etc/cluster/cluster.conf contains the following:

----------------------------------------
<?xml version="1.0"?>
<cluster alias="oct31_4" config_version="1" name="oct31_4">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="tng3-3.lab.msp.redhat.com" nodeid="1" votes="1"/>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm/>
</cluster>
----------------------------------------
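
A quick sanity check at this point (illustrative; cman_tool ships with the cman
package) is to confirm that the cluster the node joined matches the config it
was given:

----------------------------------------
# Cluster name as reported by the running cluster manager
cman_tool status | grep -i 'cluster name'

# Cluster name/alias in the local configuration
grep '<cluster ' /etc/cluster/cluster.conf
----------------------------------------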

4. So far, so good - now for the problem. Delete the cluster definition via the
luci web app. 

Bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213076 prevents the
deletion of the /etc/cluster/cluster.conf file on node
tng3-3.lab.msp.redhat.com, so rename the file and restart the ricci service. 
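
Concretely, the manual cleanup on tng3-3 amounted to something like the
following (the backup file name is illustrative):

----------------------------------------
# Move the stale config aside, since luci could not remove it (bug 213076)
mv /etc/cluster/cluster.conf /etc/cluster/cluster.conf.bak

# Restart ricci so it sees the changed state
service ricci restart
----------------------------------------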

At this point, the cman service has these chkconfig settings:

chkconfig --list cman
cman            0:off   1:off   2:on    3:on    4:on    5:on    6:off

5. Create a new cluster containing node tng3-3.lab.msp.redhat.com.

6. After the new cluster is created, the node is automatically rebooted. After
the reboot, the node's /etc/cluster/cluster.conf does not contain an entry for
the just-created cluster. Instead, the file contains the definition of another
cluster that exists on the subnet. See attachment #1 [details].

The debug log shows what happened: ccsd retrieved one of the existing cluster
definitions from the subnet:

----------------------------------------
Oct 31 03:50:43 tng3-3 modclusterd: startup succeeded
Oct 31 03:50:43 tng3-3 clurgmgrd[1947]: <notice> Waiting for CMAN to start
Oct 31 03:50:43 tng3-3 oddjobd: oddjobd startup succeeded
Oct 31 03:50:44 tng3-3 saslauthd[2008]: detach_tty      : master pid is: 2008
Oct 31 03:50:44 tng3-3 saslauthd[2008]: ipc_init        : listening on socket:
/var/run/saslauthd/mux
Oct 31 03:50:44 tng3-3 ricci: startup succeeded
Oct 31 03:50:45 tng3-3 ccsd[1590]: Remote copy of cluster.conf is from quorate node.
Oct 31 03:50:45 tng3-3 ccsd[1590]:  Local version # : 4
Oct 31 03:50:45 tng3-3 ccsd[1590]:  Remote version #: 4
----------------------------------------

So does the following entry in /var/lib/ricci/queue, in which the cman start fails:

----------------------------------------
<?xml version="1.0"?>
<batch batch_id="1391419975" status="4">
        <module name="rpm" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="install">
                                <var mutable="false" name="success"
type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="reboot" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="reboot_now">
                                <var mutable="false" name="success"
type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="cluster" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="set_cluster.conf">
                                <var mutable="false" name="success"
type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="cluster" status="4">
                <response API_version="1.0" sequence="">
                        <function_response function_name="start_node">
                                <var mutable="false" name="success"
type="boolean" value="false"/>
                                <var mutable="false" name="error_code"
type="int" value="-1"/>
                                <var mutable="false" name="error_description"
type="string" value="service cman start failed"/>
                        </function_response>
                </response>
        </module>
</batch>
----------------------------------------
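
When start_node fails like this, one way to dig further (illustrative; nothing
Conga-specific) is to retry the failed phase by hand and watch the system log:

----------------------------------------
# Retry the failed phase manually and see what cman/ccsd log
service cman start
tail -n 50 /var/log/messages
----------------------------------------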

Actual results:
After the node reboots, it's in one of the existing clusters - not the newly
created cluster.

Expected results:
The node should be in the newly created cluster.

Additional info:

I've been able to avoid this problem by setting the chkconfig values for cman
as follows, at the very start of step 4 above:

chkconfig --level 2345 cman off
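
A fuller form of the workaround (a sketch; the exact set of services is
illustrative) is to both stop and disable the cluster daemons left over from
the deleted cluster before recreating it:

----------------------------------------
# Make sure the leftover daemons are neither running nor enabled at boot
service rgmanager stop
service cman stop
chkconfig --level 2345 rgmanager off
chkconfig --level 2345 cman off
----------------------------------------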

Comment 1 Len DiMaggio 2006-10-31 15:30:48 UTC
Created attachment 139864 [details]
cluster.conf - from one of the existing clusters

Comment 2 Stanko Kupcevic 2006-11-06 19:32:38 UTC
Note that node removal (step 4 above) wasn't completed: cman was left running
and enabled to start at boot. That is the root of this bug.
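
A simple way to confirm that leftover state on the node (illustrative) is:

----------------------------------------
# If either of these shows cman running or enabled, step 4 did not fully clean up
service cman status
chkconfig --list cman
----------------------------------------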

It is tempting to mark this as NOTABUG, but to make Conga more robust (a new
node could go astray, as shown above), this needs to be addressed.


Comment 3 Stanko Kupcevic 2006-11-06 19:37:07 UTC
The fix for this one is simple and concerns luci alone:

 - insert "chkconfig cluster_services off" call (already existent in modservice
module) between "install" and "reboot" phases (both for cluster creation and
node addition)

 - "Cluster creation" and "add node" status pages need minor changes; the
"disabling daemons" phase belongs under "installation phase", therefore the only
change is in a backend function that retrieves status (needs to group results of
first two calls as one). 
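
A sketch of the intended per-node ordering, written as the equivalent shell
steps (the package and service names are illustrative; the real work is driven
by the ricci modules, not by a script like this):

----------------------------------------
# 1. install phase: pull in the cluster packages
yum -y install cman rgmanager

# 2. proposed new step: keep the daemons from starting against a stale or
#    foreign config on the way back up
chkconfig cman off
chkconfig rgmanager off

# 3. reboot phase
reboot

# 4. after the reboot: write the new cluster.conf, then re-enable and start
#    the daemons (what set_cluster.conf and start_node do on the node's behalf)
chkconfig cman on
service cman start
----------------------------------------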


Comment 4 Ryan McCabe 2006-11-12 02:10:32 UTC
fixed in -HEAD

Comment 5 Len DiMaggio 2007-01-24 15:57:47 UTC
Cannot recreate the problem with:
luci-0.8-30.el5
ricci-0.8-30.el5

Marking the bz as verified.

