Bug 213266 - Conga - modifying a cluster node's cluster membership in a subnet with other clusters results in the wrong cluster.conf
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: conga
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Ryan McCabe
QA Contact: Corey Marthaler
Depends On:
Blocks:
Reported: 2006-10-31 10:30 EST by Len DiMaggio
Modified: 2009-04-16 18:42 EDT
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-23 11:41:46 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
cluster.conf - from one of the existing clusters (3.41 KB, text/plain)
2006-10-31 10:30 EST, Len DiMaggio

Description Len DiMaggio 2006-10-31 10:30:48 EST
Description of problem:

Conga - modifying a cluster node's cluster membership in a subnet with other
clusters results in the wrong cluster.conf

Version-Release number of selected component (if applicable):
RHEL5-Server-20061027.0
luci-0.8-21.el5
ricci-0.8-21.el5

How reproducible:
100%

Steps to Reproduce:

1. Start with nodes tng3-1.lab.msp.redhat.com through tng3-5.lab.msp.redhat.com.
These nodes are on a subnet with other existing clusters.

On the nodes that will comprise the new cluster, the cman service will have
these chkconfig settings before the cluster is created:

chkconfig --list cman
cman            0:off   1:off   2:off   3:off   4:off   5:off   6:off

2. Create a new cluster. For this example, I created a cluster with one node
(tng3-3.lab.msp.redhat.com), which results in the following entry in the
/var/lib/ricci/queue directory:

----------------------------------------
<?xml version="1.0"?>
<batch batch_id="2053224548" status="0">
        <module name="rpm" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="install">
                                <var mutable="false" name="success" type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="reboot" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="reboot_now">
                                <var mutable="false" name="success" type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="cluster" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="set_cluster.conf">
                                <var mutable="false" name="success" type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="cluster" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="start_node">
                                <var mutable="false" name="success" type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
</batch>
----------------------------------------
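
For reference, the queued batch files can be inspected directly on the node. The
loop below is only a convenience sketch; the queue path is the one shown above,
and xmllint (from libxml2) is an optional extra used for pretty-printing:

for f in /var/lib/ricci/queue/*; do
    echo "== $f"
    xmllint --format "$f" 2>/dev/null || cat "$f"
done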

3. Node tng3-3.lab.msp.redhat.com is automatically rebooted - after the reboot,
the node is correctly functioning as a member of the new cluster. 

/etc/cluster/cluster.conf contains the following:

----------------------------------------
<?xml version="1.0"?>
<cluster alias="oct31_4" config_version="1" name="oct31_4">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="tng3-3.lab.msp.redhat.com" nodeid="1" votes="1"/>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm/>
</cluster>
----------------------------------------
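
As a sanity check that the node really joined the intended cluster, the cluster
name in the local config can be compared with what cman reports (this assumes
cman_tool is present on the node):

grep '<cluster ' /etc/cluster/cluster.conf    # name/alias in the local config
cman_tool status | grep -i 'cluster name'     # cluster the node actually joined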

4. So far, so good - now for the problem. Delete the cluster definition via the
luci web app. 

Bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213076 prevents the
deletion of the /etc/cluster/cluster.conf file on node
tng3-3.lab.msp.redhat.com, so rename the file and restart the ricci service. 
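
In practice that cleanup amounts to something like the following (a sketch of
the steps just described; the backup file name is arbitrary):

mv /etc/cluster/cluster.conf /etc/cluster/cluster.conf.bak
service ricci restart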

At this point, the cman service has these chkconfig settings:

chkconfig --list cman
cman            0:off   1:off   2:on    3:on    4:on    5:on    6:off

5. Create a new cluster containing node tng3-3.lab.msp.redhat.com.

6. After the new cluster is created, the node is automatically rebooted. After
the reboot, the node's /etc/cluster/cluster.conf does not contain an entry for
the newly created cluster. Instead, the file contains the definition of another
cluster that exists on the subnet. See attachment #1.

The debug log shows what's happened - ccsd has retrieved one of the existing
cluster definitions:

----------------------------------------
Oct 31 03:50:43 tng3-3 modclusterd: startup succeeded
Oct 31 03:50:43 tng3-3 clurgmgrd[1947]: <notice> Waiting for CMAN to start
Oct 31 03:50:43 tng3-3 oddjobd: oddjobd startup succeeded
Oct 31 03:50:44 tng3-3 saslauthd[2008]: detach_tty      : master pid is: 2008
Oct 31 03:50:44 tng3-3 saslauthd[2008]: ipc_init        : listening on socket:
/var/run/saslauthd/mux
Oct 31 03:50:44 tng3-3 ricci: startup succeeded
Oct 31 03:50:45 tng3-3 ccsd[1590]: Remote copy of cluster.conf is from quorate node.
Oct 31 03:50:45 tng3-3 ccsd[1590]:  Local version # : 4
Oct 31 03:50:45 tng3-3 ccsd[1590]:  Remote version #: 4
----------------------------------------

As does the following entry in the /var/lib/ricci/queue directory:

----------------------------------------
<?xml version="1.0"?>
<batch batch_id="1391419975" status="4">
        <module name="rpm" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="install">
                                <var mutable="false" name="success" type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="reboot" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="reboot_now">
                                <var mutable="false" name="success" type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="cluster" status="0">
                <response API_version="1.0" sequence="">
                        <function_response function_name="set_cluster.conf">
                                <var mutable="false" name="success" type="boolean" value="true"/>
                        </function_response>
                </response>
        </module>
        <module name="cluster" status="4">
                <response API_version="1.0" sequence="">
                        <function_response function_name="start_node">
                                <var mutable="false" name="success" type="boolean" value="false"/>
                                <var mutable="false" name="error_code" type="int" value="-1"/>
                                <var mutable="false" name="error_description" type="string" value="service cman start failed"/>
                        </function_response>
                </response>
        </module>
</batch>
----------------------------------------

Actual results:
After the node reboots, it's in one of the existing clusters - not the newly
created cluster.

Expected results:
The node should be in the newly created cluster.

Additional info:

I've been able to avoid this problem by setting the chkconfig values for cman as
follows, at the very start of step 4 above:

chkconfig --level 2345 cman off
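
Afterwards, cman should again show as disabled in every runlevel (as in step 1):

chkconfig --list cman
cman            0:off   1:off   2:off   3:off   4:off   5:off   6:off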
Comment 1 Len DiMaggio 2006-10-31 10:30:48 EST
Created attachment 139864
cluster.conf - from one of the existing clusters
Comment 2 Stanko Kupcevic 2006-11-06 14:32:38 EST
Note that the node removal (step 4 above) was not completed: cman was left
running and enabled to start at boot. That is the root cause of this bug.

It is tempting to mark this as NOTABUG, but to make Conga more robust (a new
node could go astray, as shown above) this needs to be addressed.
Comment 3 Stanko Kupcevic 2006-11-06 14:37:07 EST
The fix for this one is simple and concerns luci alone:

 - Insert a "chkconfig cluster_services off" call (already present in the
modservice module) between the "install" and "reboot" phases, for both cluster
creation and node addition (a rough sketch of the node-side effect follows
below).

 - The "Cluster creation" and "add node" status pages need minor changes; the
"disabling daemons" phase belongs under the "installation" phase, so the only
change is in the backend function that retrieves status (it needs to group the
results of the first two calls as one).
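
For illustration only, the node-side effect of that extra phase would look
roughly like the following; the exact list of services and the way luci/modservice
wires the call in are assumptions here, not the actual patch:

# Disable the cluster daemons right after package install and before the
# reboot, so a freshly installed node cannot be pulled into an existing
# cluster on the subnet.
for svc in cman rgmanager; do
    chkconfig "$svc" off
done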
Comment 4 Ryan McCabe 2006-11-11 21:10:32 EST
fixed in -HEAD
Comment 5 Len DiMaggio 2007-01-24 10:57:47 EST
Cannot recreate the problem with:
luci-0.8-30.el5
ricci-0.8-30.el5

Marking the bz as verified.
