Bug 175139 - cman finds old cluster.conf, cannot add new node

Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: ccs
Version: 4
Platform: x86_64 Linux
Priority: medium
Severity: high
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2005-12-06 17:05 EST by Dennis Preston
Modified: 2009-04-16 16:04 EDT
Last Closed: 2006-06-21 10:24:12 EDT
Doc Type: Bug Fix
Attachments: None
Description Dennis Preston 2005-12-06 17:05:29 EST

Description of problem:
Cannot add a new cluster node; the new cluster.conf gets overwritten by the previous one.

How reproducible:
Always

Steps to Reproduce:
Start with a running 2-node cluster. Edit cluster.conf to add the 3rd node and increment config_version, then force-copy the new cluster.conf to all nodes.

do:
ccs_tool update /etc/cluster/cluster.conf
cman_tool version -r <new_number>

Either boot the system or start the cluster services; the new cluster.conf gets overwritten by the earlier version and the join fails.

If I scp the cluster.conf file to the node while the join is being rejected, the node will join.
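
For reference, a minimal sketch of the update sequence described above, using only commands that appear elsewhere in this report (the version number 7 is illustrative, not taken from the actual configuration):

# On a quorate member, after editing the file and bumping config_version:
ccs_tool update /etc/cluster/cluster.conf   # push the edited file to ccsd on the cluster nodes
cman_tool version -r 7                      # tell cman the new config version
cman_tool version                           # verify; prints e.g. "5.0.1 config 7"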

  

Actual Results:  [root@pikes03 cluster]# service ccsd start
Starting ccsd:[  OK  ]
[root@pikes03 cluster]# service cman start
Starting cman:CMAN <CVS> (built Nov 14 2005 11:02:37) installed
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
cwalk_reset_bcm_chip: enter
cwalk_reset_bcm_chip: i=0 ptr=00000100e32b1380
cwalk_reset_bcm_chip: i=1 ptr=00000100e2d5a380
general protection fault: 0000 [1] SMP

Entering kdb (current=0x000001007da9b7f0, pid 4408) on processor 1 Oops: <NULL>
due to oops @ 0xffffffffa01ab6cc
     r15 = 0xffffffffa01c5b20      r14 = 0x0000010002ff2c40
     r13 = 0x0000010002f637e0      r12 = 0x0000000000000000
     rbp = 0x0014001700400006      rbx = 0xffffffffa01c5bc0
     r11 = 0x000001007db5fc80      r10 = 0x0000000000000202
      r9 = 0x0000000000000202       r8 = 0x000001007db7e000
     rax = 0x0014001700400006      rcx = 0x00000100e0fd34f8
     rdx = 0x00010102464c457f      rsi = 0x000000000000006c
     rdi = 0xffffffffa01c5a80 orig_rax = 0xffffffffffffffff
     rip = 0xffffffffa01ab6cc       cs = 0x0000000000000010
  eflags = 0x0000000000010203      rsp = 0x000001007db7fd80
      ss = 0x000001007db7e000 &regs = 0x000001007db7fce8


On console of active cluster member
[root@pikes02 cluster]# CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit


Expected Results:  Node should have joined the cluster and mounted its GFS filesystems

Additional info:

Cluster node 1:

[root@pikes01 ~]# cman_tool version
5.0.1 config 6

[root@pikes01 ~]# more /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="6" name="pikes">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="180"/>
        <clusternodes>
                <clusternode name="pikes01" votes="1" nodeid="1">
                        <fence>
                                <method name="1">
                                        <device name="iGrid" nodename="pikes01-mgmt" spname="pikes01-sp"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="pikes02" votes="1" nodeid="2">
                        <fence>
                                <method name="1">
                                        <device name="iGrid" nodename="pikes02-mgmt" spname="pikes02-sp"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="pikes03" votes="1" nodeid="3">
                        <fence>
                                <method name="1">
                                        <device name="iGrid" nodename="pikes03-mgmt" spname="pikes03-sp"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>


Cluster Node 2:

[root@pikes02 ~]# more /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="6" name="pikes">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="180"/>
        <clusternodes>
                <clusternode name="pikes01" votes="1" nodeid="1">
                        <fence>
                                <method name="1">
                                        <device name="iGrid" nodename="pikes01-mgmt" spname="pikes01-sp"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="pikes02" votes="1" nodeid="2">
                        <fence>
                                <method name="1">
                                        <device name="iGrid" nodename="pikes02-mgmt" spname="pikes02-sp"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="pikes03" votes="1" nodeid="3">
                        <fence>
                                <method name="1">
                                        <device name="iGrid" nodename="pikes03-mgmt" spname="pikes03-sp"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
Comment 1 Christine Caulfield 2005-12-07 03:56:46 EST
> the new cluster.conf gets overwritten by the earlier version and the join fails.

That's ccsd
Comment 2 Dennis Preston 2005-12-07 12:13:46 EST
Additional Info: 

If all nodes are rebooted, cluster behaves normally.
Comment 3 Jonathan Earl Brassow 2005-12-07 16:41:14 EST
cluster.conf should be updated on the existing nodes in the cluster.

Think of it this way:
1)  The other nodes don't know anything about the new node, and won't allow it into the cluster.
2)  The new node cannot do a 'ccs_tool update' unless it is in a quorate cluster.

You should receive a failure of some sort when trying to issue the 'ccs_tool update' if the node does not belong to a quorate cluster.
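
A minimal sketch of such a check, based on the cman_tool and clustat output shown later in this report:

# Confirm membership and quorum before attempting the update:
cman_tool status | grep -E 'Membership state|Quorum'   # expect "Cluster-Member" and a quorum value
clustat | head -1                                      # expect "Member Status: Quorate"

# Only then push the new configuration:
ccs_tool update /etc/cluster/cluster.conf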
Comment 4 Dennis Preston 2005-12-07 16:47:21 EST
Sorry I was not clearer. The 3-node cluster.conf is copied to all 3 nodes, the 
quorate cluster as well as the new node. Then ccs_tool is run for insurance, 
then cman_tool is run to update the running cluster. The new node picks up a 
copy of the cluster.conf file other than the one copied to all nodes: it gets 
a down-rev copy of the cluster.conf. cman_tool version reports the correct 
version number, but looking at the cluster.conf on the 3rd node shows a 2-node 
cluster.conf. The rejection is another symptom of the 3-node cluster.conf not 
being active while being reported as active.
Comment 5 Jonathan Earl Brassow 2005-12-07 16:52:46 EST
Also, while it is possible to:

# vi /etc/cluster/cluster.conf
# ccs_tool update /etc/cluster/cluster.conf

I would:

# cp /etc/cluster/cluster.conf foo.conf
# vi foo.conf
# ccs_tool update foo.conf
Comment 6 Dennis Preston 2005-12-07 16:59:20 EST
The problem is that the new, improved cluster.conf is not really running; it 
is reported as running in error. The correct file is there until ccsd and cman 
start; then it gets overwritten. Here's the full script. Syntax is:
   scale_nodes add <mgmt port IP> <sp port IP>

#!/bin/bash
# ----------------------------------------------------------------------
#
# scale_nodes	scales the number of nodes in the iGrid system
#
# description:  Changes the iGrid configuration to add or retire nodes
#               from the iGrid system.  Commands a node to join or leave
#               the iGrid system
# ----------------------------------------------------------------------


# ------------- input variables  -----------------------------------------

ACTION=$1;
NODE=$2;
SPADDR=$3;

SCRIPTNAME=scale_nodes
VERSION="1.2.2.20"

CFGFILE=/etc/cluster/cluster.conf;
CFGBACK=/etc/cluster/cluster.bak;
TMPCFG=/tmp/cluster.tmp;
TMP2CFG=/tmp/cluster.tmp2;
TMPCONF=/tmp/cluster.conf;
JUNK=/tmp/junk;
JUNK2=/tmp/junk2;
JUNK3=/tmp/junk3;
JUNK4=/tmp/junk4;
HOSTSFILE=/etc/hosts
RHOSTSFILE=/root/.rhosts
TEMPLATE=/opt/crosswalk/cluster.conf.template
HOSTNAME=`hostname`
GRIDNAME=${HOSTNAME%??}

# ------------- the script itself --------------------------------------
#
# edit_config
#
edit_config()
{
	num_nodes=$1
	gridname=$2
	echo "EDITING $gridname with $num_nodes NODES"
	sed "s/cluster_name/$gridname/g" $TEMPLATE > $TMPCFG
	sed "13 w $JUNK3" $TMPCFG
	sed "13 d" $TMPCFG > $TMP2CFG
	sed "5,11 w $JUNK" $TMP2CFG
	sed "19 w $JUNK2" $TMP2CFG
	sed "5,11 s/xx/01/g" $TMP2CFG > $TMPCONF
	sed "5 s/yy/1/" $TMPCONF > $TMPCFG
	sed "19 s/xx/01/" $TMPCFG > $TMP2CFG
	sed "19 s/zz/1/" $TMP2CFG > $TMPCONF
	start_line=5
	end_line=11
	end_line2=19
	fail_start=19
	fail_end=21
	for ((x=2; x<=$num_nodes; x++))
		do
			sed "$end_line r $JUNK" $TMPCONF > $TMPCFG
			let "start_line = $start_line + 7"
					let "end_line = $start_line + 6"
			let "end_line2 = $end_line2 + 7"
			let "fail_start = $fail_start + 7"
					let "fail_end = $fail_end + 8"
			sed "$end_line2 r $JUNK2" $TMPCFG > $TMP2CFG
					mv -f $TMP2CFG $TMPCFG
			let "end_line2 = $end_line2 + 1"
			if [ "$x" -lt "10" ] ; then
				sed "$start_line,$end_line s/xx/0$x/g" $TMPCFG 
> $TMP2CFG
				sed "$end_line2 s/xx/0$x/" $TMP2CFG > $TMPCFG
			else
		 		sed "$start_line,$end_line s/xx/$x/g" $TMPCFG 
> $TMP2CFG
				sed "$end_line2 s/xx/$x/" $TMP2CFG > $TMPCFG
			fi
			sed "$start_line s/yy/$x/" $TMPCFG > $TMP2CFG
			sed "$end_line2 s/zz/$x/" $TMP2CFG > $TMPCONF
		done
	if [ "$num_nodes" -eq "2" ] ; then
		sed "19 r $JUNK3" $TMPCONF > $TMPCFG
		mv $TMPCFG $TMPCONF
	else
		start_line=20
		let "number_nodes=$num_nodes - 2" 
		let "end_line=7 * $number_nodes"
		let "end_line2=$start_line + $end_line"
		sed "$end_line2 i \        <cman/>" $TMPCONF > $TMPCFG
		mv $TMPCFG $TMPCONF
	fi

	sed "$fail_start,$fail_end w $JUNK4" $TMPCONF
	for ((z=2; z<=$num_nodes; z++))
		do
			sed "$fail_end r $JUNK4" $TMPCONF > $TMP2CFG
			let "fail_line = $fail_end + 1"
			sed "$fail_line s/igridnodes1/igridnodes$z/" $TMP2CFG > $TMPCFG
			let "fail_line = $fail_line + 1"
			for ((x=1; x<=$num_nodes; x++))
				do
					let "y = $x + ($z - 1)"
					if [ "$y" -gt "$num_nodes" ] ; then
						let "y = $y - $num_nodes"
					fi
					if [ "$x" -lt "10" ] && [ "$y" -lt "10" ] ; then
						sed "$fail_line s/${gridname}0$x/${gridname}0$y/" $TMPCFG > $TMP2CFG
					else
						if [ "$x" -ge "10" ] && [ "$y" -ge "10" ] ; then
							sed "$fail_line s/${gridname}$x/${gridname}$y/" $TMPCFG > $TMP2CFG
						else
							if [ "$x" -lt "10" ] && [ "$y" -ge "10" ] ; then
								sed "$fail_line s/${gridname}0$x/${gridname}$y/" $TMPCFG > $TMP2CFG
							else
								sed "$fail_line s/${gridname}$x/${gridname}0$y/" $TMPCFG > $TMP2CFG
							fi
						fi
					fi
					let "fail_line = $fail_line + 1"
					mv $TMP2CFG $TMPCFG
				done
			let "fail_end = $fail_end + $num_nodes + 2"
			mv $TMPCFG $TMPCONF
		done
}

add_node()
{
	num_nodes=$1

	echo "                </failoverdomains>" >> $TMPCONF
	echo "                <resources>" >> $TMPCONF
	
	num_vips=`cat $CFGFILE | awk '/monitor_link/{print "1"}' | wc -l`
	let "old_num_nodes = $num_nodes - 1"
	let "start_line = 14 + (7 * $old_num_nodes) + ((2 + $old_num_nodes) * 
$old_num_nodes)"
	let "end_line = $start_line + (3 * $num_vips) + $num_vips"
	sed "$start_line,$end_line w $JUNK" $CFGFILE
	
	let "fail_end = $fail_end + 2"
	sed "$fail_end r $JUNK" $TMPCONF > $TMPCFG
	mv $TMPCFG $TMPCONF
			
	echo "		<service domain=\"igridnodes1\" name=\"snapshot\">" >> 
$TMPCONF
	echo "			<script name=\"snapshot\" 
file=\"/mnt/crosswalk/snapshot/run_snap\"/>" >> $TMPCONF
	echo "		</service>" >> $TMPCONF
	
	echo "		<service domain=\"igridnodes2\" 
name=\"email_notifier\">" >> $TMPCONF
	echo "			<script name=\"email_notifier\" 
file=\"/mnt/crosswalk/email_notifier/run_email_notifier\"/>" >> $TMPCONF
	echo "		</service>" >> $TMPCONF
	
	nvak=0
	nvbak=`cat $CFGFILE | awk '/nvmain/{print "1"}'`
	if [ "$nvbak" -eq "1" ] ; then 
		echo "		<service domain=\"igridnodes1\" 
name=\"nvbackup\">" >> $TMPCONF
		echo "		        <script name=\"nvbackup\" 
file=\"/mnt/igridbackup/nvbackup/nvmain\"/>" >> $TMPCONF
		echo "		</service>" >> $TMPCONF
	fi
	echo "	</rm>" >> $TMPCONF
	echo "</cluster>" >> $TMPCONF
		
	rm -f /tmp/junk*
	rm -f /tmp/cluster.tmp*
}

del_node()
{
        num_nodes=$1

        echo "                </failoverdomains>" >> $TMPCONF
        echo "                <resources>" >> $TMPCONF

        num_vips=`cat $CFGFILE | awk '/monitor_link/{print "1"}' | wc -l`
        let "old_num_nodes = $num_nodes + 1"
        let "start_line = 14 + (7 * $old_num_nodes) + ((2 + $old_num_nodes) * $old_num_nodes)"
        let "end_line = $start_line + (3 * $num_vips) + $num_vips"
        sed "$start_line,$end_line w $JUNK" $CFGFILE

        let "fail_end = $fail_end + 2"
        sed "$fail_end r $JUNK" $TMPCONF > $TMPCFG
        mv $TMPCFG $TMPCONF

        echo "      <service domain=\"igridnodes1\" name=\"snapshot\">" >> $TMPCONF
        echo "			<script name=\"snapshot\" file=\"/mnt/crosswalk/snapshot/run_snap\"/>" >> $TMPCONF
        echo "		</service>" >> $TMPCONF

        echo "		<service domain=\"igridnodes2\" name=\"email_notifier\">" >> $TMPCONF
        echo "			<script name=\"email_notifier\" file=\"/mnt/crosswalk/email_notifier/run_email_notifier\"/>" >> $TMPCONF
        echo "		</service>" >> $TMPCONF

        nvak=0
        nvbak=`cat $CFGFILE | awk '/nvmain/{print "1"}'`
        if [ "$nvbak" -eq "1" ] ; then
                echo "          <service domain=\"igridnodes1\" name=\"nvbackup\">" >> $TMPCONF
                echo "                  <script name=\"nvbackup\" file=\"/mnt/igridbackup/nvbackup/nvmain\"/>" >> $TMPCONF
                echo "          </service>" >> $TMPCONF
        fi
        echo "  </rm>" >> $TMPCONF
        echo "</cluster>" >> $TMPCONF

        rm -f /tmp/junk*
        rm -f /tmp/cluster.tmp*

#	node_cnt=`echo "$NODE" | awk -f '{x = length()} {y = x - 1} {z = substr($0,y,2)} {print z}'`
#	let "shift = $old_num_nodes - $node_cnt"
#	for ((z=0; z<$shift; z++))
#		do
#			echo "TBD"
#		done
}


add_hosts()
{
	num_nodes=$1
    gridname=$2
	
	let "last_num = $num_nodes - 1"

	if [ "$last_num" -lt "10" ] ; then
		last_node=${gridname}0$last_num
	else
		last_node=${gridname}$last_num
	fi

	last_hbaddr=`cat $HOSTSFILE | awk -v lastnode=$last_node '{if ($2 == lastnode) print $1}'`

	let x=0
        for quad in ${last_hbaddr//./\ }; do
                ip1[((x++))]=$quad
        done

	let "ip1[3] = ${ip1[3]} + 1"

	if [ "$num_nodes" -lt "10" ] ; then
		new_node=${gridname}0$num_nodes
	else
		new_node=${gridname}$num_nodes
	fi
	
	echo "${ip1[0]}.${ip1[1]}.${ip1[2]}.${ip1[3]}		$new_node" >> 
$HOSTSFILE
	echo "$NODE		${new_node}-mgmt" >> $HOSTSFILE
	echo "$SPADDR		${new_node}-sp" >> $HOSTSFILE
	echo "$new_node" >> $RHOSTSFILE
	echo "${new_node}-mgmt" >> $RHOSTSFILE
	echo "${new_node}-sp" >> $RHOSTSFILE
}

del_hosts()
{
	sed "/$NODE/d" $HOSTSFILE > $JUNK
	mv $JUNK $HOSTSFILE
	sed "/$NODE/d" $RHOSTSFILE > $JUNK
	mv $JUNK $RHOSTSFILE
	
}


# See how we were called.

case "$1" in
  add)
	cp $CFGFILE $CFGBACK
	nodenum=`cat $CFGFILE | awk '/nodeid/{print "1"}' | wc -l`
	let "nodenum = $nodenum + 1"
	grid_name=`cat $CFGFILE | awk '/config_version/{x = length($3)} {y = x - 8} {z = substr($3, 7, y)} {print z}'`
	
	edit_config $nodenum $grid_name 
	add_node $nodenum
	add_hosts $nodenum $grid_name
	
	ccs_ver_tmp=`cat $CFGFILE | awk '/config_version/{print $2}' | awk '{z = substr($0,17,3)} {print z}'`
	ccs_ver=${ccs_ver_tmp%?}
	let "new_ccs_ver = $ccs_ver + 1"
	sed "2 s/config_version=\"1\"/config_version=\"$new_ccs_ver\"/" $TMPCONF > $CFGFILE
	rm -f $TMPCONF
	
	thishost=`/bin/hostname`
	mynum=${thishost:(-2):2}
	for ((x=1; x<=$nodenum; x++))
	do
		if [ "$x" -ne "$mynum" ]; then
			if [ "$x" -lt "10" ] ; then
				/usr/bin/scp $CFGFILE ${gridname}0${x}:/etc/cluster
				/usr/bin/scp $HOSTSFILE ${gridname}0${x}:/etc
				/usr/bin/scp $RHOSTSFILE ${gridname}0${x}:/root
			else
				/usr/bin/scp $CFGFILE ${gridname}${x}:/etc/cluster
				/usr/bin/scp $HOSTSFILE ${gridname}${x}:/etc
				/usr/bin/scp $RHOSTSFILE ${gridname}${x}:/root
			fi
		fi
	done
		
	/sbin/cman_tool version -r $new_ccs_ver

	if [ "$x" -lt "10" ] ; then
		/usr/bin/ssh ${gridname}0${nodenum} /sbin/service igrid start
	else
		/usr/bin/ssh ${gridname}${nodenum} /sbin/service igrid start
	fi
    if [ -f /etc/samba/smb.conf ]
    then
      line=`cat /etc/hosts|grep ${NODE}`
      set $line
      HOST=`echo $line|cut -d " " -f2`
      THISBIOSNAME=`hostname | tr [a-z] [A-Z]`
      BIOSNAME=`echo $HOST | tr [a-z] [A-Z]`
      sed -e "s/netbios name = $THISBIOSNAME/netbios name = $BIOSNAME"/ \
          -e "s/server string =$HOSTNAME Samba Server/server string =$HOST 
Samba Server"/ \
      /etc/samba/smb.conf > /etc/samba/smb.tmp
      $RCP /etc/samba/smb.tmp $HOST:/etc/samba/smb.conf
    fi
    if [ -f /etc/ldap.conf ]
    then
       $RCP /etc/ldap.conf $HOST:/etc/ldap.conf
    fi
    if [ -f /etc/openldap/ldap.conf ]
    then
       $RCP /etc/openldap/ldap.conf $HOST:/etc/openldap/ldap.conf
    fi
    if [ -f /etc/krb5.conf ]
    then
       $RCP /etc/krb5.conf $HOST:/etc/krb5.conf
    fi
    if [ -f /var/kerberos/krb5kdc ]
    then
       $RCP /var/kerberos/krb5kdc $HOST:/var/kerberos/krb5kdc
    fi
	;;

  delete)
	ssh $NODE /sbin/service igrid stop
	cp $CFGFILE $CFGBACK
	nodenum=`cat $CFGFILE | awk '/nodeid/{print "1"}' | wc -l`
	let "nodenum = $nodenum - 1"
	grid_name=`cat $CFGFILE | awk '/config_version/{x = length($3)} {y = x - 8} {z = substr($3, 7, y)} {print z}'`

	edit_config $nodenum $grid_name 
	del_node $nodenum
	del_hosts 
	
	ccs_ver_tmp=`cat $CFGFILE | awk '/config_version/{print $2}' | awk '{z = substr($0,17,3)} {print z}'`
	ccs_ver=${ccs_ver_tmp%?}
	let "new_ccs_ver = $ccs_ver + 1"
	sed "2 s/config_version=\"1\"/config_version=\"$new_ccs_ver\"/" $TMPCONF > $CFGFILE
	rm -f $TMPCONF

	thishost=`/bin/hostname`
	mynum=${thishost:(-2):2}
	for ((x=1; x<=$nodenum; x++))
	do
		if [ "$x" -ne "$mynum" ]; then
			if [ "$x" -lt "10" ] ; then
				/usr/bin/scp $CFGFILE ${gridname}0${x}:/etc/cluster
				/usr/bin/scp $HOSTSFILE ${gridname}0${x}:/etc
				/usr/bin/scp $RHOSTSFILE ${gridname}0${x}:/root
			else
				/usr/bin/scp $CFGFILE ${gridname}${x}:/etc/cluster
				/usr/bin/scp $HOSTSFILE ${gridname}${x}:/etc
				/usr/bin/scp $RHOSTSFILE ${gridname}${x}:/root
			fi
		fi
	done

    sed -e "/$NODE/d" /mnt/crosswalk/authorized_keys 
> /mnt/crosswalk/authorized_keys.tmp
    cp /mnt/crosswalk/authorized_keys.tmp /mnt/crosswalk/authorized_keys
    rm -rf /mnt/crosswalk/authorized_keys.tmp

    num_nodes=`cat /etc/cluster/cluster.conf|grep "<clusternode "|wc -l`
    line=`cat /etc/hosts|grep ${NODE}-mgmt`
    set $line
    NODEIP=`echo $line|cut -d " " -f1`

    for ((x=1; x<=$num_nodes; x++))
    do
       if [ "$x" -lt "10" ]
       then
          su cwsupport -c "ssh ${GRIDNAME}0${x}-sp access delete trust $NODEIP"
       else
          su cwsupport -c "ssh ${GRIDNAME}${x}-sp access delete trust $NODEIP"
       fi
    done

	/sbin/cman_tool version -r $new_ccs_ver
	;;

  join)
	ssh $NODE /sbin/service igrid start
	ssh $NODE /opt/crosswalk/bin/relocate_keys join
	;;

  leave)
    ssh $NODE /opt/crosswalk/bin/relocate_keys leave
	ssh $NODE /sbin/service igrid stop
	;;
	
  -v)
	echo "$SCRIPTNAME Version $VERSION"
	;;
	
  -V)
	echo "$SCRIPTNAME Version $VERSION"
	;;

  *)
	echo $"Usage: $0 {add|delete|join|leave|-v}"
	;;
esac

exit 0
Comment 7 Jonathan Earl Brassow 2005-12-07 17:02:17 EST
Try:

1) update current cluster
2) start/add new node

If you start ccsd on the new node (either explicitly via 'ccsd' or implicitly via the init script) before the 
quorate cluster is updated, the new node will grab the current working version of the cluster.conf file.  
The current working version is not the one on disk, but rather the one in memory.  'ccs_tool update' will 
push a proposed cluster.conf file to the nodes in the cluster - including updating their in-memory 
copy.

I'll try to look over the above script and see what is going on.
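
A minimal sketch of that ordering, assuming the hostnames used in this report (pikes01 is an existing quorate member, pikes03 is the node being added; the version number is illustrative, not from the actual config):

# 1) On pikes01 (quorate member): propagate the 3-node config and bump the version.
ccs_tool update /etc/cluster/cluster.conf
cman_tool version -r 7

# 2) Only afterwards, on pikes03: start the daemons, so its ccsd pulls the
#    cluster's current in-memory configuration rather than an old on-disk copy.
service ccsd start
service cman start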
Comment 8 Dennis Preston 2005-12-07 17:07:34 EST
The source-controlled version of the script is 1 day out of date. The line that reads
	/sbin/cman_tool version -r $new_ccs_ver

should be
	/sbin/ccs_tool update /etc/cluster/cluster.conf
	/sbin/cman_tool version -r $new_ccs_ver

The copy on the test system is the bottom version.

We are updating the running cluster with ccs_tool and then cman_tool; that's 
why this is a bug. The copy from memory is trashing the new one even after the 
updates are run.
Comment 9 Dennis Preston 2005-12-09 11:47:21 EST
Test results from 12/09/05:

Start condition: 2-node cluster, quorate (all active nodes must be rebooted to set this condition).

[root@pikes01 ~]# clustat
Member Status: Quorate, Group Member

  Member Name                              State      ID
  ------ ----                              -----      --
  pikes02                                  Online     0x0000000000000002
  pikes01                                  Online     0x0000000000000001

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  pikes-mgmt           pikes01                        started
  cifs1                pikes01                        started
  cifs2                pikes02                        started
  10.251.0.96          pikes01                        started
  10.251.0.97          pikes02                        started
  10.251.0.98          pikes01                        started
  snapshot             pikes01                        started
  email_notifier       pikes02                        started
  nvbackup             pikes02                        started

[root@pikes01 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 9
Cluster name: pikes
Cluster ID: 3377
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 1
Total_votes: 2
Quorum: 1
Active subsystems: 32
Node name: pikes01
Node addresses: 10.10.10.111

Ran the update script (does a ccs_tool update update.conf):

[root@pikes01 bin]# clustat
Member Status: Quorate, Group Member

  Member Name                              State      ID
  ------ ----                              -----      --
  pikes02                                  Online     0x0000000000000002
  pikes01                                  Online     0x0000000000000001

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  pikes-mgmt           pikes01                        started
  cifs1                pikes01                        started
  cifs2                pikes02                        started
  10.251.0.96          pikes01                        started
  10.251.0.97          pikes02                        started
  10.251.0.98          pikes01                        started
  snapshot             pikes01                        started
  email_notifier       pikes02                        started
  nvbackup             (pikes02                     ) stopped

[root@pikes01 bin]# cman_tool version
5.0.1 config 10

[root@pikes01 bin]# cman_tool status
Protocol version: 5.0.1
Config version: 10
Cluster name: pikes
Cluster ID: 3377
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 1
Total_votes: 2
Quorum: 1
Active subsystems: 32
Node name: pikes01
Node addresses: 10.10.10.111

[root@pikes01 cluster]# ls -al
total 44
drwxr-xr-x    2 root root  4096 Dec  9 08:23 .
drwxr-xr-x  104 root root 12288 Dec  9 08:09 ..
-rw-r-----    1 root root  3554 Dec  9 08:23 .cluster.conf
-rw-r-----    1 root root  3554 Dec  9 08:23 cluster.conf
-rw-r--r--    1 root root  3554 Dec  9 08:22 update.conf

cluster.conf and update.conf are configured for a 3-node cluster on all 3 nodes.

clustat from the 3rd node:

[root@pikes03 cluster]# clustat
Could not connect to cluster service
[root@pikes03 cluster]#

This is as expected. Boot the 3rd node into the cluster:

[root@pikes03 cluster]# chkconfig --level 345 igrid on
[root@pikes03 cluster]# reboot

Console output from the new node:

Starting igrid:  cwalk_init: enter
qla2x00: cwalk_open: opening module, cnt=1
CMAN <CVS> (built Nov 14 2005 11:02:37) installed
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected

Console output from existing node #2, the node reporting the rejection:

[root@pikes02 cluster]# CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: Join request from pikes03 rejected, exceeds two node limit

Reboot the new node again, just for grins; same result.

Reboot node 1; it rejoins the cluster. Note the expected votes count has been incremented:

[root@pikes01 ~]# clustat
Member Status: Quorate, Group Member

  Member Name                              State      ID
  ------ ----                              -----      --
  pikes02                                  Online     0x0000000000000002
  pikes01                                  Online     0x0000000000000001

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  pikes-mgmt           pikes01                        started
  cifs1                pikes01                        started
  cifs2                pikes02                        started
  10.251.0.96          pikes01                        started
  10.251.0.97          pikes02                        started
  10.251.0.98          pikes01                        started
  snapshot             pikes01                        started
  email_notifier       pikes02                        started
  nvbackup             pikes01                        starting

[root@pikes01 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 10
Cluster name: pikes
Cluster ID: 3377
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 3
Total_votes: 2
Quorum: 1
Active subsystems: 32
Node name: pikes01
Node addresses: 10.10.10.111

Status from node 2 (also incremented after the node 1 reboot):

[root@pikes02 cluster]# cman_tool status
Protocol version: 5.0.1
Config version: 10
Cluster name: pikes
Cluster ID: 3377
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 3
Total_votes: 2
Quorum: 1
Active subsystems: 32
Node name: pikes02
Node addresses: 10.10.10.112

Reboot node 3: now it gets strange.

Node 3 console:

Starting igrid:  cwalk_init: enter
qla2x00: cwalk_open: opening module, cnt=1
CMAN <CVS> (built Nov 14 2005 11:02:37) installed
CMAN: Cluster membership rejected
CMAN: quorum regained, resuming activity
DLM <CVS> (built Nov 14 2005 11:02:43) installed
Lock_Harness <CVS> (built Nov 14 2005 10:52:54) installed
GFS <CVS> (built Nov 14 2005 10:52:37) installed
GFS: Trying to join cluster "lock_dlm", "pikes:crosswalk"
Lock_DLM (built Nov 14 2005 10:52:44) installed
GFS: fsid=pikes:crosswalk.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:crosswalk.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:crosswalk.2: jid=2: Looking at journal...
GFS: fsid=pikes:crosswalk.2: jid=2: Done
GFS: Trying to join cluster "lock_dlm", "pikes:nvbackup"
GFS: fsid=pikes:nvbackup.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:nvbackup.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:nvbackup.2: jid=2: Looking at journal...
GFS: fsid=pikes:nvbackup.2: jid=2: Done
GFS: Trying to join cluster "lock_dlm", "pikes:c4"
GFS: fsid=pikes:c4.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:c4.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:c4.2: jid=2: Looking at journal...
GFS: fsid=pikes:c4.2: jid=2: Done
GFS: Trying to join cluster "lock_dlm", "pikes:n1"
GFS: fsid=pikes:n1.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:n1.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:n1.2: jid=2: Looking at journal...
GFS: fsid=pikes:n1.2: jid=2: Done
GFS: fsid=pikes:n1.2: Scanning for log elements...
GFS: fsid=pikes:n1.2: Found 0 unlinked inodes
GFS: fsid=pikes:n1.2: Found quota changes for 0 IDs
GFS: fsid=pikes:n1.2: Done
GFS: Trying to join cluster "lock_dlm", "pikes:n2"
GFS: fsid=pikes:n2.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:n2.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:n2.2: jid=2: Looking at journal...
GFS: fsid=pikes:n2.2: jid=2: Done
GFS: Trying to join cluster "lock_dlm", "pikes:n3"
GFS: fsid=pikes:n3.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:n3.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:n3.2: jid=2: Looking at journal...
GFS: fsid=pikes:n3.2: jid=2: Done
GFS: Trying to join cluster "lock_dlm", "pikes:n4"
GFS: fsid=pikes:n4.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:n4.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:n4.2: jid=2: Looking at journal...
GFS: fsid=pikes:n4.2: jid=2: Done
GFS: Trying to join cluster "lock_dlm", "pikes:n5"
GFS: fsid=pikes:n5.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:n5.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:n5.2: jid=2: Looking at journal...
GFS: fsid=pikes:n5.2: jid=2: Done
GFS: Trying to join cluster "lock_dlm", "pikes:c3"
GFS: fsid=pikes:c3.2: Joined cluster. Now mounting FS...
GFS: fsid=pikes:c3.2: jid=2: Trying to acquire journal lock...
GFS: fsid=pikes:c3.2: jid=2: Looking at journal...
GFS: fsid=pikes:c3.2: jid=2: Done
GFS: fsid=pikes:c3.2: Scanning for log elements...
GFS: fsid=pikes:c3.2: Found 0 unlinked inodes
GFS: fsid=pikes:c3.2: Found quota changes for 0 IDs
GFS: fsid=pikes:c3.2: Done
[  OK  ]

Red Hat Enterprise Linux AS release 4 (Nahant Update 2)
Kernel 2.6.9-22.EL_1-2-2-2_dgs_smp on an x86_64

Console Login:

Node 2 console:

[root@pikes02 cluster]# CMAN: Join request from pikes03 rejected, exceeds two node limit
CMAN: got WAIT barrier not in phase 1 sm.33554442.3.10.3 (2)

No output to the node 1 console.

Node 3 has now joined the cluster, but will get rejected by node 2 until node 2 is rebooted.

Console Login: root
Password:
Last login: Fri Dec  9 09:02:08 on ttyS0
[root@pikes03 ~]# clustat
Member Status: Quorate, Group Member

  Member Name                              State      ID
  ------ ----                              -----      --
  pikes01                                  Online     0x0000000000000001
  pikes02                                  Online     0x0000000000000002
  pikes03                                  Online     0x0000000000000003

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  pikes-mgmt           pikes01                        started
  cifs1                pikes01                        started
  cifs2                pikes02                        started
  10.251.0.96          pikes01                        started
  10.251.0.97          pikes02                        started
  10.251.0.98          pikes01                        started
  snapshot             pikes01                        started
  email_notifier       pikes02                        started
  nvbackup             pikes01                        started

[root@pikes03 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 10
Cluster name: pikes
Cluster ID: 3377
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 32
Node name: pikes03
Node addresses: 10.10.10.113

Once node 2 is rebooted, normal cluster operations can resume.
Comment 10 Jonathan Earl Brassow 2005-12-13 14:32:59 EST
Looks like the config file is being propagated... original issue solved?

The fact that the third node cannot automatically join is a separate, known
issue.  CMAN operates in a special (two-node) mode when there are only two
nodes.  This cannot be changed unless the cluster is brought down and back up.
Adding subsequent nodes after 3 does not have this problem.
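
For illustration, a minimal sketch of the <cman> element in each mode, modeled on the fragment Dennis posts in comment 12 below (the three-node values are assumptions, not taken from the actual configuration):

<!-- Two nodes: special two-node mode, quorum possible with a single vote -->
<cman expected_votes="1" two_node="1"/>

<!-- Three or more nodes: two_node must be off -->
<cman expected_votes="3" two_node="0"/>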

Comment 11 Jim Parsons 2005-12-13 15:25:14 EST
In addition, the tags and attrs necessary to tell cman to use two_node mode are
not present in the original conf file way, way, up above. Finally, there is no
<fencedevices> section. I am pretty sure fenced will just go to the beach
without any fence devices declared.

Um, I know that Crosswalk uses their own UI, but it may be worth the effort to
build the cluster.conf file for this system offline using s-c-cluster, and then
compare the resulting XML files. In the cluster.conf file above, there is not
even a closing </cluster> tag...dunno, maybe it was snipped off when pasting to
this ticket.
Comment 12 Dennis Preston 2005-12-13 15:43:03 EST
Looks like a very bad paste. I got the 3-to-4-node add working by creating 2 
copies of the new cluster.conf, cluster.conf and update.conf, identical files. 
I then run ccs_tool and cman_tool and join the 4th node. Works well. I am 
going to try this method on the 2-node case and observe.

This whole section may be missing from the file I sent:

        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_cwigrid" name="iGrid"/>
        </fencedevices>
.. all the way down
        </rm>
</cluster>
Comment 13 Jonathan Earl Brassow 2006-02-15 12:38:27 EST
If you are mucking around copying /etc/cluster/cluster.conf, it is my guess that
you are attacking the problem incorrectly.

Is this issue still a problem for you?
