Bug 612300

Summary: conga will fail to start a new node that is added via luci
Product: Red Hat Enterprise Linux 5
Reporter: Shane Bradley <sbradley>
Component: conga
Assignee: Ryan McCabe <rmccabe>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: low
Version: 5.5
CC: bbrock, cluster-maint, tao, wmealing
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: conga-0.12.2-16.el5
Doc Type: Bug Fix
Last Closed: 2011-01-13 22:29:24 UTC
Attachments: patch to fix bug

Description Shane Bradley 2010-07-07 19:50:23 UTC
Description of problem:

The install error that luci reports: "A problem occurred when starting
this node: service cman start failed: "

In the current implementation of luci/ricci, when a new node is added, a
skeleton cluster.conf containing only the updated cluster name is
written to it. Joining the node then relies on "ccsd" finding the
cluster by that name on the network. If the cluster (identified by the
cluster name in the skeleton cluster.conf) cannot be found, starting
cman fails: cluster.conf has no nodes defined, so the node can neither
join the original cluster nor form a new one.

The name of the cluster this node is trying to join is
"zylog_cluster2", which is the name in the skeleton cluster.conf.

$  cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="zylog_cluster2">
       <fence_daemon post_fail_delay="0" post_join_delay="3"/>
       <clusternodes/>
       <cman/>
       <fencedevices/>
       <rm/>
</cluster>
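For contrast, a populated cluster.conf on a working member defines each
node inside <clusternodes>; the "local node name ... not found in
cluster.conf" errors in the log below happen precisely because the
skeleton above contains no <clusternode> entry for "clusternode3". A
hypothetical populated version (node names, IDs, and votes are
illustrative, not taken from the actual cluster):

```xml
<?xml version="1.0"?>
<cluster config_version="2" name="zylog_cluster2">
       <fence_daemon post_fail_delay="0" post_join_delay="3"/>
       <clusternodes>
               <clusternode name="clusternode1" nodeid="1" votes="1"/>
               <clusternode name="clusternode2" nodeid="2" votes="1"/>
               <clusternode name="clusternode3" nodeid="3" votes="1"/>
       </clusternodes>
       <cman/>
       <fencedevices/>
       <rm/>
</cluster>
```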

Here is the log from the 3rd node when it was added to the cluster.
$ tail -f /var/log/messages
Jul  7 14:37:25 clusternode3 ricci: startup succeeded
Jul  7 14:42:59 clusternode3 ccsd[9097]: Starting ccsd 2.0.115:
Jul  7 14:42:59 clusternode3 ccsd[9097]:  Built: May 25 2010 04:32:01
Jul  7 14:42:59 clusternode3 ccsd[9097]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
Jul  7 14:42:59 clusternode3 ccsd[9097]: cluster.conf (cluster name = zylog_cluster2, version = 1) found.
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] AIS Executive Service: started and ready to provide service.
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] local node name "clusternode3" not found in cluster.conf
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] Error reading CCS info, cannot start
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] Error reading config from CCS
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file).
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] AIS Executive Service: started and ready to provide service.
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] local node name "clusternode3" not found in cluster.conf
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] Error reading CCS info, cannot start
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] Error reading config from CCS
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file).
Jul  7 14:43:28 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 30 seconds.
Jul  7 14:43:58 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 60 seconds.
Jul  7 14:44:28 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 90 seconds.
Jul  7 14:44:58 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 120 seconds.
Jul  7 14:45:28 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 150 seconds.
--------------------------------------------------------------------------------

What is happening is that "ccsd" is unable to find the cluster by its
cluster name when it broadcasts. By default, ccsd simply searches the
network for a cluster with a matching name.

Version-Release number of selected component (if applicable):
luci-0.12.2-12.el5
ricci-0.12.2-12.el5

How reproducible:
Does not reproduce on all clusters: 100% reproducible on clusterA, 0%
on clusterB.

Steps to Reproduce:
1. Create a cluster
2. Add existing cluster to luci 
3. Add a new node to cluster
  
Actual results:
The new node does not join the cluster, since it cannot find the
cluster it was supposed to join.

Expected results:
The new node should join the cluster.

Additional info:
The fix is simple. Instead of relying on ccsd to find the correct
cluster and fetch the updated cluster.conf, copy the cluster.conf that
was sent to the original nodes to the new node as well, and do not
start ccsd on the new node until the updated cluster.conf has been
copied there.
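The proposed flow can be sketched as a hypothetical shell sequence,
simulated locally with temp directories (the actual attached patch does
this inside luci/ricci over the network; all paths, file contents, and
the grep-based guard here are illustrative assumptions, not the patch
itself):

```shell
# Stand-ins for an existing cluster member and the node being added.
MEMBER=$(mktemp -d)
NEWNODE=$(mktemp -d)

# The existing member already holds the updated config with nodes defined.
cat > "$MEMBER/cluster.conf" <<'EOF'
<cluster config_version="2" name="zylog_cluster2">
       <clusternodes>
               <clusternode name="clusternode3" nodeid="3" votes="1"/>
       </clusternodes>
</cluster>
EOF

# The step the fix adds: copy the populated config to the new node
# BEFORE ccsd/cman are started there.
cp "$MEMBER/cluster.conf" "$NEWNODE/cluster.conf"

# Guard: only start the cluster stack once nodes are actually defined,
# so cman never runs against the node-less skeleton.
if grep -q '<clusternode ' "$NEWNODE/cluster.conf"; then
    echo "cluster.conf populated, OK to start cman"
else
    echo "skeleton cluster.conf, refusing to start cman"
fi
```

With the skeleton config from the report, the same guard would print the
"refusing to start" branch instead, which is exactly the failure mode
luci surfaced as "service cman start failed".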

Comment 1 Ryan McCabe 2010-08-06 20:16:25 UTC
Created attachment 437255 [details]
patch to fix bug

Comment 4 Ryan McCabe 2010-10-26 20:22:07 UTC
*** Bug 629152 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2011-01-13 22:29:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0033.html