Bug 612300 - conga will fail to start a new node that is added via luci
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: conga
Version: 5.5
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: Ryan McCabe
QA Contact: Cluster QE
Duplicates: 629152
Reported: 2010-07-07 15:50 EDT by Shane Bradley
Modified: 2011-01-13 17:29 EST
Fixed In Version: conga-0.12.2-16.el5
Doc Type: Bug Fix
Last Closed: 2011-01-13 17:29:24 EST


Attachments
patch to fix bug (3.66 KB, patch)
2010-08-06 16:16 EDT, Ryan McCabe

Description Shane Bradley 2010-07-07 15:50:23 EDT
Description of problem:

The install error that luci reports: "A problem occurred when starting
this node: service cman start failed: "

When a new node is added in the current implementation of Luci/Ricci, a
skeleton cluster.conf is written to it with only the cluster name
set. Adding the node relies on "ccsd" to find the cluster by that name
on the network. If for some reason the cluster (identified by the name
in the skeleton cluster.conf) is not found, then starting cman will
fail: the skeleton cluster.conf defines no nodes, so the new node
cannot form a cluster on its own, and it could not find the original
cluster to obtain the full configuration.

The name of the cluster that this node is trying to join is:
"zylog_cluster2" which is the name that is in the skeleton
cluster.conf.

$  cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="zylog_cluster2">
       <fence_daemon post_fail_delay="0" post_join_delay="3"/>
       <clusternodes/>
       <cman/>
       <fencedevices/>
       <rm/>
</cluster>
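
The skeleton above has an empty <clusternodes/> element, which is why
the local node name cannot be found in it. As a rough illustration
(not part of luci or ricci; the function names defined_nodes and
defines_node are hypothetical), a check along these lines would tell a
skeleton apart from a usable configuration:

```python
import xml.etree.ElementTree as ET

# The skeleton cluster.conf quoted in this report.
SKELETON = """<?xml version="1.0"?>
<cluster config_version="1" name="zylog_cluster2">
       <fence_daemon post_fail_delay="0" post_join_delay="3"/>
       <clusternodes/>
       <cman/>
       <fencedevices/>
       <rm/>
</cluster>
"""


def defined_nodes(conf_xml):
    """Return the names of all <clusternode> entries in a cluster.conf string."""
    root = ET.fromstring(conf_xml)
    return [node.get("name") for node in root.findall("./clusternodes/clusternode")]


def defines_node(conf_xml, local_name):
    """True if the configuration lists local_name as a cluster node."""
    return local_name in defined_nodes(conf_xml)


if __name__ == "__main__":
    # The skeleton defines no nodes at all, so this prints False.
    print(defines_node(SKELETON, "clusternode3"))
```

For the skeleton, defined_nodes returns an empty list, which matches
the "local node name not found" failure described in this report.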

Here is the log for the 3rd node when it was added to the cluster.
$ tail -f /var/log/messages
Jul  7 14:37:25 clusternode3 ricci: startup succeeded
Jul  7 14:42:59 clusternode3 ccsd[9097]: Starting ccsd 2.0.115:
Jul  7 14:42:59 clusternode3 ccsd[9097]:  Built: May 25 2010 04:32:01
Jul  7 14:42:59 clusternode3 ccsd[9097]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
Jul  7 14:42:59 clusternode3 ccsd[9097]: cluster.conf (cluster name = zylog_cluster2, version = 1) found.
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] AIS Executive Service: started and ready to provide service.
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] local node name "clusternode3" not found in cluster.conf
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] Error reading CCS info, cannot start
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] Error reading config from CCS
Jul  7 14:43:01 clusternode3 openais[9103]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file).
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] AIS Executive Service: started and ready to provide service.
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] local node name "clusternode3" not found in cluster.conf
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] Error reading CCS info, cannot start
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] Error reading config from CCS
Jul  7 14:43:02 clusternode3 openais[9125]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file).
Jul  7 14:43:28 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 30 seconds.
Jul  7 14:43:58 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 60 seconds.
Jul  7 14:44:28 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 90 seconds.
Jul  7 14:44:58 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 120 seconds.
Jul  7 14:45:28 clusternode3 ccsd[9097]: Unable to connect to cluster infrastructure after 150 seconds.
--------------------------------------------------------------------------------

What is happening is that "ccsd" is not able to find the cluster by
its cluster name when it broadcasts. By default, ccsd simply searches
the network for a cluster.
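
To spot this failure on other nodes, one could scan the syslog for the
openais message quoted in the log above. This is only a sketch: the
regular expression is taken from the message text in this report, and
the helper name nodes_missing_from_conf is hypothetical.

```python
import re

# Matches the openais message seen in /var/log/messages in this report.
NODE_MISSING = re.compile(r'local node name "([^"]+)" not found in cluster\.conf')


def nodes_missing_from_conf(log_lines):
    """Return sorted node names that openais reported as absent from cluster.conf."""
    found = set()
    for line in log_lines:
        match = NODE_MISSING.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


if __name__ == "__main__":
    # Assumes the syslog lives at the path used in this report.
    with open("/var/log/messages", errors="replace") as log:
        for name in nodes_missing_from_conf(log):
            print(name)
```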

Version-Release number of selected component (if applicable):
luci-0.12.2-12.el5
ricci-0.12.2-12.el5

How reproducible:
Does not reproduce on all clusters. 100% reproducible on clusterA, 0%
on clusterB.

Steps to Reproduce:
1. Create a cluster
2. Add the existing cluster to luci
3. Add a new node to the cluster
  
Actual results:
The new node will not join the cluster since it cannot find the
cluster it was supposed to join.

Expected results:
The new node should join the cluster.

Additional info:
The fix is simple. Instead of relying on ccsd to find the correct
cluster and fetch the updated cluster.conf, just copy the cluster.conf
that was sent to the original nodes to the new nodes as well. Don't
start ccsd on the new node until the updated cluster.conf has been
copied to it.
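
The ordering proposed here could be sketched as follows. This is an
illustration only, not the attached patch: install_cluster_conf is a
hypothetical helper, and it only writes the file; starting cman/ccsd
afterwards is left to the caller.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def install_cluster_conf(conf_xml, node_name, dest_path):
    """Write the full cluster.conf to dest_path, but only if it defines node_name.

    Raises ValueError for a configuration (such as the skeleton) that does
    not list the node, so that the cluster services are never started
    against a file that cannot work.
    """
    root = ET.fromstring(conf_xml)
    names = [n.get("name") for n in root.findall("./clusternodes/clusternode")]
    if node_name not in names:
        raise ValueError("%s is not defined in the supplied cluster.conf" % node_name)

    # Write to a temporary file in the same directory, then rename it into
    # place, so a reader never sees a partially written cluster.conf.
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(conf_xml)
    os.replace(tmp_path, dest_path)
    return dest_path
```

Only after this call returns without error would the new node start
its cluster services, which removes the dependency on ccsd locating
the cluster over the network.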
Comment 1 Ryan McCabe 2010-08-06 16:16:25 EDT
Created attachment 437255
patch to fix bug
Comment 4 Ryan McCabe 2010-10-26 16:22:07 EDT
*** Bug 629152 has been marked as a duplicate of this bug. ***
Comment 8 errata-xmlrpc 2011-01-13 17:29:24 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0033.html
