Bug 157094

Summary: No indication for result of config file propagation with "Send to Cluster" button
Product: [Retired] Red Hat Cluster Suite
Component: redhat-config-cluster
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Paul Kennedy <pkennedy>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: adstrong, cluster-maint, jha, jparsons
Doc Type: Bug Fix
Fixed In Version: RHBA-2005-736
Last Closed: 2005-10-07 16:45:06 UTC
Bug Blocks: 159783

Description Paul Kennedy 2005-05-06 19:25:38 UTC
Description of problem:
Propagating an updated configuration file (clicking "Send to Cluster") does not
indicate the result of propagation.


Version-Release number of selected component (if applicable):
system-config-cluster 0.9.48

How reproducible:
After updating the configuration file (/etc/cluster/cluster.conf), saving it,
and clicking "Send to Cluster", there is no confirmation of whether propagation
of the revised configuration succeeded or failed.

Steps to Reproduce:
---------------------------
For successful propagation
---------------------------
1.  Make a valid change to the configuration file--for example,
    add or delete a service.
2.  Save the configuration file.
3.  Click "Send to Cluster".
4.  Observe that there is no confirmation that the propagation succeeded, as
    there is with 'ccs_tool update <xml file>'.

-----------------------------
For unsuccessful propagation
-----------------------------
1.  Make an invalid change to the configuration file--for example,
    change the cluster name.
2.  Save the configuration file.
3.  Click "Send to Cluster".
4.  Observe that there is no indication that the propagation failed, as there
    is with 'ccs_tool update <xml file>'.
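
For comparison, the command-line update path does report a result.  A minimal
sketch of that path (assuming the edited file is /etc/cluster/cluster.conf and
its config_version has already been incremented; exact messages vary by
release):

    # Push the revised file to all running cluster members; ccs_tool
    # prints an error here if propagation fails.
    ccs_tool update /etc/cluster/cluster.conf

    # On releases that require it, tell cman about the new version number.
    cman_tool version -r <new_version>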

Actual results:
See "Steps to Reproduce".

Expected results:
See "Steps to Reproduce".

Additional info:

Comment 1 Jim Parsons 2005-05-09 21:31:00 UTC
fixed in 0.9.51-1.0

Comment 2 Corey Marthaler 2005-05-11 21:49:22 UTC
Three problems:

1. I was told that the file was successfully propagated to the cluster, but it
wasn't, because one of the nodes was dead (in fact, the GUI even showed it was
dead).

2. This new information window has YES/NO click boxes. Is it asking me to check
whether it got propagated?

3. You spelled "propagate" wrong.

Comment 3 Jim Parsons 2005-05-20 17:46:01 UTC
This is all good to go now in 0.9.54-1.0

Comment 4 Jim Parsons 2005-05-23 18:13:21 UTC
*** Bug 158416 has been marked as a duplicate of this bug. ***

Comment 5 Corey Marthaler 2005-05-24 16:50:15 UTC
fix verified in -57.

Comment 6 Corey Marthaler 2005-05-31 22:16:20 UTC
I'm seeing this issue again: I'm told that the file was successfully propagated
to the cluster, but it wasn't, because one of the nodes was dead. And when this
node attempts to rejoin, it fails due to a version-number mismatch.

I think the GUI should still attempt to copy out to whichever nodes are still
up, but it also needs to let the user know which nodes it wasn't able to copy to.

Comment 7 Jim Parsons 2005-06-06 16:55:00 UTC
I may be wrong, but I do not believe that there is a way for the UI to get this
information. ccsd would have to be modified to return this information for
display. Reassigning to jbrassow for his comments and opinions.

Comment 8 Jonathan Earl Brassow 2005-06-06 17:26:42 UTC
The update is successful if every node in the cluster has been updated.

If a node is down, it is not part of the cluster.

The behaviour that should be expected when a failed node comes back after an update is:
1) ccsd starts on that node
2) cman_tool join is issued
3) ccs looks to the other systems to ensure it has the latest copy and retrieves as necessary
4) cman_tool completes
5) cluster infrastructure should be ready
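
A rough manual equivalent of steps 1-5 (a sketch assuming the standard RHEL4
init scripts and service names; adjust for the services actually deployed):

    service ccsd start      # step 1: ccsd starts on the returning node
    cman_tool join          # steps 2-4: ccs pulls the newest cluster.conf
                            # from the other members, then the join completes
    service fenced start    # step 5: bring up the rest of the infrastructure
    service clvmd start
    service rgmanager start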

Looking at comment #6, the expected behaviour does not seem to be happening.  Can you reproduce 
this?

Comment 9 Corey Marthaler 2005-06-07 20:14:52 UTC
I was able to reproduce this.  It only happens with the initscripts turned on;
without them (starting the services by hand), everything appears to work as
outlined above.

I shot morph-01, made some changes, and propagated a few times with the GUI.
When coming back up with init, morph-01 ran into problems:

Starting ccsd:  [  OK  ]
Starting cman:CMAN 2.6.9-34.5 (built May 23 2005 11:49:47) installed
CMAN: Cluster membership rejected
    (last message repeated 48 times in total)
[FAILED]
Starting lock_gulmd:[WARNING]
Starting fenced:  CMAN: Cluster membership rejected
Starting fence domain:[FAILED]
[FAILED]
Starting clvmd: clvmd could not connect to cluster manager
Consult syslog for more information
[FAILED]


morph-02:
Jun  7 10:52:47 morph-02 kernel: CMAN: removing node morph-01 from the cluster: Missed too many heartbeats
Jun  7 10:54:52 morph-02 kernel: CMAN: Join request from morph-01 rejected, config version local 6 remote 3
Jun  7 10:55:27 morph-02 last message repeated 7 times
Jun  7 10:56:32 morph-02 last message repeated 13 times
Jun  7 10:56:52 morph-02 last message repeated 4 times


After all this, I tried everything on morph-01 by hand, and it did work.

Comment 10 Jonathan Earl Brassow 2005-06-14 20:29:23 UTC
A key piece of evidence:

Jun 14 07:17:54 morph-04 rc: Starting ccsd:  succeeded
Jun 14 07:17:54 morph-04 kernel: CMAN 2.6.9-36.0 (built May 31 2005 12:15:02) installed
Jun 14 07:17:54 morph-04 kernel: NET: Registered protocol family 30
Jun 14 07:17:54 morph-04 ccsd[1703]: cluster.conf (cluster name = morph-cluster, version = 1) found.
Jun 14 07:17:54 morph-04 ccsd[1703]: Unable to perform sendto: Cannot assign requested address

The broadcast for the config document is bombing out on the error, thus failing
to bring in the latest copy.



Comment 11 Jonathan Earl Brassow 2005-06-15 01:32:58 UTC
- fix for bug 157094

A mysterious error was being generated when trying to do a broadcast (sendto):
ccsd[1704]: Unable to perform sendto: Cannot assign requested address

On certain clusters (seems to be when ccs tries using IPv6), this error
could show up 9 out of 10 times.  When the error was received, the
broadcast attempt would fail.  This caused the attempt to grab any
possibly updated cluster.conf files to abort.

Waiting a moment, closing the socket, reopening the socket, and retrying
the broadcast seems to solve the issue.  (It has worked 100+ times so far.)

I'm not entirely certain what is causing the initial try to fail - perhaps
the underlying subsystem is not quite ready...  In any case, I have never
seen a second attempt fail.
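
The pattern, as a minimal C sketch (hypothetical names, not ccsd's actual
code; it only illustrates the wait/close/reopen/retry sequence described
above):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* (Re)create the UDP socket used for the broadcast. */
    static int open_bcast_sock(void)
    {
        int on = 1;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;
        setsockopt(fd, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));
        return fd;
    }

    /* Broadcast, retrying once after a failed sendto: wait a moment,
     * close and reopen the socket, then try again.  Returns the
     * (possibly new) socket fd, or -1 if the second attempt also fails. */
    static int do_broadcast(int fd, const void *buf, size_t len,
                            const struct sockaddr_in *dest)
    {
        if (sendto(fd, buf, len, 0,
                   (const struct sockaddr *)dest, sizeof(*dest)) >= 0)
            return fd;                      /* first attempt worked */

        fprintf(stderr, "Unable to perform sendto: %s\n", strerror(errno));
        sleep(1);                           /* wait a moment */
        close(fd);                          /* close ... */
        fd = open_bcast_sock();             /* ... and reopen the socket */
        if (fd < 0)
            return -1;
        if (sendto(fd, buf, len, 0,         /* retry the broadcast */
                   (const struct sockaddr *)dest, sizeof(*dest)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }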

Holding off on the RHEL4U1 branch commit pending an OK; all other branches are committed.


Comment 12 Jonathan Earl Brassow 2005-06-15 13:58:38 UTC
The fix for this will not appear in RHEL4U1.  So, here are the workarounds.

Option #1 (the "if it hurts, don't do it" option):
Don't do updates when you have nodes down that you plan on bringing back into
the cluster.

Option #2:
When a machine comes back up, it may not grab the latest copy of the CCS
information; therefore, it will not be able to join the cluster and cluster
services will fail to start.  Rebooting the node will solve the problem.

Option #3:
When a machine comes back up, it may not grab the latest copy of the CCS
information; therefore, it will not be able to join the cluster and cluster
services will fail to start.  Run 'cman_tool join; modprobe dlm' and then
'service <fenced|clvmd|gfs|rgmanager> start' for each service in use, as
spelled out below.
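
Spelled out, that workaround looks roughly like this (a sketch assuming all
four services are in use; skip any that are not):

    cman_tool join
    modprobe dlm
    service fenced start
    service clvmd start
    service gfs start
    service rgmanager start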

Comment 13 Corey Marthaler 2005-09-07 21:14:14 UTC
fix verified in 1.0.15

Comment 14 Red Hat Bugzilla 2005-10-07 16:45:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-736.html