Bug 133254

Summary: ccsd cluster.conf version drops 2 version levels after an update followed by recovery
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: ccs
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED CURRENTRELEASE
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 4
CC: cluster-maint
Target Milestone: ---
Target Release: ---
Hardware: i686
OS: Linux
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-01-11 00:06:16 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Corey Marthaler 2004-09-22 19:08:35 UTC
Description of problem:
1. Get all nodes in the cluster to a particular cluster.conf version
level greater than 2.
2. Update cluster.conf on one node.
3. Send SIGHUP on that node.
4. Kill that node.
5. Verify the cluster.conf version levels on all other nodes; they
should have been bumped.
6. Bring the killed node back up.
7. Check its cluster.conf version; it should be the same as all others.
8. Start ccsd.
9. Check its cluster.conf version; it should be the same as all others.
10. Attempt cman_tool join.

You then get:
CMAN: Cluster membership rejected

11. Check its cluster.conf version; it drops exactly 2 version levels.
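The version checks in steps 5, 7, 9 and 11 can be scripted. A minimal sketch, assuming cluster.conf stores the level in a config_version="N" attribute (as RHCS does); the commands in the trailing comments require a live cluster node, and the SIGHUP target is assumed to be ccsd:

```shell
# conf_version: print the config_version attribute of a cluster.conf file.
conf_version() {
    sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' "$1" | head -n 1
}

# On a live node, the other steps map onto real commands, roughly:
#   ccs_tool update /etc/cluster/cluster.conf   # step 2: push the new config
#   kill -HUP "$(pidof ccsd)"                   # step 3 (assumed target: ccsd)
#   cman_tool join                              # step 10
```

Running `conf_version /etc/cluster/cluster.conf` on each node before and after the join attempt makes the rollback in step 11 easy to spot.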


How reproducible:

Comment 1 Kiersten (Kerri) Anderson 2004-11-04 15:08:13 UTC
Updates with the proper version and component name.

Comment 2 Kiersten (Kerri) Anderson 2004-11-04 15:17:01 UTC
Updates with the proper version and component name.

Comment 3 Kiersten (Kerri) Anderson 2004-11-04 15:21:08 UTC
Updates with the proper version and component name. Again, just love our tools.

Comment 4 Jonathan Earl Brassow 2004-12-17 17:53:56 UTC
- fix bug 143165, 134604, and 133254 - update related issues
  These all seem to be related to the same issue, that is, remote
  nodes were erroneously processing an update as though they were
  the originator - taking on some tasks that didn't belong to them.

  This was causing connect failures, version rollbacks, etc.

Comment 5 Corey Marthaler 2004-12-20 20:19:26 UTC
With the exact same scenario, the cman_tool join command no longer
fails with an error. HOWEVER, rather than dropping 2 version levels,
it drops all the way down to the first known version level.

So while all others are at, say, v8, the recovered node, which had v8
before the cman_tool join attempt, drops down to v1 after the
cman_tool join.

Comment 6 Jonathan Earl Brassow 2005-01-05 21:54:09 UTC
The problem stemmed from the fact that connect and broadcast request
processing went through different code paths.

When an update happens, a bit is set telling the daemon that the next
request should trigger a read of the config file.  While this happened
correctly for a connect, it did not for a broadcast.

When the node comes back up, the daemon is started, and cman_tool join
is initiated, the daemon broadcasts for config files.  The other nodes
were responding with an old version (ignoring the update bit).
Since the other nodes are quorate, ccsd respects their view instead of
the more recent view it has - thus resulting in a version regression.
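The failure mode can be modeled in a few lines. This is a simplified stand-in, not the ccsd source: a rejoining node adopts whatever version the quorate peers report, so a peer that ignores its update-pending bit drags the node back to a stale level.

```shell
# peer_reply: what a quorate peer answers to a version broadcast.
# $1 = 1 if the peer honors the update-pending bit (the fixed path).
peer_reply() {
    update_pending=1 current_ver=8 stale_ver=1
    if [ "$1" = 1 ] && [ "$update_pending" = 1 ]; then
        echo "$current_ver"    # fixed path: re-read the config before replying
    else
        echo "$stale_ver"      # buggy broadcast path: answer from stale state
    fi
}

# rejoin_version: the rejoining node trusts the quorate majority's answer.
rejoin_version() {
    peer_reply "$1"
}
```

With the buggy path the node regresses to the stale level; with the fixed path it keeps the current one.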

Now that this is fixed, the cman_tool join error will come back.  (I
have seen it in the logs - I don't know what Dave has done to the
printout on the command line.)  This is a result of the fact that cman
has not been made aware of the update (via 'cman_tool version -r
<new>').  So the incoming node will have a version number that is
different (higher) than the rest of the nodes - thus, it is rejected.

ccs_tool will soon (i.e., now) remove the need to run 'cman_tool
version -r <new>'.
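The old two-step flow described above can be sketched as a wrapper. propagate_update is a hypothetical helper, not a shipped tool; RUN=echo (the default here) makes it a dry run, and clearing RUN on a live node would execute the real commands:

```shell
# Dry-runnable sketch of the pre-fix update flow: distribute the config,
# then tell cman about the new version level by hand.
RUN="${RUN-echo}"

propagate_update() {
    conf="$1"; newver="$2"
    $RUN ccs_tool update "$conf"          # distribute the updated cluster.conf
    $RUN cman_tool version -r "$newver"   # make cman aware of the new version
}
```

The fix folds the second step into ccs_tool, so forgetting 'cman_tool version -r' can no longer cause a join rejection.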

Comment 7 Corey Marthaler 2005-01-11 00:06:16 UTC
Fix verified.