133254 – ccsd cluster.conf version drops 2 version levels after an update followed by recovery

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	ccs
Sub Component:
Version:	4
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jonathan Earl Brassow
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-09-22 19:08 UTC by Corey Marthaler
Modified:	2009-04-16 20:03 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-01-11 00:06:16 UTC
Embargoed:

Description Corey Marthaler 2004-09-22 19:08:35 UTC

Description of problem:
1. get all nodes in the cluster to a particular cluster.conf version
level greater than 2
2. then update one node
3. send SIGHUP on that node
4. then kill that node
5. verify the cluster.conf version levels on all other nodes; they
should have been bumped.
6. bring the killed node back up
7. check its cluster.conf version, should be same as all others
8. start ccsd
9. check its cluster.conf version, should be same as all others
10. attempt cman_tool join

You then get a:
CMAN: Cluster membership rejected

11. check its cluster.conf version; it has gone down exactly 2 version
levels.

CRAZY! :)

How reproducible:
Always

Comment 1 Kiersten (Kerri) Anderson 2004-11-04 15:08:13 UTC

Updates with the proper version and component name.

Comment 2 Kiersten (Kerri) Anderson 2004-11-04 15:17:01 UTC

Updates with the proper version and component name.

Comment 3 Kiersten (Kerri) Anderson 2004-11-04 15:21:08 UTC

Updates with the proper version and component name. Again, just love our tools.

Comment 4 Jonathan Earl Brassow 2004-12-17 17:53:56 UTC

- fix bug 143165, 134604, and 133254 - update related issues
  These all seem to be related to the same issue, that is, remote
  nodes were erroneously processing an update as though they were
  the originator - taking on some tasks that didn't belong to them.

  This was causing connect failures, version rollbacks, etc.

Comment 5 Corey Marthaler 2004-12-20 20:19:26 UTC

With the exact same scenario, the cman_tool join command no longer
fails with an error. HOWEVER, rather than dropping 2 version levels,
the version drops all the way down to the first known version level.

So while all others are at, say, v8, the recovered node, which was at
v8 before the cman_tool join attempt, drops down to v1 after the
cman_tool join.

Comment 6 Jonathan Earl Brassow 2005-01-05 21:54:09 UTC

The problem stemmed from the fact that connect and broadcast request
processing went through different code paths.

When an update happens, a bit is set telling the daemon that the next
request should trigger a read of the config file.  While this happened
correctly for a connect, it did not for a broadcast.

When the node comes back up, the daemon is started, and cman_tool join
is initiated, the daemon broadcasts for config files.  The other nodes
were responding with an old version (ignoring the update bit).
Since the other nodes are quorate, ccsd respects their view instead of
the more recent view it has - thus resulting in a version regression.
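
The regression described above can be illustrated with a small toy model. This is only a sketch, not ccsd's actual code: the class, the field names, the `honor_update_bit` switch, and the rule that the joining node simply adopts a quorate peer's reply are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ccsd:
    """Toy model of one ccsd instance; every name here is hypothetical."""
    disk_version: int             # version of cluster.conf on disk
    cached_version: int           # version the daemon last read into memory
    update_pending: bool = False  # the "update bit"

    def apply_update(self, new_version: int) -> None:
        # An update writes the new file and sets the bit; the in-memory
        # copy is only refreshed when the next request arrives.
        self.disk_version = new_version
        self.update_pending = True

    def _refresh_if_needed(self) -> None:
        if self.update_pending:
            self.cached_version = self.disk_version
            self.update_pending = False

    def handle_connect(self) -> int:
        # The connect path always honored the update bit.
        self._refresh_if_needed()
        return self.cached_version

    def handle_broadcast(self, honor_update_bit: bool) -> int:
        # honor_update_bit=False models the separate, pre-fix code path
        # that answered broadcasts from the stale in-memory copy.
        if honor_update_bit:
            self._refresh_if_needed()
        return self.cached_version


def version_after_join(own_version: int, quorate_reply: Optional[int]) -> int:
    # Assumption: a joining node defers to a quorate peer's reply and only
    # keeps its own version when nobody answers.
    return own_version if quorate_reply is None else quorate_reply


if __name__ == "__main__":
    for fixed in (False, True):
        peer = Ccsd(disk_version=7, cached_version=7)
        peer.apply_update(8)  # the peer has received the update
        reply = peer.handle_broadcast(honor_update_bit=fixed)
        print("fixed" if fixed else "buggy", version_after_join(8, reply))
```

In the pre-fix case the peer answers the broadcast with the stale version, so the recovered node gives up its newer copy; once both request types refresh on the update bit, the reply matches what is on disk.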

Now that this is fixed, the cman_tool join error will come back.  (I
have seen it in the logs - I don't know what Dave has done to the
printout on the command line.)  This is a result of the fact that cman
has not been made aware of the update (via 'cman_tool version -r
<new>').  So the incoming node will have a version number that is
different (higher) than the rest of the nodes - thus, it is rejected.
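
A rough sketch of why the join is rejected until cman learns the new version. The real cman check is not shown in this report; `CmanView`, `set_version`, and `admit` are hypothetical names, and the only behavior taken from the comment above is that a joiner whose version differs from the members' is rejected.

```python
class CmanView:
    """Hypothetical stand-in for the config version cman believes is current."""

    def __init__(self, config_version: int) -> None:
        self.config_version = config_version

    def set_version(self, new_version: int) -> None:
        # Stands in for the effect of running 'cman_tool version -r <new>'.
        self.config_version = new_version

    def admit(self, joiner_version: int) -> bool:
        # A joiner is admitted only when its version matches the members'.
        return joiner_version == self.config_version


if __name__ == "__main__":
    members = CmanView(config_version=7)  # cman has not been told about v8
    print(members.admit(8))               # joiner is ahead of cman's view
    members.set_version(8)
    print(members.admit(8))
```

Under these assumptions the joiner at v8 is turned away while cman still reports v7, and is admitted once cman's version has been raised to match.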

ccs_tool will soon (i.e. now) take away the need to run 'cman_tool
version -r <new>'.

Comment 7 Corey Marthaler 2005-01-11 00:06:16 UTC

fix verified.
