Bug 504677 - cpg join confchg wrong after a node fails and restarts
Summary: cpg join confchg wrong after a node fails and restarts
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: corosync
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Steven Dake
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-06-08 19:16 UTC by David Teigland
Modified: 2016-04-27 01:36 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-12-06 00:46:57 UTC
Type: ---
Embargoed:


Attachments
Super ugly workaround (1.62 KB, patch)
2009-06-25 14:58 UTC, Jan Friesse
Configurable version of super ugly workaround (2.32 KB, patch)
2009-06-25 15:01 UTC, Jan Friesse

Description David Teigland 2009-06-08 19:16:18 UTC
Description of problem:

using corosync from svn, up to date as of Mon Jun  8 14:06:10 CDT 2009

a node fails, rejoins the cluster, and calls cpg_join; the first confchg it receives shows that it is the only member of the cpg when it's not

cpgx -l0 -e0 -d1

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Steven Dake 2009-06-09 14:25:22 UTC
Honza,

Can you look at this and see if it is fixed by the recent confchg ordering patch sent to the ml?

Thanks
-steve

Comment 2 Bug Zapper 2009-06-09 17:13:50 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 3 David Teigland 2009-06-16 16:55:01 UTC
I reproduce this bug by running cpgx on four nodes.
Two nodes run: cpgx -l0 -e0 -d1
The other two: cpgx -l0 -e0 -d0

The errors can look different depending on exactly what happens, and sometimes the node with the bad confchg won't actually report an error; the others do.  The errors often relate to receiving the standard "time" messages from nodes that were not listed as members in the latest confchg, or receiving the time messages from nodes that aren't synced.  All of these errors can be traced back pretty quickly to the initial bad confchg.

Comment 4 David Teigland 2009-06-16 17:16:09 UTC
Here's an analysis of one error scenario I often see.

nodeid 1
--------
1245171191 D: do die 42842
1245171191 D: killing corosync
1245171201 D: starting corosync
1245171204 D: do join our_nodeid 1
1245171206 H: 00000000 conf 1 1 0 memb 1 join 1 left    <-- The bad confchg
1245171206 ERROR: receive_time from non member
1245171206 ERROR: 00000000 time 4 tv 1245170799.688905 config 61553
1245171206 ERROR: 00000000 conf 1 1 0 memb 1 join 1 left
1245171206 ERROR: receive_time from non member
1245171206 ERROR: 00000000 time 4 tv 1245170799.742887 config 61553
1245171206 ERROR: 00000000 conf 1 1 0 memb 1 join 1 left
...

The bad confchg above should have been "conf 4 1 0 memb 1 2 4 5 join 1 left"
as seen on the other nodes below.  This node reports an error when it receives time message "1245170799.688905" from nodeid 4 since nodeid 4 is not a member according to the bad confchg.

nodeid 2
--------
no errors; it continued running, and the confchg had scrolled out of the buffer

(no nodeid 3 in this cluster)

nodeid 4
--------
1245170803 H: 00061942 conf 4 1 0 memb 1 2 4 5 join 1 left
1245170803 D: update_nodes_list
1245170803 D: nodeid 2 is_member 1 needs_sync 0 join 00056959 check 00061853
1245170803 D: nodeid 4 is_member 1 needs_sync 0 join 00000000 check 00061856
1245170803 D: nodeid 5 is_member 1 needs_sync 0 join 00000094 check 00061856
1245170803 D: nodeid 1 is_member 1 needs_sync 1 join 00061942 check 00060915
1245170803 H: 00061943 time 4 tv 1245170799.688905 config 61553
1245170803 H: 00061944 time 4 tv 1245170799.742887 config 61553
...
1245170803 H: 00062213 time 5 tv 1245170740.838605 config 61553
1245170803 H: 00062214 time 5 tv 1245170740.886586 config 61553
1245170803 H: 00062215 time 4 tv 1245170803.617528 config 61553
1245170803 H: 00062216 time 4 tv 1245170803.659515 config 61553
1245170803 H: 00062217 time 4 tv 1245170803.683505 config 61553
1245170803 H: 00062218 conf 3 0 1 memb 1 4 5 join left 2
1245170803 D: update_nodes_list
1245170803 D: nodeid 2 is_member 0 needs_sync 0 join 00056959 check 00061853
1245170803 D: nodeid 4 is_member 1 needs_sync 0 join 00000000 check 00061856
1245170803 D: nodeid 5 is_member 1 needs_sync 0 join 00000094 check 00061856
1245170803 D: nodeid 1 is_member 1 needs_sync 1 join 00061942 check 00060915
1245170803 H: 00062219 time 5 tv 1245170740.912262 config 61942
1245170803 H: 00062220 time 4 tv 1245170803.733567 config 61942
1245170803 ERROR: receive_time from 1 needs_sync
1245170803 ERROR: 00000000 time 1 tv 1245171206.126757 config 0


nodeid 5
--------
1245170740 H: 00061942 conf 4 1 0 memb 1 2 4 5 join 1 left
1245170740 D: update_nodes_list
1245170740 D: nodeid 2 is_member 1 needs_sync 0 join 00056959 check 00061853
1245170740 D: nodeid 4 is_member 1 needs_sync 0 join 00000000 check 00061856
1245170740 D: nodeid 5 is_member 1 needs_sync 0 join 00000000 check 00061856
1245170740 D: nodeid 1 is_member 1 needs_sync 1 join 00061942 check 00060915
1245170740 H: 00061943 time 4 tv 1245170799.688905 config 61553
1245170740 H: 00061944 time 4 tv 1245170799.742887 config 61553
...
1245170740 H: 00062213 time 5 tv 1245170740.838605 config 61553
1245170740 H: 00062214 time 5 tv 1245170740.886586 config 61553
1245170740 H: 00062215 time 4 tv 1245170803.617528 config 61553
1245170740 H: 00062216 time 4 tv 1245170803.659515 config 61553
1245170740 H: 00062217 time 4 tv 1245170803.683505 config 61553
1245170740 H: 00062218 conf 3 0 1 memb 1 4 5 join left 2
1245170740 D: update_nodes_list
1245170740 D: nodeid 2 is_member 0 needs_sync 0 join 00056959 check 00061853
1245170740 D: nodeid 4 is_member 1 needs_sync 0 join 00000000 check 00061856
1245170740 D: nodeid 5 is_member 1 needs_sync 0 join 00000000 check 00061856
1245170740 D: nodeid 1 is_member 1 needs_sync 1 join 00061942 check 00060915
1245170740 H: 00062219 time 5 tv 1245170740.912262 config 61942
1245170740 H: 00062220 time 4 tv 1245170803.733567 config 61942
1245170740 ERROR: receive_time from 1 needs_sync
1245170740 ERROR: 00000000 time 1 tv 1245171206.126757 config 0

The cpgx test program doesn't send time messages until it is synced with other nodes.  Because nodeid 1 thinks it's the only member (per the bad confchg), it doesn't wait for any syncing and just begins sending time messages right away.
The other nodes then report errors when they receive time messages from 1 when they were not expecting any.

Comment 5 David Teigland 2009-06-16 19:44:12 UTC
Running this test on RHEL5.4-Server-20090608.0, openais-0.80.6-2.el5

To run cpgx on RHEL5 with the default 10 second token timeout, you need to add the -w 20 option to cpgx to avoid bug 506255.

The problem still exists, but is slightly different.  The joining node receives the same bad confchg, but after that it also receives bogus confchgs showing each existing member being added:

1245180884 D: do die 12757
1245180884 D: killing aisexec
1245180904 D: starting aisexec
1245180904 D: do join our_nodeid 1
1245180904 H: 00000000 conf 1 1 0 memb 1 join 1 left
1245180904 H: 00000001 conf 2 1 0 memb 1 2 join 2 left
1245180904 H: 00000002 conf 3 1 0 memb 1 2 3 join 3 left
1245180904 H: 00000003 conf 4 1 0 memb 1 2 3 4 join 4 left

There's not really much difference between one bad confchg and four, since the first is fatal.

Comment 6 David Teigland 2009-06-16 20:09:34 UTC
I get the same results with openais-0.80.3-22.el5, which is the 5.3.0 package AFAIK.

Comment 7 Jan Friesse 2009-06-24 15:03:48 UTC
David,
this is how cpg currently works:
- you have nodes 1, 2, 3
- each node has some cpg connection
- on node 1, you shut down corosync
- the cpg application reconnects very quickly
- you start corosync on node 1 again
- the following then happens:
* node 1 creates a ring with one member: itself (this is the confchg you see)
* the cpg application connects to that node (why not) and receives a confchg for a group containing one node (I think this is correct)
* now sync begins
* nodes 2 and 3 send joinlists about themselves
* node 1 sends a joinlist about itself
* so nodes 2 and 3 get a "node 1 up" message
* and node 1 gets "node 2 up" and "node 3 up" messages
* meanwhile, something on node 3 is sending regular messages very quickly
* and some of those messages are delivered (this is the part I am not sure about) during the second sync, but BEFORE node 1 gets the "node 2, 3 up" confchg
This really shouldn't happen, and I have a fix for it.

What you expect cpg to do, if I understand cpgx correctly, is to deliver the same messages on every node.  I think this is impossible.  Imagine you have 6 nodes and the cluster splits into 3 and 3, then rejoins.  Which half of the nodes is the "right" one, and which should get the confchg callbacks about joins/leaves?

Comment 8 David Teigland 2009-06-24 15:51:46 UTC
I'm afraid I don't follow, let's try again.

In comment 4, cpgx on nodeid 1 joins the cpg and receives this confchg:

  1245171206 H: 00000000 conf 1 1 0 memb 1 join 1 left

That is incorrect.  It should have received a confchg like the others:

  conf 4 1 0 memb 1 2 4 5 join 1 left

Comment 9 Steven Dake 2009-06-24 18:33:13 UTC
The assertion in comment #8 requires clarification.  The output you see is correct if the cpg_join on node 1 happens before totem forms a network of 1, 2, 3 (i.e., it is still a singleton ring at startup, for example).  It is not acceptable if totem has already formed a network of 1, 2, 3.

Does the test verify that totem has formed a network before executing the cpg_join?

Regards
-steve

Comment 10 David Teigland 2009-06-24 18:53:50 UTC
I use "cman_tool join -w".  The -w is supposed to wait for the node to be a member of the cluster, but perhaps it doesn't quite work.

I also want to support starting the cluster without cman_tool, by just running aisexec/corosync directly.  In that case how can I tell when the node is a member?

Comment 11 Steven Dake 2009-06-24 19:00:52 UTC
There is no preset list of processors which make up the membership - it's entirely dynamic.  As a result, there is no reasonable way to determine that the cluster is fully joined before doing a cpg_join.  One option is to write a cpg app to use cpg_local_get to retrieve the membership you expect before executing a cpg_join.

Comment 12 David Teigland 2009-06-24 19:24:58 UTC
> There is no preset list of processors which make up the membership - its entirely dynamic.

Of course membership is dynamic...

> As a result, there is no reasonable way to determine that the cluster is fully joined before doing a cpg_join.

If other nodes are there, then a membership exists, and the membership protocol should find them.  Once this protocol has run through, that's when I'd say we've "joined the cluster" or "become a member".  Until then, from the perspective of an application, it's pointless to say/do anything, and possibly even harmful.

> One option is to write a cpg app to use cpg_local_get to retrieve the membership you expect before executing a cpg_join.

There is no expected membership, it's dynamic and we're joining to find it out!

Comment 13 David Teigland 2009-06-24 19:41:22 UTC
FWIW, the -w option to cman_tool join is important beyond this test program.  init.d/cman depends on it.

Comment 14 David Teigland 2009-06-24 21:00:29 UTC
I'm trying a work-around of adding sleep(5) after the cman_tool join -w.  In my case, 5 seconds should be enough time for the join to really complete before restarting cpgx.  I haven't yet seen the same problem.  I'll try to confirm tomorrow that without the sleep, the failure occurs when the node is the only cluster member.

Comment 15 Jan Friesse 2009-06-25 14:58:25 UTC
Created attachment 349409 [details]
Super ugly workaround

David,
attached is a super ugly workaround for the problem, based on a similar (the same) idea as your last comment.  After sync is finished, we wait for some time (5-6 seconds), and during that time no new join calls are allowed (or rather, during that time we keep returning error 6).

The workaround is so ugly that I'd rather not send it to the mailing list, but it looks like it solves this bug.

Comment 16 Jan Friesse 2009-06-25 15:01:54 UTC
Created attachment 349410 [details]
Configurable version of super ugly workaround

Attached is a configurable version of the previous workaround.  It is configured in the config file like this:

cpg {
  sync_timeout: 4
}

The default is 0, meaning the workaround is not used.  "Sync timeout" is a bad name, so if we decide to include this and the previous "patch", it would be good to think of a better one.

With this configuration the problem seems to be "solved".  Again, I didn't send this to the mailing list, because it is not a real solution (but then, I think no real solution exists :( ).
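
For a caller, "error 6" here is presumably CS_ERR_TRY_AGAIN, so an application running against a corosync with this workaround applied would simply retry the join.  A minimal sketch of such a retry loop (illustrative only, not part of the attached patch; it assumes the corosync cpg API and an arbitrary 1-second retry interval):

#include <unistd.h>
#include <corosync/cpg.h>

/*
 * Illustrative sketch only: keep retrying cpg_join() while the service
 * returns CS_ERR_TRY_AGAIN (presumably the "error 6" mentioned above).
 * The 1-second retry interval is arbitrary.
 */
static cs_error_t join_with_retry(cpg_handle_t handle, struct cpg_name *group)
{
    cs_error_t err;

    do {
        err = cpg_join(handle, group);
        if (err == CS_ERR_TRY_AGAIN)
            sleep(1);
    } while (err == CS_ERR_TRY_AGAIN);

    return err;
}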

Comment 17 David Teigland 2009-06-25 16:15:24 UTC
So, I think it may be better to approach this problem from the angle of fixing "cman_tool join -w".  That's the bigger problem I think, and the cpg issues are a symptom of it.

We want the -w option to cause cman_tool join to wait until the membership protocol has run to the point where it would detect other existing members and then come to agreement with them on the new membership.

Comment 18 Jan Friesse 2009-06-29 09:40:11 UTC
David,
I played with cman_tool join -w for a while.  I'm pretty sure it doesn't work (or maybe it works totally differently than I thought).  I configured a 3-node cluster with 1 vote per node, so the cluster should be quorate when two nodes are active.  But when I run only 1 node, I'm still able to run testcpg (or cpgx) and it works (and the cluster is NOT quorate).

If cman_tool join -w worked as expected (or rather, as I expect), it would solve this problem for clusters with more than 2 nodes (or rather, any cluster where a single node is never quorate).

But as it stands, cman_tool join -w looks the same as just running corosync by itself (as far as cpg is concerned).

Comment 19 Steven Dake 2009-06-29 19:20:52 UTC
Assigning to chrissie since this appears to be a cman issue not a cpg issue.

If after making the cman_tool -w command work properly there is still an issue with corosync, please reassign.

Comment 20 Jan Friesse 2009-06-30 14:24:36 UTC
I played with cman_tool some more.  It looks like I misunderstood -w: it just waits until corosync has joined the cluster (that is, until cman_get_node_count >= 1).  What I thought -w did is actually what -q does (wait for the cluster to become quorate), and that seems to work: cman_tool doesn't return until the cluster is quorate...  But either way, cpgx can still be used without any problem on a non-quorate cluster.

Comment 21 David Teigland 2009-06-30 19:07:25 UTC
- cluster members=A,B,C, cpg members=A,B,C
- node D: runs cman_tool join -w
- node D: cman_tool should not return until it recognizes members=A,B,C,D
  (I'd suggest that this -w behavior be the default, but that's a minor issue)
- node D: cpgx is started after cman_tool join -w returns
- node D: cpgx joins the cpg
- node D: the first cpg confchg should show cpg members=A,B,C,D,
  not cpg members=D as comment 8 shows.

Comment 22 Christine Caulfield 2009-07-13 10:23:48 UTC
Honza, being able to use cpg on a non-quorate cluster is normal behaviour.  In RHEL5 there is no concept of quorum outside of cman.

Comment 23 Christine Caulfield 2009-07-16 08:10:25 UTC
Dave, so you're saying that 'cman_tool join -w' should not return until the number of nodes in the cluster is higher than 1?  I'm not sure how popular that will be.

cman_tool join -w has nothing to do with how many nodes are in the cluster.  Its job is to wait until 'cman' is ready to accept other commands.  It makes no assumptions about the membership state at all.  The main purpose of it was in RHEL4, where the membership join protocol could take several seconds even (in fact, particularly) to get a single node up.

If you want a "wait until the number of nodes is > 1" option, it can be added if there's a good enough reason, but I'm not sure the standard -w is the place for it.

Comment 24 David Teigland 2009-07-16 18:40:25 UTC
It's not >1 nodes I'm interested in per se.  My understanding from the comments is that if a cluster of 3 nodes exists, then when a fourth node joins, it initially sees itself as a lone member of the cluster, and only later in the process sees itself along with the other three.  What we need is a way to wait until this later point has been reached, where, if a cluster already exists, it has been recognized.  My suggestion was that -w be used to wait for that later point instead of the earlier one, but another mechanism would suffice.  (We'll want to use it from init.d/cman, and other programs like cpgx will also want to use it.)

If apps start up and run believing the initial lone partition, I believe it will cause various unwanted results.  The initial "lone member by myself" state is generally useless (and potentially harmful) to apps using the cluster, since the whole point of a membership system is to know about other nodes if they exist.

What I've been doing to work around this is adding a sleep(5) after cman_tool join and hoping that that's long enough for the true cluster membership to be found.

Comment 25 David Teigland 2009-07-16 19:50:06 UTC
BTW, I'm also open to other ways to work around this, but ultimately it would be nice to implement the solution or work-around in one spot for everyone to use instead of making everyone invent their own ad hoc method.  That's what led me back to -w.
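
As an illustration of the kind of ad hoc method in question, an application could itself poll libcman until more than one node is visible, using the cman_get_node_count call Honza mentions in comment 20.  This is only a sketch under those assumptions, not a proposed fix; the 30-attempt limit and 1-second interval are arbitrary:

#include <unistd.h>
#include <libcman.h>

/*
 * Illustrative sketch of one possible ad hoc wait: poll libcman until
 * more than one cluster node is visible, or give up after ~30 seconds.
 * Returns 0 once other members are seen, -1 otherwise.
 */
static int wait_for_other_members(void)
{
    cman_handle_t ch;
    int count = 0;
    int i;

    ch = cman_init(NULL);
    if (!ch)
        return -1;

    for (i = 0; i < 30; i++) {
        count = cman_get_node_count(ch);
        if (count > 1)
            break;
        sleep(1);
    }

    cman_finish(ch);
    return (count > 1) ? 0 : -1;
}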

Comment 26 Christine Caulfield 2009-07-17 08:13:02 UTC
But corosync doesn't know, when it starts up, whether there are other nodes in its cluster.  That's just how it works: it starts up as a single node, and then that node joins the rest.

I realise it's very different from RHEL4 which actively went looking for other nodes before it formed a cluster (which caused the need for -w in the first place!).

The alternative is to say that a cluster of 1 node is not a cluster and is unsupported; nothing will then work until you have more than one node.  Personally, I'm not totally averse to that, but it's a policy change we need to be very careful about making.

Comment 27 Steven Dake 2009-07-17 13:50:41 UTC
After talking with dct last night, I spent a lot of time thinking about this particular problem, and here is what I've come up with.

The original totem designers built totem around a static list of processors, which was loaded into "my_proc_list".  This would trigger a consensus timeout in the single-node case.

Since totemsrp in corosync is totally dynamic, it determines consensus as soon as one processor sends a join message.  When this happens during startup, consensus is achieved immediately (my_proc_list = join.proc_list and my_fail_list = join.fail_list, which sets consensus for join.source, hence consensus is achieved).

I believe we need to more closely emulate how consensus is determined in the original model.

When my_proc_list_entries = 1, consensus should not be achieved on receipt of join messages, only on consensus timeout.

This gives the membership algorithm some time to build up my_proc_list, and gets rid of the single-node membership configuration changes at startup when there is an existing cluster.
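
Restated as a toy rule (this is not totemsrp source code; the parameter names are made up for illustration):

#include <stdbool.h>
#include <stddef.h>

/*
 * Toy restatement of the proposed rule, not actual totemsrp code.
 * A processor whose proc list contains only itself waits for the
 * consensus timeout instead of declaring consensus on the first join
 * message, giving an existing cluster time to be discovered.
 */
static bool may_declare_consensus(size_t my_proc_list_entries,
                                  bool join_lists_agree,
                                  bool consensus_timeout_expired)
{
    if (my_proc_list_entries == 1)
        return consensus_timeout_expired;

    return join_lists_agree;
}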

I'll take a look at that on Sunday/Monday.  This would require that whichever option is used by cman_tool to "wait for membership" should wait until the first regular configuration change before returning to the user.  I'll let chrissie sort out which option should be used within cman.

Comment 28 Christine Caulfield 2009-07-20 09:33:14 UTC
Steve, that sounds lovely.

If you can implement this then it sounds like an ideal use for 'cman_tool join -w' which should also minimise any further impact.

Comment 29 David Teigland 2009-08-06 20:12:34 UTC
This would still be nice, but it's not as important now that init.d/cman is waiting for quorum.

