Red Hat Bugzilla – Bug 507953
cpg confchg removes nodes that didn't leave or fail
Last modified: 2016-04-26 17:14:07 EDT
Description of problem:
I think this is a new regression (I've not seen it before) since updating to
[svn/corosync/trunk]% svn info
Repository Root: svn+ssh://svn.fedorahosted.org/svn/corosync
Repository UUID: fd59a12c-fef9-0310-b244-a6a79926bd2f
Node Kind: directory
Last Changed Author: sdake
Last Changed Rev: 2289
Last Changed Date: 2009-06-24 00:21:13 -0500 (Wed, 24 Jun 2009)
I'm trying to test a work-around to bz 504677 where I add a sleep(5) after the cman_tool join -w in cpgx to make sure that the node has really joined the cluster before joining the cpg and starting the test.
Two nodes (1 and 2) run: cpgx -l0 -e0 -d1
The other two (4 and 5): cpgx -l0 -e0 -d0
(sometimes this test hits bz 504036, in which case I kill the stuck cpgx and restart it manually)
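The work-around described above (wait a fixed time after joining the cluster, before any cpg activity) can be sketched roughly as follows. This is only an illustration of the idea, not the actual cpgx change: the function name and the injectable `run`/`sleep` parameters are assumptions made so the sketch can be exercised without a cluster; only `cman_tool join -w` and the 5-second delay come from the report.

```python
import subprocess
import time

JOIN_SETTLE_SECONDS = 5  # fixed delay mentioned in the report


def join_cluster_then_wait(run=subprocess.run, sleep=time.sleep):
    """Join the cluster, then pause before any cpg activity starts.

    `run` and `sleep` are injectable (hypothetical design choice) so the
    ordering can be checked without cman installed.
    """
    # "cman_tool join -w" is the command named in the report.
    run(["cman_tool", "join", "-w"], check=True)
    # Give membership time to settle before joining the cpg.
    sleep(JOIN_SETTLE_SECONDS)
```

A fixed sleep only narrows the window; it does not prove the node has really joined, which is why this is a test work-around and not a fix.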
What I see is that nodes 1 and 2 both die, while the remaining nodes 4 and 5 continue running, just sending messages. Then 4 and 5 both get this confchg:
conf 1 0 1 memb 5 join left 4 -- indicating that node 4 has left/failed
then node 5 (the only remaining node) gets this confchg:
conf 0 0 1 memb join left 5 -- indicating that it too has left/failed
Neither 4 nor 5 left the cpg or failed, so there should be no confchgs.
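To check logs for this symptom mechanically, something like the sketch below could flag confchg lines that remove nodes which were not actually killed. The line layout (`conf <members> <joined> <left> memb ... join ... left ...`) is inferred from the two lines quoted above and is an assumption, as are the names `ConfChg`, `parse_confchg` and `unexpected_departures`.

```python
from typing import List, NamedTuple, Set


class ConfChg(NamedTuple):
    members: List[int]
    joined: List[int]
    left: List[int]


def parse_confchg(line: str) -> ConfChg:
    """Parse a cpgx-style confchg line (layout assumed from the report)."""
    # Drop any trailing "-- comment" annotation.
    tokens = line.split("--", 1)[0].split()
    if len(tokens) < 7 or tokens[0] != "conf":
        raise ValueError(f"not a confchg line: {line!r}")
    n_memb, n_join, n_left = (int(t) for t in tokens[1:4])
    rest = tokens[4:]
    i_memb, i_join, i_left = (rest.index(k) for k in ("memb", "join", "left"))
    members = [int(t) for t in rest[i_memb + 1:i_join]]
    joined = [int(t) for t in rest[i_join + 1:i_left]]
    left = [int(t) for t in rest[i_left + 1:]]
    # The three leading counts should agree with the list lengths.
    if (len(members), len(joined), len(left)) != (n_memb, n_join, n_left):
        raise ValueError(f"counts do not match lists: {line!r}")
    return ConfChg(members, joined, left)


def unexpected_departures(chg: ConfChg, dead_nodes: Set[int]) -> List[int]:
    """Return nodes reported as left that are not known to have died."""
    return [n for n in chg.left if n not in dead_nodes]
```

With nodes 1 and 2 as the only ones that died, both quoted lines would be flagged: the first reports node 4 as left, the second node 5.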
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Identify if this is a regression. Dave said he saw it recently after upgrading corosync but didn't find it previously. Try an older version to reproduce, and bisect to identify the patch that introduced the problem if it is a regression.
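The bisection step could look roughly like this, treating svn revisions as integers. The predicate `is_bad` is hypothetical: it stands for "build this revision, run the four-node cpgx test, and report whether the bogus confchg appeared". Since the failure is intermittent, it would probably need several runs per revision before declaring one good.

```python
from typing import Callable


def first_bad_revision(good: int, bad: int, is_bad: Callable[[int], bool]) -> int:
    """Binary-search for the first revision where is_bad() is true.

    Assumes `good` is known good, `bad` is known bad, and every revision
    after the first bad one is also bad.
    """
    if good >= bad:
        raise ValueError("good revision must precede bad revision")
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return bad
```

For example, `first_bad_revision(2200, 2289, is_bad)` would search between a hypothetical known-good r2200 and the r2289 reported above in about seven test runs.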
I have only 3 nodes, so I tested:
- 2 nodes cpgx -l0 -e0 -d1 and 1 node cpgx -l0 -e0 -d0
- 2 nodes cpgx -l0 -e0 -d0 and 1 node cpgx -l0 -e0 -d1
I was not able to reproduce this issue (current trunk). Does this issue need more than 3 nodes to reproduce?
I've not been able to hit this with three nodes, so it looks like you'll need four.
Sadly, I have only 3 nodes available, so I'm reassigning this back to Steve (I hope he has more than 3 nodes).
Using Honza's cpgx fix from bug 504036, I've not been able to reproduce this problem.