Bug 507953 - cpg confchg removes nodes that didn't leave or fail
Product: Fedora
Classification: Fedora
Component: corosync
Platform: All Linux
Priority: low  Severity: medium
Assigned To: Steven Dake
QA Contact: Fedora Extras Quality Assurance
Reported: 2009-06-24 16:10 EDT by David Teigland
Modified: 2016-04-26 17:14 EDT

Doc Type: Bug Fix
Last Closed: 2009-07-08 13:16:50 EDT

Attachments: None
Description David Teigland 2009-06-24 16:10:35 EDT
Description of problem:

I think this is a new regression (I've not seen it before) since updating to:

[svn/corosync/trunk]% svn info
Path: .
URL: svn+ssh://svn.fedorahosted.org/svn/corosync/trunk
Repository Root: svn+ssh://svn.fedorahosted.org/svn/corosync
Repository UUID: fd59a12c-fef9-0310-b244-a6a79926bd2f
Revision: 2289
Node Kind: directory
Schedule: normal
Last Changed Author: sdake
Last Changed Rev: 2289
Last Changed Date: 2009-06-24 00:21:13 -0500 (Wed, 24 Jun 2009)  

I'm trying to test a work-around to bz 504677 where I add a sleep(5) after the cman_tool join -w in cpgx to make sure that the node has really joined the cluster before joining the cpg and starting the test.

Two nodes (1 and 2) run: cpgx -l0 -e0 -d1
The other two (4 and 5): cpgx -l0 -e0 -d0

(sometimes this test hits bz 504036, in which case I kill the stuck cpgx and restart it manually)

What I see is that nodes 1 and 2 both die; the remaining nodes 4 and 5 continue running, just sending messages.  Then 4 and 5 both get this confchg:

conf 1 0 1 memb 5 join left 4  -- indicating that node 4 has left/failed

then node 5 (the only remaining node) gets this confchg:

conf 0 0 1 memb join left 5  -- indicating that it too has left/failed

Neither 4 nor 5 left the cpg or failed, so there should have been no confchgs.
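For reference, the quoted confchg lines appear to follow the layout `conf <n_memb> <n_join> <n_left> memb <ids> join <ids> left <ids>`. As an illustration only (the format is inferred from the two log lines above, not from the cpgx source), a small Python sketch that decodes them:

```python
import re

def parse_confchg(line):
    """Parse a cpgx confchg log line into (members, joined, left) node-id
    lists.  Layout inferred from the log lines in this report:
      conf <n_memb> <n_join> <n_left> memb <ids> join <ids> left <ids>
    """
    m = re.match(r"conf (\d+) (\d+) (\d+) memb(.*) join(.*) left(.*)$", line)
    if not m:
        raise ValueError("not a confchg line: %r" % line)
    counts = tuple(int(x) for x in m.group(1, 2, 3))
    memb, join, left = ([int(x) for x in g.split()] for g in m.group(4, 5, 6))
    # The three leading counts should match the list lengths.
    assert counts == (len(memb), len(join), len(left))
    return memb, join, left

# "conf 1 0 1 memb 5 join left 4": member list [5], node 4 in the left list.
# "conf 0 0 1 memb join left 5": no members remain, node 5 in the left list.
```

This only decodes the notation; the point of the bug is that these confchgs appeared at all, since neither node 4 nor node 5 actually left or failed.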

Comment 1 Steven Dake 2009-06-25 15:08:55 EDT
Identify whether this is a regression.  Dave said he saw it recently after upgrading corosync but hadn't seen it previously.  Try an older version to reproduce, and if it is a regression, bisect to identify the patch that introduced the problem.

Comment 2 Jan Friesse 2009-06-26 05:56:34 EDT
I have only 3 nodes, so I tested:
- 2 nodes cpgx -l0 -e0 -d1 and 1 node cpgx -l0 -e0 -d0
- 2 nodes cpgx -l0 -e0 -d0 and 1 node cpgx -l0 -e0 -d1

I was not able to reproduce this issue (current trunk).  Does this issue need more than 3 nodes to reproduce?
Comment 3 David Teigland 2009-06-26 12:39:41 EDT
I've not been able to hit this with three nodes, so it looks like you'll need four.
Comment 4 Jan Friesse 2009-06-29 05:48:48 EDT
Sadly, I have only 3 nodes available, so I'm reassigning this back to Steve (I hope he has more than 3 nodes).
Comment 5 David Teigland 2009-07-08 13:16:50 EDT
Using Honza's cpgx fix from bug 504036, I've not been able to reproduce this problem.
