Bug 507953

Summary: cpg confchg removes nodes that didn't leave or fail
Product: Fedora
Version: rawhide
Component: corosync
Status: CLOSED NOTABUG
Severity: medium
Priority: low
Hardware: All
OS: Linux
Reporter: David Teigland <teigland>
Assignee: Steven Dake <sdake>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agk, fdinitto, sdake
Doc Type: Bug Fix
Last Closed: 2009-07-08 17:16:50 UTC

Description David Teigland 2009-06-24 20:10:35 UTC
Description of problem:

I think this is a new regression (I've not seen it before) since updating to:

[svn/corosync/trunk]% svn info
Path: .
URL: svn+ssh://svn.fedorahosted.org/svn/corosync/trunk
Repository Root: svn+ssh://svn.fedorahosted.org/svn/corosync
Repository UUID: fd59a12c-fef9-0310-b244-a6a79926bd2f
Revision: 2289
Node Kind: directory
Schedule: normal
Last Changed Author: sdake
Last Changed Rev: 2289
Last Changed Date: 2009-06-24 00:21:13 -0500 (Wed, 24 Jun 2009)  


I'm trying to test a work-around for bz 504677, where I add a sleep(5) after the cman_tool join -w in cpgx to make sure the node has really joined the cluster before it joins the cpg and starts the test.
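
A minimal sketch of the start-up ordering being tested (the join_group() helper, the group name "cpgx", and the error handling are my assumptions; the real cpgx source differs):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <corosync/cpg.h>

/* Hypothetical sketch, not the actual cpgx code: join the cluster first,
 * pause so membership has really settled, then join the cpg. */
static int join_group(cpg_handle_t handle)
{
        struct cpg_name name;

        if (system("cman_tool join -w") != 0)   /* -w waits for the join */
                return -1;
        sleep(5);                               /* bz 504677 work-around under test */

        strcpy(name.value, "cpgx");             /* assumed group name */
        name.length = strlen(name.value);
        return (cpg_join(handle, &name) == CS_OK) ? 0 : -1;
}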

Two nodes (1 and 2) run: cpgx -l0 -e0 -d1
The other two (4 and 5): cpgx -l0 -e0 -d0

(sometimes this test hits bz 504036, in which case I kill the stuck cpgx and restart it manually)

What I see is that nodes 1 and 2 both die, and the remaining nodes 4 and 5 continue running, just sending messages.  Then 4 and 5 both get this confchg:

conf 1 0 1 memb 5 join left 4  -- indicating that node 4 has left/failed

then node 5 (the only remaining node) gets this confchg:

conf 0 0 1 memb join left 5  -- indicating that it too has left/failed

Neither 4 nor 5 left the cpg or failed, so there should be no confchgs at all.
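
For reference, a confchg callback along these lines would produce the "conf ..." output quoted above (an illustrative sketch assuming the current cpg.h callback signature; exact types vary between corosync versions and the real cpgx callback may differ):

#include <stdio.h>
#include <corosync/cpg.h>

/* Sketch: print "conf <memb> <join> <left> memb ... join ... left ..."
 * from the lists delivered to the cpg confchg callback. */
static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *memb, size_t memb_cnt,
                       const struct cpg_address *left, size_t left_cnt,
                       const struct cpg_address *join, size_t join_cnt)
{
        size_t i;

        printf("conf %zu %zu %zu memb", memb_cnt, join_cnt, left_cnt);
        for (i = 0; i < memb_cnt; i++)
                printf(" %u", memb[i].nodeid);
        printf(" join");
        for (i = 0; i < join_cnt; i++)
                printf(" %u", join[i].nodeid);
        printf(" left");
        for (i = 0; i < left_cnt; i++)
                printf(" %u", left[i].nodeid);
        printf("\n");
}

Read that way, "conf 1 0 1 memb 5 join left 4" means one remaining member (5), nothing joined, and one node (4) reported as left/failed.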


Comment 1 Steven Dake 2009-06-25 19:08:55 UTC
Honzaf,
Please determine whether this is a regression.  Dave says he saw it recently after upgrading corosync but had not hit it previously.  Try an older version to reproduce it and, if it is a regression, bisect to identify the patch that introduced the problem.

Thanks

Comment 2 Jan Friesse 2009-06-26 09:56:34 UTC
David,
I have only 3 nodes, so I tested:
- 2 nodes cpgx -l0 -e0 -d1 and 1 node cpgx -l0 -e0 -d0
- 2 nodes cpgx -l0 -e0 -d0 and 1 node cpgx -l0 -e0 -d1

I was not able to reproduce this issue (current trunk). Does this issue need more than 3 nodes to reproduce?

Comment 3 David Teigland 2009-06-26 16:39:41 UTC
I've not been able to hit this with three nodes, so it looks like you'll need four.

Comment 4 Jan Friesse 2009-06-29 09:48:48 UTC
Sadly, I have only 3 nodes available, so I'm reassigning this back to Steve (I hope he has more than 3 nodes).

Comment 5 David Teigland 2009-07-08 17:16:50 UTC
Using Honza's cpgx fix from bug 504036, I've not been able to reproduce this problem.