569525 – CPG - finalize followed by join can result in error code 14

Bug 569525 - CPG - finalize followed by join can result in error code 14

Summary: CPG - finalize followed by join can result in error code 14

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	corosync
Sub Component:
Version:	rawhide
Hardware:	All
OS:	All
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Jan Friesse
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:	568356
Blocks:	568510 568650
TreeView+	depends on / blocked

Reported:	2010-03-01 16:37 UTC by Jan Friesse
Modified:	2010-03-04 12:18 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:	568356
Environment:
Last Closed:	2010-03-04 12:18:47 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Proposed patch (4.88 KB, patch) 2010-03-01 16:39 UTC, Jan Friesse	no flags	Details \| Diff
View All

Description Jan Friesse 2010-03-01 16:37:12 UTC

+++ This bug was initially created as a clone of Bug #568356 +++

Created an attachment (id=396283)
Test case

Description of problem:
Inside my CPG application, The confchg callback is called with 'dead'
members:

[debug] cpg member node 3 pid 1132
[debug] cpg member node 3 pid 14640

for example process 1132 does not exists any longer on node 3.


Version-Release number of selected component (if applicable):
TRUNK

How reproducible:
We have reliable reproducer in attachment.

Steps to Reproduce:
1. gcc -Wall cpgtest.c $(shell pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
2. keep it run
  
Actual results:
# cpgtest
...
starting cpgtest
calling cpg_initialize
calling cpg_join
starting main loop (hangs here)

Expected results:
Never hang

Additional info:
Taken from OpenAIS mailing list

--- Additional comment from jfriesse on 2010-02-25 09:53:26 EST ---

Created an attachment (id=396287)
Proposed patch

Cpg join with undelivered leave message

Patch handles situation, when on one node, one process:
- join cpg
- do same actions
- leave cpg
- join cpg again

Following sequence can (racy) end with broken process_info list.

To solve this problem, one more check is done in
message_handler_req_lib_cpg_join so if process_info with same pid and
group as new join request exists, CPG_ERR_EXIST is returned.

--- Additional comment from dietmar on 2010-02-26 03:41:46 EST ---

works - no more ghost members.

But how can i handle CPG_ERR_EXIST correctly? 
Simply call join again seems to work:

	while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
		result == CPG_ERR_EXIST ) { 
		printf("cpg_join returned %d\n", result);
		sleep (1);
	}

or is there a better way?

--- Additional comment from jfriesse on 2010-02-26 03:50:11 EST ---

(In reply to comment #2)
> works - no more ghost members.
> 
> But how can i handle CPG_ERR_EXIST correctly? 
> Simply call join again seems to work:
> 
>  while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
>   result == CPG_ERR_EXIST ) { 
>   printf("cpg_join returned %d\n", result);
>   sleep (1);
>  }
> 
> or is there a better way?    

Hi,
thanks for very good news.

About handling. From my point of view, returning CPG_ERR_EXIST is not best way, I will "rework" patch to return CS_ERR_TRY_AGAIN because this is exactly what we need to return in such situations.

--- Additional comment from jfriesse on 2010-02-26 04:13:27 EST ---

Created an attachment (id=396497)
Proposed patch - returns err_try_again

Better version of patch, which return CPG_ERR_TRY_AGAIN rather than ERR_EXISTS.

--- Additional comment from dietmar on 2010-02-26 04:38:21 EST ---

I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10 minutes).

--- Additional comment from jfriesse on 2010-02-26 04:45:10 EST ---

(In reply to comment #5)
> I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10
> minutes).    

Ya, thats correct. I think it's better than CPG_ERR_EXIST. Or do you mean some different situation?

--- Additional comment from dietmar on 2010-02-26 04:49:54 EST ---

Sorry, I still get CPG_ERR_EXIST sometimes.

--- Additional comment from jfriesse on 2010-02-26 05:49:39 EST ---

(In reply to comment #7)
> Sorry, I still get CPG_ERR_EXIST sometimes.    

Ya,
this can happen when you call cpg_join with same pid/nodeid/group_name more than once. I hope it doesn't happening in test you sent (it shouldn't).

--- Additional comment from jfriesse on 2010-02-26 06:28:27 EST ---

(In reply to comment #8)
> (In reply to comment #7)
> > Sorry, I still get CPG_ERR_EXIST sometimes.    
> 
> Ya,
> this can happen when you call cpg_join with same pid/nodeid/group_name more
> than once. I hope it doesn't happening in test you sent (it shouldn't).    

Ok, I must correct myself. It can really happen, and it's because how coroipc is made. What happened in your test case:

--- your app + cpg + ipc lib ---
- cpg_init + join + ...
- cpg_finalize -> coroipcc_service_disconnect -> close fd
- cpg_init + join -> error

Now finally corosync realize, that your previous fd is closed so calls cpg_lib_exit_fn and this will remove previous cpg_pd from list.

I'm not sure, if this need to be handled somehow and if yes, how exactly. Steve, what's your opinion in such case?

--- Additional comment from sdake on 2010-03-01 03:59:50 EST ---

clone this bz as a new bug related to this separate finalize followed by join issue.  I believe the original problem of stale cpg groups is fixed by your patches.

Regards
-steve

Comment 1 Jan Friesse 2010-03-01 16:39:32 UTC

Created attachment 397114 [details]
Proposed patch

Comment 2 Jan Friesse 2010-03-01 16:41:19 UTC

Dietmar,
I hope that patch solve problem you had. So basically, you should never get CPG_ERR_EXIST error in your test case.

Comment 3 Jan Friesse 2010-03-04 12:18:47 UTC

Patch is now commited in SVN Trunk. Closing this one as upstream.

Note You need to log in before you can comment on or make changes to this bug.