568356 – stale CPG members in confchg callback

Bug 568356 - stale CPG members in confchg callback

Summary: stale CPG members in confchg callback

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	corosync
Sub Component:
Version:	rawhide
Hardware:	All
OS:	All
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Jan Friesse
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	568510 568650 569525
TreeView+	depends on / blocked

Reported:	2010-02-25 14:51 UTC by Jan Friesse
Modified:	2011-01-11 23:21 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:
Clones:	568510 568650 569525 (view as bug list)
Environment:
Last Closed:	2010-03-04 12:15:41 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Test case (3.94 KB, text/x-csrc) 2010-02-25 14:51 UTC, Jan Friesse	no flags	Details
Proposed patch (1.46 KB, patch) 2010-02-25 14:53 UTC, Jan Friesse	no flags	Details \| Diff
Proposed patch - returns err_try_again (1.47 KB, patch) 2010-02-26 09:13 UTC, Jan Friesse	no flags	Details \| Diff
Show Obsolete (1) View All

Description Jan Friesse 2010-02-25 14:51:49 UTC

Created attachment 396283 [details]
Test case

Description of problem:
Inside my CPG application, The confchg callback is called with 'dead'
members:

[debug] cpg member node 3 pid 1132
[debug] cpg member node 3 pid 14640

for example process 1132 does not exists any longer on node 3.


Version-Release number of selected component (if applicable):
TRUNK

How reproducible:
We have reliable reproducer in attachment.

Steps to Reproduce:
1. gcc -Wall cpgtest.c $(shell pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
2. keep it run
  
Actual results:
# cpgtest
...
starting cpgtest
calling cpg_initialize
calling cpg_join
starting main loop (hangs here)

Expected results:
Never hang

Additional info:
Taken from OpenAIS mailing list

Comment 1 Jan Friesse 2010-02-25 14:53:26 UTC

Created attachment 396287 [details]
Proposed patch

Cpg join with undelivered leave message

Patch handles situation, when on one node, one process:
- join cpg
- do same actions
- leave cpg
- join cpg again

Following sequence can (racy) end with broken process_info list.

To solve this problem, one more check is done in
message_handler_req_lib_cpg_join so if process_info with same pid and
group as new join request exists, CPG_ERR_EXIST is returned.

Comment 2 dietmar 2010-02-26 08:41:46 UTC

works - no more ghost members.

But how can i handle CPG_ERR_EXIST correctly? 
Simply call join again seems to work:

	while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
		result == CPG_ERR_EXIST ) { 
		printf("cpg_join returned %d\n", result);
		sleep (1);
	}

or is there a better way?

Comment 3 Jan Friesse 2010-02-26 08:50:11 UTC

(In reply to comment #2)
> works - no more ghost members.
> 
> But how can i handle CPG_ERR_EXIST correctly? 
> Simply call join again seems to work:
> 
>  while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
>   result == CPG_ERR_EXIST ) { 
>   printf("cpg_join returned %d\n", result);
>   sleep (1);
>  }
> 
> or is there a better way?    

Hi,
thanks for very good news.

About handling. From my point of view, returning CPG_ERR_EXIST is not best way, I will "rework" patch to return CS_ERR_TRY_AGAIN because this is exactly what we need to return in such situations.

Comment 4 Jan Friesse 2010-02-26 09:13:27 UTC

Created attachment 396497 [details]
Proposed patch - returns err_try_again

Better version of patch, which return CPG_ERR_TRY_AGAIN rather than ERR_EXISTS.

Comment 5 dietmar 2010-02-26 09:38:21 UTC

I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10 minutes).

Comment 6 Jan Friesse 2010-02-26 09:45:10 UTC

(In reply to comment #5)
> I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10
> minutes).    

Ya, thats correct. I think it's better than CPG_ERR_EXIST. Or do you mean some different situation?

Comment 7 dietmar 2010-02-26 09:49:54 UTC

Sorry, I still get CPG_ERR_EXIST sometimes.

Comment 8 Jan Friesse 2010-02-26 10:49:39 UTC

(In reply to comment #7)
> Sorry, I still get CPG_ERR_EXIST sometimes.    

Ya,
this can happen when you call cpg_join with same pid/nodeid/group_name more than once. I hope it doesn't happening in test you sent (it shouldn't).

Comment 9 Jan Friesse 2010-02-26 11:28:27 UTC

(In reply to comment #8)
> (In reply to comment #7)
> > Sorry, I still get CPG_ERR_EXIST sometimes.    
> 
> Ya,
> this can happen when you call cpg_join with same pid/nodeid/group_name more
> than once. I hope it doesn't happening in test you sent (it shouldn't).    

Ok, I must correct myself. It can really happen, and it's because how coroipc is made. What happened in your test case:

--- your app + cpg + ipc lib ---
- cpg_init + join + ...
- cpg_finalize -> coroipcc_service_disconnect -> close fd
- cpg_init + join -> error

Now finally corosync realize, that your previous fd is closed so calls cpg_lib_exit_fn and this will remove previous cpg_pd from list.

I'm not sure, if this need to be handled somehow and if yes, how exactly. Steve, what's your opinion in such case?

Comment 10 Steven Dake 2010-03-01 08:59:50 UTC

clone this bz as a new bug related to this separate finalize followed by join issue.  I believe the original problem of stale cpg groups is fixed by your patches.

Regards
-steve

Comment 11 Jan Friesse 2010-03-01 16:38:17 UTC

Bug cloned as https://bugzilla.redhat.com/show_bug.cgi?id=569525

Comment 12 Jan Friesse 2010-03-04 12:15:41 UTC

Patch is merged in SVN Trunk, so closing this bug as upstream.

Comment 13 Douglas Silas 2011-01-11 23:21:11 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In rare circumstances, an invalid CPG member was delivered in a configuration change callback.

Note You need to log in before you can comment on or make changes to this bug.