Bug 568356 - stale CPG members in confchg callback
Summary: stale CPG members in confchg callback
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: corosync
Version: rawhide
Hardware: All
OS: All
low
low
Target Milestone: ---
Assignee: Jan Friesse
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 568510 568650 569525
TreeView+ depends on / blocked
 
Reported: 2010-02-25 14:51 UTC by Jan Friesse
Modified: 2011-01-11 23:21 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In rare circumstances, an invalid CPG member was delivered in a configuration change callback.
Clone Of:
: 568510 568650 569525 (view as bug list)
Environment:
Last Closed: 2010-03-04 12:15:41 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
Test case (3.94 KB, text/x-csrc)
2010-02-25 14:51 UTC, Jan Friesse
no flags Details
Proposed patch (1.46 KB, patch)
2010-02-25 14:53 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - returns err_try_again (1.47 KB, patch)
2010-02-26 09:13 UTC, Jan Friesse
no flags Details | Diff

Description Jan Friesse 2010-02-25 14:51:49 UTC
Created attachment 396283 [details]
Test case

Description of problem:
Inside my CPG application, The confchg callback is called with 'dead'
members:

[debug] cpg member node 3 pid 1132
[debug] cpg member node 3 pid 14640

for example process 1132 does not exists any longer on node 3.


Version-Release number of selected component (if applicable):
TRUNK

How reproducible:
We have reliable reproducer in attachment.

Steps to Reproduce:
1. gcc -Wall cpgtest.c $(shell pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
2. keep it run
  
Actual results:
# cpgtest
...
starting cpgtest
calling cpg_initialize
calling cpg_join
starting main loop (hangs here)

Expected results:
Never hang

Additional info:
Taken from OpenAIS mailing list

Comment 1 Jan Friesse 2010-02-25 14:53:26 UTC
Created attachment 396287 [details]
Proposed patch

Cpg join with undelivered leave message

Patch handles situation, when on one node, one process:
- join cpg
- do same actions
- leave cpg
- join cpg again

Following sequence can (racy) end with broken process_info list.

To solve this problem, one more check is done in
message_handler_req_lib_cpg_join so if process_info with same pid and
group as new join request exists, CPG_ERR_EXIST is returned.

Comment 2 dietmar 2010-02-26 08:41:46 UTC
works - no more ghost members.

But how can i handle CPG_ERR_EXIST correctly? 
Simply call join again seems to work:

	while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
		result == CPG_ERR_EXIST ) { 
		printf("cpg_join returned %d\n", result);
		sleep (1);
	}

or is there a better way?

Comment 3 Jan Friesse 2010-02-26 08:50:11 UTC
(In reply to comment #2)
> works - no more ghost members.
> 
> But how can i handle CPG_ERR_EXIST correctly? 
> Simply call join again seems to work:
> 
>  while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
>   result == CPG_ERR_EXIST ) { 
>   printf("cpg_join returned %d\n", result);
>   sleep (1);
>  }
> 
> or is there a better way?    

Hi,
thanks for very good news.

About handling. From my point of view, returning CPG_ERR_EXIST is not best way, I will "rework" patch to return CS_ERR_TRY_AGAIN because this is exactly what we need to return in such situations.

Comment 4 Jan Friesse 2010-02-26 09:13:27 UTC
Created attachment 396497 [details]
Proposed patch - returns err_try_again

Better version of patch, which return CPG_ERR_TRY_AGAIN rather than ERR_EXISTS.

Comment 5 dietmar 2010-02-26 09:38:21 UTC
I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10 minutes).

Comment 6 Jan Friesse 2010-02-26 09:45:10 UTC
(In reply to comment #5)
> I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10
> minutes).    

Ya, thats correct. I think it's better than CPG_ERR_EXIST. Or do you mean some different situation?

Comment 7 dietmar 2010-02-26 09:49:54 UTC
Sorry, I still get CPG_ERR_EXIST sometimes.

Comment 8 Jan Friesse 2010-02-26 10:49:39 UTC
(In reply to comment #7)
> Sorry, I still get CPG_ERR_EXIST sometimes.    

Ya,
this can happen when you call cpg_join with same pid/nodeid/group_name more than once. I hope it doesn't happening in test you sent (it shouldn't).

Comment 9 Jan Friesse 2010-02-26 11:28:27 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Sorry, I still get CPG_ERR_EXIST sometimes.    
> 
> Ya,
> this can happen when you call cpg_join with same pid/nodeid/group_name more
> than once. I hope it doesn't happening in test you sent (it shouldn't).    

Ok, I must correct myself. It can really happen, and it's because how coroipc is made. What happened in your test case:

--- your app + cpg + ipc lib ---
- cpg_init + join + ...
- cpg_finalize -> coroipcc_service_disconnect -> close fd
- cpg_init + join -> error

Now finally corosync realize, that your previous fd is closed so calls cpg_lib_exit_fn and this will remove previous cpg_pd from list.

I'm not sure, if this need to be handled somehow and if yes, how exactly. Steve, what's your opinion in such case?

Comment 10 Steven Dake 2010-03-01 08:59:50 UTC
clone this bz as a new bug related to this separate finalize followed by join issue.  I believe the original problem of stale cpg groups is fixed by your patches.

Regards
-steve

Comment 11 Jan Friesse 2010-03-01 16:38:17 UTC
Bug cloned as https://bugzilla.redhat.com/show_bug.cgi?id=569525

Comment 12 Jan Friesse 2010-03-04 12:15:41 UTC
Patch is merged in SVN Trunk, so closing this bug as upstream.

Comment 13 Douglas Silas 2011-01-11 23:21:11 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In rare circumstances, an invalid CPG member was delivered in a configuration change callback.


Note You need to log in before you can comment on or make changes to this bug.