+++ This bug was initially created as a clone of Bug #568356 +++ Created an attachment (id=396283) Test case Description of problem: Inside my CPG application, The confchg callback is called with 'dead' members: [debug] cpg member node 3 pid 1132 [debug] cpg member node 3 pid 14640 for example process 1132 does not exists any longer on node 3. Version-Release number of selected component (if applicable): TRUNK How reproducible: We have reliable reproducer in attachment. Steps to Reproduce: 1. gcc -Wall cpgtest.c $(shell pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest 2. keep it run Actual results: # cpgtest ... starting cpgtest calling cpg_initialize calling cpg_join starting main loop (hangs here) Expected results: Never hang Additional info: Taken from OpenAIS mailing list --- Additional comment from jfriesse on 2010-02-25 09:53:26 EST --- Created an attachment (id=396287) Proposed patch Cpg join with undelivered leave message Patch handles situation, when on one node, one process: - join cpg - do same actions - leave cpg - join cpg again Following sequence can (racy) end with broken process_info list. To solve this problem, one more check is done in message_handler_req_lib_cpg_join so if process_info with same pid and group as new join request exists, CPG_ERR_EXIST is returned. --- Additional comment from dietmar on 2010-02-26 03:41:46 EST --- works - no more ghost members. But how can i handle CPG_ERR_EXIST correctly? Simply call join again seems to work: while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN || result == CPG_ERR_EXIST ) { printf("cpg_join returned %d\n", result); sleep (1); } or is there a better way? --- Additional comment from jfriesse on 2010-02-26 03:50:11 EST --- (In reply to comment #2) > works - no more ghost members. > > But how can i handle CPG_ERR_EXIST correctly? > Simply call join again seems to work: > > while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN || > result == CPG_ERR_EXIST ) { > printf("cpg_join returned %d\n", result); > sleep (1); > } > > or is there a better way? Hi, thanks for very good news. About handling. From my point of view, returning CPG_ERR_EXIST is not best way, I will "rework" patch to return CS_ERR_TRY_AGAIN because this is exactly what we need to return in such situations. --- Additional comment from jfriesse on 2010-02-26 04:13:27 EST --- Created an attachment (id=396497) Proposed patch - returns err_try_again Better version of patch, which return CPG_ERR_TRY_AGAIN rather than ERR_EXISTS. --- Additional comment from dietmar on 2010-02-26 04:38:21 EST --- I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10 minutes). --- Additional comment from jfriesse on 2010-02-26 04:45:10 EST --- (In reply to comment #5) > I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10 > minutes). Ya, thats correct. I think it's better than CPG_ERR_EXIST. Or do you mean some different situation? --- Additional comment from dietmar on 2010-02-26 04:49:54 EST --- Sorry, I still get CPG_ERR_EXIST sometimes. --- Additional comment from jfriesse on 2010-02-26 05:49:39 EST --- (In reply to comment #7) > Sorry, I still get CPG_ERR_EXIST sometimes. Ya, this can happen when you call cpg_join with same pid/nodeid/group_name more than once. I hope it doesn't happening in test you sent (it shouldn't). --- Additional comment from jfriesse on 2010-02-26 06:28:27 EST --- (In reply to comment #8) > (In reply to comment #7) > > Sorry, I still get CPG_ERR_EXIST sometimes. > > Ya, > this can happen when you call cpg_join with same pid/nodeid/group_name more > than once. I hope it doesn't happening in test you sent (it shouldn't). Ok, I must correct myself. It can really happen, and it's because how coroipc is made. What happened in your test case: --- your app + cpg + ipc lib --- - cpg_init + join + ... - cpg_finalize -> coroipcc_service_disconnect -> close fd - cpg_init + join -> error Now finally corosync realize, that your previous fd is closed so calls cpg_lib_exit_fn and this will remove previous cpg_pd from list. I'm not sure, if this need to be handled somehow and if yes, how exactly. Steve, what's your opinion in such case? --- Additional comment from sdake on 2010-03-01 03:59:50 EST --- clone this bz as a new bug related to this separate finalize followed by join issue. I believe the original problem of stale cpg groups is fixed by your patches. Regards -steve
Created attachment 397114 [details] Proposed patch
Dietmar, I hope that patch solve problem you had. So basically, you should never get CPG_ERR_EXIST error in your test case.
Patch is now commited in SVN Trunk. Closing this one as upstream.