+++ This bug was initially created as a clone of Bug #568356 +++

Created an attachment (id=396283)
Test case

Description of problem:
Inside my CPG application, the confchg callback is called with 'dead' members:

[debug] cpg member node 3 pid 1132
[debug] cpg member node 3 pid 14640

For example, process 1132 no longer exists on node 3.

Version-Release number of selected component (if applicable):
TRUNK

How reproducible:
The attached test case reproduces the problem reliably.

Steps to Reproduce:
1. gcc -Wall cpgtest.c $(pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
2. Keep it running.

Actual results:
# cpgtest
...
starting cpgtest
calling cpg_initialize
calling cpg_join
starting main loop
(hangs here)

Expected results:
Never hang

Additional info:
Taken from OpenAIS mailing list

--- Additional comment from jfriesse on 2010-02-25 09:53:26 EST ---

Created an attachment (id=396287)
Proposed patch

Cpg join with undelivered leave message

The patch handles the situation where one process on one node:
- joins a cpg
- does some actions
- leaves the cpg
- joins the cpg again

This sequence can (depending on timing) end with a broken process_info list.

To solve this problem, one more check is done in
message_handler_req_lib_cpg_join: if a process_info with the same pid and
group as the new join request already exists, CPG_ERR_EXIST is returned.

--- Additional comment from dietmar on 2010-02-26 03:41:46 EST ---

Works - no more ghost members.

But how can I handle CPG_ERR_EXIST correctly?
Simply calling join again seems to work:

    while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
           result == CPG_ERR_EXIST) {
        printf("cpg_join returned %d\n", result);
        sleep(1);
    }

Or is there a better way?

--- Additional comment from jfriesse on 2010-02-26 03:50:11 EST ---

(In reply to comment #2)
> Works - no more ghost members.
>
> But how can I handle CPG_ERR_EXIST correctly?
> Simply calling join again seems to work:
>
>     while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
>            result == CPG_ERR_EXIST) {
>         printf("cpg_join returned %d\n", result);
>         sleep(1);
>     }
>
> Or is there a better way?

Hi,
thanks for the very good news.

About the handling: from my point of view, returning CPG_ERR_EXIST is not the
best way. I will "rework" the patch to return CS_ERR_TRY_AGAIN, because this is
exactly what we need to return in such situations.

--- Additional comment from jfriesse on 2010-02-26 04:13:27 EST ---

Created an attachment (id=396497)
Proposed patch - returns err_try_again

Better version of the patch, which returns CPG_ERR_TRY_AGAIN rather than
ERR_EXISTS.
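
For anyone landing here later: once the join only ever reports "try again", the retry loop discussed above could be bounded so that a caller cannot spin forever. The following is a minimal sketch, not code from the patch or the test case. It uses stand-in status codes and a fake join function (all `demo_*` names are hypothetical) so that it compiles without the corosync headers; a real application would call cpg_join() and compare against the library's own try-again code instead.

```c
#include <stdio.h>

/* Stand-in status codes (assumption): a real program would use the
 * library's own error type and its "try again" value instead. */
typedef enum {
    DEMO_OK = 0,
    DEMO_ERR_TRY_AGAIN = 1,
    DEMO_ERR_OTHER = 2
} demo_status_t;

/* Hypothetical callback type wrapping the actual join call. */
typedef demo_status_t (*demo_join_fn)(void *ctx);

/* Call join() until it stops reporting "try again" or until
 * max_attempts calls have been made. Returns the last status and
 * stores the number of calls made in *attempts_used (if non-NULL). */
static demo_status_t demo_join_with_retry(demo_join_fn join, void *ctx,
                                          int max_attempts,
                                          int *attempts_used)
{
    demo_status_t result = DEMO_ERR_TRY_AGAIN;
    int attempt = 0;

    while (attempt < max_attempts) {
        result = join(ctx);
        attempt++;
        if (result != DEMO_ERR_TRY_AGAIN) {
            break;
        }
        fprintf(stderr, "join returned try-again (attempt %d)\n", attempt);
        /* A real caller would pause here, e.g. sleep(1), before retrying. */
    }

    if (attempts_used != NULL) {
        *attempts_used = attempt;
    }
    return result;
}

/* Fake join for illustration: reports try-again until the counter
 * pointed to by ctx reaches zero, then reports success. */
static demo_status_t demo_fake_join(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return DEMO_ERR_TRY_AGAIN;
    }
    return DEMO_OK;
}
```

Passing the join as a callback is only there to keep the sketch testable without a running cluster; in an application the loop body would call cpg_join(handle, &group_name) directly, as in comment #2.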
Created attachment 396500
Backport of patch and SVN#2364

Backported version of the patch. It also includes SVN#2364, because that one must be applied for the new patch to work correctly.
Created attachment 396501
Test case backported for whitetank
Honza,

Please keep each commit from trunk as a separate backport and post to the ml rather than merging both.

Regards
-steve
Created attachment 397130
Original patch split - part 1

Allow only one connection per node pid.
Created attachment 397132
Original patch split - part 2

Cpg join with undelivered leave message.
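
As an illustration of the check described in the proposed patch (reject a join while a process_info with the same pid and group still exists), here is a minimal sketch. It is not the patch itself: the record layout, the `demo_*` names and the result codes are hypothetical simplifications, and matching on the node id as well as the pid is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified record (assumption): the real process_info
 * structure in the cpg service is not reproduced here. */
struct demo_process_info {
    unsigned int nodeid;
    unsigned int pid;
    char group[32];
    struct demo_process_info *next;
};

/* Stand-in result codes (assumption), mirroring "ok" and "try again". */
typedef enum {
    DEMO_JOIN_OK = 0,
    DEMO_JOIN_TRY_AGAIN = 1
} demo_join_result_t;

/* Walk the list and refuse the join while an entry for the same
 * node, pid and group is still present, i.e. while the earlier
 * leave has not yet been delivered and processed. */
static demo_join_result_t demo_check_join(const struct demo_process_info *head,
                                          unsigned int nodeid,
                                          unsigned int pid,
                                          const char *group)
{
    const struct demo_process_info *pi;

    for (pi = head; pi != NULL; pi = pi->next) {
        if (pi->nodeid == nodeid && pi->pid == pid &&
            strcmp(pi->group, group) == 0) {
            return DEMO_JOIN_TRY_AGAIN;
        }
    }
    return DEMO_JOIN_OK;
}
```

With a stale entry for node 3, pid 1132 still in the list, a repeated join from that pid is told to try again, while a join from a different pid goes through.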
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
In rare circumstances, an invalid CPG member was delivered in a configuration change callback.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0100.html