Created attachment 396283 [details]
Test case

Description of problem:
Inside my CPG application, the confchg callback is called with 'dead' members:

[debug] cpg member node 3 pid 1132
[debug] cpg member node 3 pid 14640

For example, process 1132 no longer exists on node 3.

Version-Release number of selected component (if applicable):
TRUNK

How reproducible:
A reliable reproducer is attached.

Steps to Reproduce:
1. gcc -Wall cpgtest.c $(pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
2. Keep it running

Actual results:
# cpgtest
...
starting cpgtest
calling cpg_initialize
calling cpg_join
starting main loop
(hangs here)

Expected results:
Never hangs

Additional info:
Taken from the OpenAIS mailing list

Created attachment 396287 [details]
Proposed patch

Cpg join with undelivered leave message

The patch handles the situation where one process on one node:
- joins a cpg group
- does some actions
- leaves the cpg group
- joins the cpg group again

This sequence is racy and can end with a broken process_info list. To solve this problem, one more check is done in message_handler_req_lib_cpg_join: if a process_info entry with the same pid and group as the new join request already exists, CPG_ERR_EXIST is returned.

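
The extra check described above can be pictured roughly as follows. This is only a minimal sketch, not the code from the attached patch: `struct proc_entry`, `check_join`, and the `join_result` values are hypothetical stand-ins for the daemon's process_info list and its error codes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for one record in the daemon's process_info list. */
struct proc_entry {
    pid_t pid;
    char group[128];
    struct proc_entry *next;
};

/* Hypothetical result codes; the real patch returns a cpg/cs error code. */
enum join_result { JOIN_OK, JOIN_ALREADY_PRESENT };

/*
 * Walk the list and reject the join if a record with the same pid and
 * group name is still present (for example, because an earlier leave
 * has not been processed yet).
 */
enum join_result check_join(const struct proc_entry *head, pid_t pid,
                            const char *group)
{
    assert(group != NULL);
    for (const struct proc_entry *p = head; p != NULL; p = p->next) {
        if (p->pid == pid && strcmp(p->group, group) == 0) {
            return JOIN_ALREADY_PRESENT;
        }
    }
    return JOIN_OK;
}
```

In this sketch a join from pid 1132 for a group that still has a record for pid 1132 is rejected, while a join from a different pid, or into an empty list, is accepted.
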
Works - no more ghost members.

But how can I handle CPG_ERR_EXIST correctly?
Simply calling join again seems to work:

while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
       result == CPG_ERR_EXIST) {
        printf("cpg_join returned %d\n", result);
        sleep(1);
}

Or is there a better way?

(In reply to comment #2)
> works - no more ghost members.
>
> But how can i handle CPG_ERR_EXIST correctly?
> Simply call join again seems to work:
>
> while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
>        result == CPG_ERR_EXIST ) {
>         printf("cpg_join returned %d\n", result);
>         sleep (1);
> }
>
> or is there a better way?

Hi,
thanks for the very good news.

About the handling: from my point of view, returning CPG_ERR_EXIST is not the best way. I will rework the patch to return CS_ERR_TRY_AGAIN, because this is exactly what we need to return in such situations.

Created attachment 396497 [details]
Proposed patch - returns err_try_again

Better version of the patch, which returns CPG_ERR_TRY_AGAIN rather than CPG_ERR_EXIST.

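
Since the reworked patch reports a "try again" condition, a caller may prefer a bounded retry over an endless loop. The following is a minimal sketch under stated assumptions: `join_with_retry`, `join_fn`, `flaky_join`, and the `join_status` values are hypothetical names used for illustration, with `JOIN_TRY_AGAIN` standing in for CS_ERR_TRY_AGAIN. A real caller would wrap cpg_join() in the callback.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical status codes standing in for the cs_error_t values. */
enum join_status { JOIN_OK = 0, JOIN_TRY_AGAIN = 1, JOIN_FAILED = 2 };

/* Callback type; in a real program it would call cpg_join(). */
typedef enum join_status (*join_fn)(void *ctx);

/*
 * Call join() up to max_attempts times, sleeping delay_sec seconds
 * between attempts, and stop as soon as the result is anything other
 * than JOIN_TRY_AGAIN. Returns the last status seen.
 */
enum join_status join_with_retry(join_fn join, void *ctx,
                                 int max_attempts, unsigned int delay_sec)
{
    assert(join != NULL && max_attempts > 0);
    enum join_status st = JOIN_TRY_AGAIN;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        st = join(ctx);
        if (st != JOIN_TRY_AGAIN) {
            return st;
        }
        if (attempt + 1 < max_attempts) {
            sleep(delay_sec);
        }
    }
    return st;
}

/* Illustrative callback: reports "try again" until the counter hits zero. */
enum join_status flaky_join(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return JOIN_TRY_AGAIN;
    }
    return JOIN_OK;
}
```

Bounding the attempts means a join that never succeeds surfaces as an error to the application instead of hanging it.
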
I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10 minutes).
(In reply to comment #5)
> I still get CPG_ERR_TRY_AGAIN sometimes (quite seldom - after running 10
> minutes).

Ya, that's correct. I think it's better than CPG_ERR_EXIST. Or do you mean a different situation?

Sorry, I still get CPG_ERR_EXIST sometimes.
(In reply to comment #7)
> Sorry, I still get CPG_ERR_EXIST sometimes.

Ya, this can happen when you call cpg_join with the same pid/nodeid/group_name more than once. I hope that isn't happening in the test you sent (it shouldn't).

(In reply to comment #8)
> (In reply to comment #7)
> > Sorry, I still get CPG_ERR_EXIST sometimes.
>
> Ya,
> this can happen when you call cpg_join with same pid/nodeid/group_name more
> than once. I hope it doesn't happening in test you sent (it shouldn't).

OK, I must correct myself. It really can happen, and it's because of how coroipc is designed.

What happened in your test case:
--- your app + cpg + ipc lib ---
- cpg_init + join + ...
- cpg_finalize -> coroipcc_service_disconnect -> close fd
- cpg_init + join -> error

Only now does corosync finally realize that your previous fd is closed, so it calls cpg_lib_exit_fn, which removes the previous cpg_pd from the list.

I'm not sure whether this needs to be handled somehow and, if so, how exactly.

Steve, what's your opinion in such a case?

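
The ordering problem in the sequence above can be reproduced with a tiny model. This is a sketch only, not corosync code: `struct daemon_model` and the `model_*` functions are hypothetical, and the single flag pair simply mimics "the client closed its fd" versus "the daemon has noticed and removed the old record".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the daemon-side state for one (pid, group). */
struct daemon_model {
    bool entry_present;  /* record for (pid, group) is still in the list */
    bool close_pending;  /* client closed its fd; daemon has not noticed */
};

enum model_result { MODEL_OK, MODEL_REJECTED };

/* Join is rejected while a record for the same pid/group still exists. */
enum model_result model_join(struct daemon_model *d)
{
    if (d->entry_present) {
        return MODEL_REJECTED;
    }
    d->entry_present = true;
    return MODEL_OK;
}

/* Client side: finalize only closes the fd; the record stays for now. */
void model_client_finalize(struct daemon_model *d)
{
    d->close_pending = true;
}

/* Daemon side: notices the closed fd later and removes the old record. */
void model_daemon_poll(struct daemon_model *d)
{
    if (d->close_pending) {
        d->entry_present = false;
        d->close_pending = false;
    }
}
```

In this model, a join issued between `model_client_finalize` and `model_daemon_poll` is rejected, and the same join succeeds once the daemon has processed the disconnect.
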
Clone this bz as a new bug related to this separate finalize-followed-by-join issue. I believe the original problem of stale cpg groups is fixed by your patches.

Regards
-steve

Bug cloned as https://bugzilla.redhat.com/show_bug.cgi?id=569525
Patch is merged in SVN Trunk, so closing this bug as upstream.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
In rare circumstances, an invalid CPG member was delivered in a configuration change callback.