Bug 568650 - stale CPG members in confchg callback
Summary: stale CPG members in confchg callback
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.6
Hardware: All
OS: All
Priority: low
Severity: low
Target Milestone: rc
: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 568356 569525
Blocks: 568510
Reported: 2010-02-26 09:14 UTC by Jan Friesse
Modified: 2011-01-13 23:56 UTC (History)

Fixed In Version: openais-0.80.6-28.el5
Doc Type: Bug Fix
Doc Text:
In rare circumstances, an invalid CPG member was delivered in a configuration change callback.
Clone Of: 568356
Environment:
Last Closed: 2011-01-13 23:56:09 UTC
Target Upstream Version:
Embargoed:


Attachments
Backport of patch and SVN#2364 (3.02 KB, patch)
2010-02-26 09:18 UTC, Jan Friesse
Test case backported for whitetank (3.88 KB, text/x-csrc)
2010-02-26 09:20 UTC, Jan Friesse
Original patch split - part 1 (2.15 KB, patch)
2010-03-01 17:14 UTC, Jan Friesse
Original patch split - part 2 (1.60 KB, application/octet-stream)
2010-03-01 17:16 UTC, Jan Friesse


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0100 0 normal SHIPPED_LIVE openais bug fix update 2011-01-12 17:21:13 UTC

Description Jan Friesse 2010-02-26 09:14:45 UTC
+++ This bug was initially created as a clone of Bug #568356 +++

Created an attachment (id=396283)
Test case

Description of problem:
Inside my CPG application, the confchg callback is called with 'dead'
members:

[debug] cpg member node 3 pid 1132
[debug] cpg member node 3 pid 14640

For example, process 1132 no longer exists on node 3.


Version-Release number of selected component (if applicable):
TRUNK

How reproducible:
A reliable reproducer is attached.

Steps to Reproduce:
1. gcc -Wall cpgtest.c $(pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
2. Run cpgtest and keep it running.
  
Actual results:
# cpgtest
...
starting cpgtest
calling cpg_initialize
calling cpg_join
starting main loop (hangs here)

Expected results:
The test never hangs.

Additional info:
Taken from the OpenAIS mailing list.

--- Additional comment from jfriesse on 2010-02-25 09:53:26 EST ---

Created an attachment (id=396287)
Proposed patch

Cpg join with undelivered leave message

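The patch handles the situation where one process on one node:
- joins a cpg
- performs some actions
- leaves the cpg
- joins the cpg again

This sequence is racy: when the new join arrives while the previous leave
message is still undelivered, it can end with a broken process_info list.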

To solve this problem, message_handler_req_lib_cpg_join performs one more
check: if a process_info with the same pid and group as the new join
request already exists, CPG_ERR_EXIST is returned.

--- Additional comment from dietmar on 2010-02-26 03:41:46 EST ---

Works - no more ghost members.

But how can I handle CPG_ERR_EXIST correctly?
Simply calling join again seems to work:

	while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
		result == CPG_ERR_EXIST ) { 
		printf("cpg_join returned %d\n", result);
		sleep (1);
	}

or is there a better way?

--- Additional comment from jfriesse on 2010-02-26 03:50:11 EST ---

(In reply to comment #2)
> works - no more ghost members.
> 
> But how can i handle CPG_ERR_EXIST correctly? 
> Simply call join again seems to work:
> 
>  while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
>   result == CPG_ERR_EXIST ) { 
>   printf("cpg_join returned %d\n", result);
>   sleep (1);
>  }
> 
> or is there a better way?    

Hi,
thanks for the very good news.

About handling: from my point of view, returning CPG_ERR_EXIST is not the best way. I will rework the patch to return CS_ERR_TRY_AGAIN, because this is exactly what we need to return in such situations.

--- Additional comment from jfriesse on 2010-02-26 04:13:27 EST ---

Created an attachment (id=396497)
Proposed patch - returns err_try_again

A better version of the patch, which returns CPG_ERR_TRY_AGAIN rather than CPG_ERR_EXIST.

Comment 1 Jan Friesse 2010-02-26 09:18:42 UTC
Created attachment 396500 [details]
Backport of patch and SVN#2364

Backported version of the patch. It also includes SVN#2364, because that change must be applied for the new patch to work correctly.

Comment 2 Jan Friesse 2010-02-26 09:20:57 UTC
Created attachment 396501 [details]
Test case backported for whitetank

Comment 3 Steven Dake 2010-02-26 17:04:41 UTC
Honza,

Please keep each commit from trunk as a separate backport and post it to the ml rather than merging both.

Regards
-steve

Comment 4 Jan Friesse 2010-03-01 17:14:31 UTC
Created attachment 397130 [details]
Original patch split - part 1

Allow only one connection per node pid.

Comment 5 Jan Friesse 2010-03-01 17:16:29 UTC
Created attachment 397132 [details]
Original patch split - part 2

Cpg join with undelivered leave message.

Comment 9 Douglas Silas 2011-01-11 23:21:07 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In rare circumstances, an invalid CPG member was delivered in a configuration change callback.

Comment 11 errata-xmlrpc 2011-01-13 23:56:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0100.html

