568650 – stale CPG members in confchg callback

Bug 568650 - stale CPG members in confchg callback

Summary: stale CPG members in confchg callback

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	openais
Sub Component:
Version:	5.6
Hardware:	All
OS:	All
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:	568356 569525
Blocks:	568510

Reported:	2010-02-26 09:14 UTC by Jan Friesse
Modified:	2011-01-13 23:56 UTC
CC List:	7 users
Fixed In Version:	openais-0.80.6-28.el5
Doc Type:	Bug Fix
Doc Text:	In rare circumstances, an invalid CPG member was delivered in a configuration change callback.
Clone Of:	568356
Environment:
Last Closed:	2011-01-13 23:56:09 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Backport of patch and SVN#2364 (3.02 KB, patch) 2010-02-26 09:18 UTC, Jan Friesse
Test case backported for whitetank (3.88 KB, text/x-csrc) 2010-02-26 09:20 UTC, Jan Friesse
Original patch split - part 1 (2.15 KB, patch) 2010-03-01 17:14 UTC, Jan Friesse
Original patch split - part 2 (1.60 KB, application/octet-stream) 2010-03-01 17:16 UTC, Jan Friesse

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0100	0	normal	SHIPPED_LIVE	openais bug fix update	2011-01-12 17:21:13 UTC

Description Jan Friesse 2010-02-26 09:14:45 UTC

+++ This bug was initially created as a clone of Bug #568356 +++

Created an attachment (id=396283)
Test case

Description of problem:
Inside my CPG application, the confchg callback is called with 'dead'
members:

[debug] cpg member node 3 pid 1132
[debug] cpg member node 3 pid 14640

For example, process 1132 no longer exists on node 3.


Version-Release number of selected component (if applicable):
TRUNK

How reproducible:
A reliable reproducer is attached.

Steps to Reproduce:
1. gcc -Wall cpgtest.c $(pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
2. Run cpgtest and keep it running.
  
Actual results:
# cpgtest
...
starting cpgtest
calling cpg_initialize
calling cpg_join
starting main loop (hangs here)

Expected results:
cpgtest never hangs.

Additional info:
Taken from the OpenAIS mailing list.

--- Additional comment from jfriesse on 2010-02-25 09:53:26 EST ---

Created an attachment (id=396287)
Proposed patch

Cpg join with undelivered leave message

The patch handles the situation in which one process on one node:
- joins a cpg group
- performs some actions
- leaves the group
- joins the group again

Because of a race, this sequence can end with a broken process_info list.

To solve this problem, one more check is done in
message_handler_req_lib_cpg_join: if a process_info entry with the same pid
and group as the new join request exists, CPG_ERR_EXIST is returned.
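
The check described above can be sketched in isolation. This is not the actual patch: proc_entry, find_entry, check_join and the JOIN_* codes are hypothetical stand-ins for the real process_info list and error codes, and the list is assumed to be a plain singly linked list.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical result codes; JOIN_REJECTED stands in for the error
 * the real service would return to the library. */
enum join_result { JOIN_OK = 0, JOIN_REJECTED = 1 };

/* Hypothetical, simplified stand-in for one process_info entry. */
struct proc_entry {
    unsigned int nodeid;
    pid_t pid;
    char group[128];
    struct proc_entry *next;
};

/* Return the entry matching (nodeid, pid, group), or NULL if there is none. */
struct proc_entry *find_entry(struct proc_entry *head, unsigned int nodeid,
                              pid_t pid, const char *group)
{
    for (struct proc_entry *e = head; e != NULL; e = e->next) {
        if (e->nodeid == nodeid && e->pid == pid &&
            strcmp(e->group, group) == 0) {
            return e;
        }
    }
    return NULL;
}

/*
 * Reject a join while an entry for the same pid and group still exists,
 * i.e. while the earlier leave has not been processed yet.
 */
enum join_result check_join(struct proc_entry *head, unsigned int nodeid,
                            pid_t pid, const char *group)
{
    return find_entry(head, nodeid, pid, group) != NULL ? JOIN_REJECTED
                                                        : JOIN_OK;
}
```

With a list that still holds an entry for node 3, pid 1132, a new join from that pid for the same group is rejected, while a join from another pid or for another group is accepted.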

--- Additional comment from dietmar on 2010-02-26 03:41:46 EST ---

Works - no more ghost members.

But how can I handle CPG_ERR_EXIST correctly?
Simply calling join again seems to work:

	while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
		result == CPG_ERR_EXIST ) { 
		printf("cpg_join returned %d\n", result);
		sleep (1);
	}

or is there a better way?

--- Additional comment from jfriesse on 2010-02-26 03:50:11 EST ---

(In reply to comment #2)
> Works - no more ghost members.
> 
> But how can I handle CPG_ERR_EXIST correctly?
> Simply calling join again seems to work:
> 
>  while ((result = cpg_join(handle, &group_name)) == CS_ERR_TRY_AGAIN ||
>   result == CPG_ERR_EXIST ) { 
>   printf("cpg_join returned %d\n", result);
>   sleep (1);
>  }
> 
> or is there a better way?    

Hi,
thanks for the very good news.

About the handling: from my point of view, returning CPG_ERR_EXIST is not the best way. I will rework the patch to return CS_ERR_TRY_AGAIN, because this is exactly what we need to return in such situations.
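
A caller that retries on a try-again result may want to bound the loop so it cannot spin forever. The following is a minimal sketch under stated assumptions: join_with_retry, join_fn, fake_join and the DEMO_* codes are hypothetical and not part of libcpg; real code would wrap cpg_join and compare against CS_ERR_TRY_AGAIN.

```c
#include <unistd.h>

/* Hypothetical result codes; real code would use cs_error_t values. */
enum demo_result { DEMO_OK = 0, DEMO_TRY_AGAIN = 1, DEMO_FAILED = 2 };

/* Hypothetical callback type wrapping the join call to be retried. */
typedef enum demo_result (*join_fn)(void *ctx);

/*
 * Call join() until it stops reporting "try again" or max_attempts is
 * reached, sleeping delay_sec seconds between attempts. Returns the
 * result of the last attempt.
 */
enum demo_result join_with_retry(join_fn join, void *ctx,
                                 unsigned int max_attempts,
                                 unsigned int delay_sec)
{
    enum demo_result result = DEMO_TRY_AGAIN;

    for (unsigned int attempt = 0; attempt < max_attempts; attempt++) {
        result = join(ctx);
        if (result != DEMO_TRY_AGAIN) {
            break;              /* success or a permanent error */
        }
        if (attempt + 1 < max_attempts) {
            sleep(delay_sec);   /* wait before the next attempt */
        }
    }
    return result;
}

/* Demo stand-in: reports "try again" until the counter in ctx reaches 0. */
enum demo_result fake_join(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return DEMO_TRY_AGAIN;
    }
    return DEMO_OK;
}
```

If the attempts run out, the caller still sees the try-again result and can decide whether to give up or report an error, rather than blocking indefinitely.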

--- Additional comment from jfriesse on 2010-02-26 04:13:27 EST ---

Created an attachment (id=396497)
Proposed patch - returns err_try_again

Better version of the patch, which returns CPG_ERR_TRY_AGAIN rather than CPG_ERR_EXIST.

Comment 1 Jan Friesse 2010-02-26 09:18:42 UTC

Created attachment 396500 [details]
Backport of patch and SVN#2364

Backported version of the patch. It also includes SVN#2364, because that change must be applied for the new patch to work correctly.

Comment 2 Jan Friesse 2010-02-26 09:20:57 UTC

Created attachment 396501 [details]
Test case backported for whitetank

Comment 3 Steven Dake 2010-02-26 17:04:41 UTC

Honza,

Please keep each commit from trunk as a separate backport and post them to the ml rather than merging both.

Regards
-steve

Comment 4 Jan Friesse 2010-03-01 17:14:31 UTC

Created attachment 397130 [details]
Original patch split - part 1

Allow only one connection per node pid.

Comment 5 Jan Friesse 2010-03-01 17:16:29 UTC

Created attachment 397132 [details]
Original patch split - part 2

Cpg join with undelivered leave message.

Comment 9 Douglas Silas 2011-01-11 23:21:07 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In rare circumstances, an invalid CPG member was delivered in a configuration change callback.

Comment 11 errata-xmlrpc 2011-01-13 23:56:09 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0100.html
