Bug 729081

Summary: openais crashes with combo of lossy network and config changes
Product: Red Hat Enterprise Linux 5
Reporter: Steven Dake <sdake>
Component: openais
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.8
CC: cluster-maint, djansa, edamato, jkortus, jruemker, jwest, msvoboda
Target Milestone: rc
Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: openais-0.80.6-34.el5
Doc Type: Bug Fix
Doc Text:
Previously, when OpenAIS was used in a lossy network, and a large number of configuration changes occurred, OpenAIS sometimes terminated unexpectedly. To solve this problem, the underlying source code has been modified, and OpenAIS no longer crashes in the scenario described.
Story Points: ---
Clone Of: 722522
Environment:
Last Closed: 2012-02-21 05:22:01 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 722522    
Bug Blocks: 727960, 727962, 731457, 731458, 731460    
Attachments:
Description                                                            Flags
Backported patch from Corosync                                         none
2011-09-27-0001-Deliver-all-messages-from-my_high_seq_recieved-to-th   none

Comment 2 Jan Friesse 2011-08-17 14:39:50 UTC
Created attachment 518697 [details]
Backported patch from Corosync

Backport of Corosync b8a061ae28e7c874b66fa1d35ab01f53d1d36b42

Comment 8 Jan Friesse 2011-09-20 11:58:07 UTC
Waiting for resolution of https://bugzilla.redhat.com/show_bug.cgi?id=722522

Comment 9 Jan Friesse 2011-09-27 09:02:03 UTC
Created attachment 525057 [details]
2011-09-27-0001-Deliver-all-messages-from-my_high_seq_recieved-to-th


Deliver all messages from my_high_seq_recieved to the last gap

Backport of corosync 2ec4ddb039b310b308a8748c88332155afd62608

This patch passes two test cases:

-------
Test #1
-------
Two node cluster - run cpgbench on each node

modify totemsrp with following defines:
Two test cases:

-------
Test #2
-------
5 node cluster

start 5 nodes randomly at about same time, start 5 nodes randomly at about
same time, wait 10 seconds and attempt to send a message.  If the message blocks
on "TRY_AGAIN", a message loss has likely occurred.  Wait a few minutes without
cycling the nodes and see if the TRY_AGAIN state becomes unblocked.

If it doesn't, the test case has failed.
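
The pass/fail check in Test #2 could be sketched roughly as below. This is only an illustration, not part of the patch: `check_unblocks`, the `send` callable, the string status values, and the 180-second default for "a few minutes" are all assumptions made here; a real test would call the cluster messaging API in place of `send`.

```python
import time

# Stand-in status values (assumptions); a real client would use the
# status codes returned by the messaging library.
TRY_AGAIN = "TRY_AGAIN"
OK = "OK"


def check_unblocks(send, timeout_s=180.0, interval_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Return True if send() stops reporting TRY_AGAIN within timeout_s.

    send is a hypothetical callable that attempts to send one message
    and returns a status value. Returning False corresponds to the
    "test case has failed" outcome described above.
    """
    deadline = clock() + timeout_s
    while True:
        if send() != TRY_AGAIN:
            return True   # the blocked state cleared
        if clock() >= deadline:
            return False  # still blocked after the waiting period
        sleep(interval_s)
```

In a run of the test, the nodes would be started first, the script would wait 10 seconds, and then `check_unblocks` would be called with a `send` that wraps the actual message send.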

Signed-off-by: Steven Dake <sdake>
Reviewed-by: Jan Friesse <jfriesse>

Comment 12 Miroslav Svoboda 2011-11-03 18:01:28 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, when OpenAIS was used in a lossy network, and a large number of configuration changes occurred, OpenAIS sometimes terminated unexpectedly. To solve this problem, the underlying source code has been modified, and OpenAIS no longer crashes in the scenario described.

Comment 14 errata-xmlrpc 2012-02-21 05:22:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0180.html

Comment 15 Jan Friesse 2012-05-07 15:16:20 UTC
*** Bug 818644 has been marked as a duplicate of this bug. ***