Bug 464020 - received flag not set properly in commit token results in lost messages.
Summary: received flag not set properly in commit token results in lost messages.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.2
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard: all messages are not recovered proper...
Depends On:
Blocks:
Reported: 2008-09-25 21:48 UTC by Steven Dake
Modified: 2016-04-26 14:10 UTC (History)

Fixed In Version: 5.3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:40:47 UTC
Target Upstream Version:
Embargoed:



Description Steven Dake 2008-09-25 21:48:09 UTC
Description of problem:
If a commit token is created such as:

node=1 aru=1 ringid=4
node=2 aru=1ef ringid=8
node=3 aru=1fb ringid=8
node=4 aru=1fe ringid=8

What should happen is that node 4 resends all messages from the lowest aru (1ef) to the highest aru (1fe).  It does this through the setting of a received flag in the commit token.  Today this received flag is not always set properly.

What happens now is that node 2 is not delivered messages 1fa-1fe and node 3 is not delivered messages 1fb-1fe.  This results in message loss and possible corruption of information multicast through services such as CPG or EVS.
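
To make the expected behaviour concrete, here is a minimal sketch (not the openais implementation) of how the resend range could be derived from the commit token shown above. The names CommitEntry and retransmit_range are hypothetical. The sketch assumes that only members coming from the same previous ring (ringid=8) take part in recovery, which is why node 1 (ringid=4) does not lower the range. It does not model the received flag itself.

```python
from dataclasses import dataclass


@dataclass
class CommitEntry:
    """One member entry of a commit token (hypothetical layout)."""
    node: int
    aru: int      # aru value the member reported in the token
    ring_id: int  # ring the member belonged to before the change


def retransmit_range(entries, ring_id):
    """Return (low_aru, high_aru, resender_node) for members of ring_id.

    Assumption: the member holding the highest aru is the one that
    resends, as node 4 does in the example from this report.
    """
    members = [e for e in entries if e.ring_id == ring_id]
    if not members:
        raise ValueError("no members from ring %x" % ring_id)
    low = min(e.aru for e in members)
    top = max(members, key=lambda e: e.aru)
    return low, top.aru, top.node


# The commit token from the description (aru and ringid are hex).
token = [
    CommitEntry(node=1, aru=0x1, ring_id=0x4),
    CommitEntry(node=2, aru=0x1EF, ring_id=0x8),
    CommitEntry(node=3, aru=0x1FB, ring_id=0x8),
    CommitEntry(node=4, aru=0x1FE, ring_id=0x8),
]

low, high, resender = retransmit_range(token, ring_id=0x8)
print("node %d should resend %x..%x" % (resender, low, high))
```

With the token above this selects node 4 and the range 1ef to 1fe, matching the expected behaviour described in this report.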

Version-Release number of selected component (if applicable):
openais-0.80.3-19.el5

How reproducible:
More reproducible with a larger cluster, but confirming it requires manual inspection of the commit tokens.  The keys to reproduction are that every node must be sending traffic and there must be at least 4 nodes, with 1 node being killed/restarted.

Could result in a segfault, but I'm not certain about this.  Does not fix the checkpoint bug.
Definitely violates EVS.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
messages are lost.

Expected results:
messages should not be lost.

Additional info:

A patch to fix the problem is in hand and has passed Andrew Beekhof's crm testing suite, which verifies that messages are correctly sent for 500 iterations including node kills/restarts.

