Bug 464020 - received flag not set properly in commit token results in lost messages.
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.2
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
QA Contact: Cluster QE
all messages are not recovered proper...
Depends On:
Blocks:
 
Reported: 2008-09-25 17:48 EDT by Steven Dake
Modified: 2016-04-26 10:10 EDT
3 users

See Also:
Fixed In Version: 5.3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 15:40:47 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments

  None
Description Steven Dake 2008-09-25 17:48:09 EDT
Description of problem:
If a commit token is created such as:

node=1 aru=1 ringid=4
node=2 aru=1ef ringid=8
node=3 aru=1fb ringid=8
node=4 aru=1fe ringid=8

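What should happen is that node 4 resends all messages from the lowest aru (1ef) to the highest aru (1fe) among the nodes that came from ring 8. Node 1 came from a different previous ring (ringid=4), which is why its aru of 1 is not counted. This retransmission is coordinated through a received flag in the commit token. Today this received flag is not always set properly.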

What happens now is that node 2 is not delivered messages 1fa-1fe, and node 2 is not delivered messages 1fb-1fe. This results in message loss and possible corruption of information multicast through services such as CPG or EVS.

Version-Release number of selected component (if applicable):
openais-0.80.3-19.el5

How reproducible:
More reproducible with a larger cluster, but confirming it requires manual inspection of the commit tokens. Reproduction requires two things: every node must be sending traffic, and there must be at least 4 nodes, with 1 node being killed and restarted.

Could result in a segfault, but I'm not certain about this. Fixing this bug does not fix the checkpoint bug.
Definitely violates extended virtual synchrony (EVS).

Actual results:
Messages are lost.

Expected results:
Messages should not be lost.

Additional info:

A patch to fix the problem is in hand. It has passed Andrew Beekhof's CRM test suite, which verifies that messages are sent correctly over 500 iterations, including node kills and restarts.
