Bug 302341 - token lost in commit state results in rejection of necessary join messages
Summary: token lost in commit state results in rejection of necessary join messages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Steven Dake
QA Contact:
URL:
Whiteboard:
Duplicates: 246291
Depends On:
Blocks:
Reported: 2007-09-23 19:51 UTC by Steven Dake
Modified: 2018-10-19 19:51 UTC (History)
CC List: 5 users

Fixed In Version: RHBA-2007-0599
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 17:00:16 UTC
Target Upstream Version:
Embargoed:


Attachments
initializes my_new_memb in entrance to commit state instead of recovery state. (1.02 KB, patch)
2007-09-23 20:06 UTC, Steven Dake


Links
System: Red Hat Product Errata
ID: RHBA-2007:0599
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: openais bug fix update
Last Updated: 2007-10-30 15:16:59 UTC

Description Steven Dake 2007-09-23 19:51:56 UTC
Description of problem:
When a token is lost in the COMMIT state, the processor in the commit state
could incorrectly reject join messages.  This results in a continuous loop of
singleton nodes forming, without a proper membership ever forming.

Version-Release number of selected component (if applicable):
80.3-4

How reproducible:
Reproduces with the mp5 test suite 1 time in 15k iterations.  Also believed to
reproduce easily with QE revolver.

Steps to Reproduce:
1. run current mp5 and wait for a failure, or run revolver
2. analyze the failure to determine whether it has this cause
3. the failure can be detected by a series of single node
configurations forming over and over at every token timeout.
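
The detection step above can be sketched as a small check over the sequence
of configurations a node reports.  This is only an illustration: the function
name, the input shape (a list of member lists, one per configuration) and the
min_repeats threshold are assumptions, not part of mp5, revolver or openais.

```python
def looks_like_singleton_loop(configurations, min_repeats=3):
    """Return True if min_repeats consecutive configurations have one member.

    configurations: hypothetical list of member lists, oldest first.
    min_repeats: hypothetical threshold for "over and over".
    """
    run = 0
    for members in configurations:
        # Extend the run on a single node configuration, otherwise reset it.
        run = run + 1 if len(members) == 1 else 0
        if run >= min_repeats:
            return True
    return False
```

A single singleton configuration between two full memberships does not trip
the check; only an unbroken run of them does.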

  
Actual results:
mp5 or revolver fails

Expected results:
mp5 or revolver should not fail

Additional info:
The totem specification is in error in this case.  There are four states which,
in order of time, go GATHER, COMMIT, RECOVERY, OPERATIONAL.  The protocol sets a
variable new_memb_list on entry to the recovery state.  This variable is used
to determine if messages should be rejected in the COMMIT state.  Since the
variable is used before it is set, it is invalid.  This usually doesn't cause a
problem because the commit token loss case is very difficult to
reproduce.  Once the commit token is lost, the my_new_memb variable usually
contains enough membership information for the protocol to work correctly.
Sometimes, however, it does not contain the processors from which it should
accept join messages.
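
The ordering problem described above can be illustrated with a toy model.
This is a sketch under stated assumptions, not the openais code or the
attached patch: the Processor class, its method names, the init_in_commit
flag and the stale starting value of my_new_memb are all hypothetical, and
the join check is reduced to a plain membership test.

```python
from enum import Enum, auto


class State(Enum):
    GATHER = auto()
    COMMIT = auto()
    RECOVERY = auto()
    OPERATIONAL = auto()


class Processor:
    """Toy model of one processor; not the openais implementation."""

    def __init__(self, node_id, init_in_commit):
        self.node_id = node_id
        # True models setting my_new_memb on entry to COMMIT (the fix);
        # False models setting it on entry to RECOVERY (the spec).
        self.init_in_commit = init_in_commit
        self.state = State.GATHER
        # Assumption: stale value left over from an earlier singleton ring.
        self.my_new_memb = {node_id}
        self.agreed_memb = set()

    def enter_commit(self, agreed_memb):
        self.state = State.COMMIT
        self.agreed_memb = set(agreed_memb)
        if self.init_in_commit:
            self.my_new_memb = set(agreed_memb)

    def enter_recovery(self):
        self.state = State.RECOVERY
        if not self.init_in_commit:
            self.my_new_memb = set(self.agreed_memb)

    def accepts_join(self, sender):
        # Only the COMMIT-state check is modeled in this sketch.
        if self.state is State.COMMIT:
            return sender in self.my_new_memb
        return True


def join_accepted_after_token_loss(init_in_commit):
    p = Processor(node_id=1, init_in_commit=init_in_commit)
    p.enter_commit({1, 2, 3})
    # The commit token is lost here, so RECOVERY is never entered.
    return p.accepts_join(2)
```

With init_in_commit=False the processor is still holding the stale set when
the join from node 2 arrives and rejects it; with init_in_commit=True the set
already reflects the agreed membership and the join is accepted.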

Test case is at iteration 100k+ via mp5 without a failure of the membership protocol.

patch attached to bugzilla.

Comment 1 Steven Dake 2007-09-23 20:06:40 UTC
Created attachment 203561
initializes my_new_memb in entrance to commit state instead of recovery state.

Comment 2 Kiersten (Kerri) Anderson 2007-09-24 15:12:06 UTC
Devel ACK, since QE is hitting this frequently in testing at this point.
A patch is available and we can spin the package to include it in RC1.

Comment 6 errata-xmlrpc 2007-11-07 17:00:16 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0599.html


Comment 7 Steven Dake 2008-01-02 23:40:04 UTC
*** Bug 246291 has been marked as a duplicate of this bug. ***

