Bug 302341 - token lost in commit state results in rejection of necessary join messages
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.1
Hardware: All Linux
Priority: high
Severity: high
Assigned To: Steven Dake
Duplicates: 246291
Reported: 2007-09-23 15:51 EDT by Steven Dake
Modified: 2016-04-26 09:42 EDT (History)
Fixed In Version: RHBA-2007-0599
Doc Type: Bug Fix
Last Closed: 2007-11-07 12:00:16 EST
Attachments:
initializes my_new_memb in entrance to commit state instead of recovery state. (1.02 KB, patch)
2007-09-23 16:06 EDT, Steven Dake
Description Steven Dake 2007-09-23 15:51:56 EDT
Description of problem:
When a token is lost in the COMMIT state, the processor in the COMMIT state
can incorrectly reject join messages.  This results in a continuous loop of
singleton memberships forming, without a proper membership ever being formed.

Version-Release number of selected component (if applicable):
80.3-4

How reproducible:
Reproduces with the mp5 test suite about 1 time in 15k iterations.  Also
believed to reproduce easily with QE revolver.

Steps to Reproduce:
1. Run the current mp5 and wait for a failure, or run revolver.
2. Analyze the failure to determine whether this bug is the cause.
3. The failure can be recognized by a series of single-node configurations
forming over and over at every token timeout.

  
Actual results:
mp5 or revolver fails

Expected results:
mp5 or revolver should not fail

Additional info:
The totem specification is in error in this case.  There are four states which,
in order of time, go GATHER, COMMIT, RECOVERY, OPERATIONAL.  The protocol sets
the variable my_new_memb on entry to the RECOVERY state, but this variable is
used to determine whether messages should be rejected in the COMMIT state.
Since the variable is used before it is set, it is invalid.  This usually
doesn't cause a problem because the commit token loss case is very difficult to
reproduce.  Even once the commit token is lost, the my_new_memb variable usually
contains enough membership information for the protocol to work correctly.
Sometimes, however, it does not contain the processors from which it should
accept join messages.
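
The ordering problem can be illustrated with a minimal sketch.  This is not the
openais source or the attached patch: the struct, the function names, and the
simplified acceptance rule (accept a join only from a processor listed in
my_new_memb) are assumptions made for illustration.  It only shows the idea of
setting my_new_memb on entry to COMMIT so that it is valid when joins are
checked in that state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROCS 16 /* hypothetical upper bound on processors */

enum memb_state { STATE_GATHER, STATE_COMMIT, STATE_RECOVERY, STATE_OPERATIONAL };

/* Hypothetical per-processor protocol state, reduced to what matters here. */
struct processor {
    enum memb_state state;
    unsigned int my_new_memb[MAX_PROCS];
    size_t my_new_memb_entries;
};

/* Record the proposed membership in my_new_memb. */
static void set_new_memb(struct processor *p, const unsigned int *ids, size_t n)
{
    assert(n <= MAX_PROCS);
    memcpy(p->my_new_memb, ids, n * sizeof(ids[0]));
    p->my_new_memb_entries = n;
}

/* Patched ordering: my_new_memb is set on entry to COMMIT, not RECOVERY. */
static void enter_commit(struct processor *p, const unsigned int *ids, size_t n)
{
    set_new_memb(p, ids, n);
    p->state = STATE_COMMIT;
}

/* Simplified rule (assumption): in COMMIT, accept a join only if the
 * sender is listed in my_new_memb. */
static bool accept_join_in_commit(const struct processor *p, unsigned int sender)
{
    for (size_t i = 0; i < p->my_new_memb_entries; i++) {
        if (p->my_new_memb[i] == sender) {
            return true;
        }
    }
    return false;
}
```

With the original ordering, my_new_memb would still hold whatever a previous
membership round left behind when accept_join_in_commit() runs, which is how a
join from a processor that belongs in the new membership could be rejected.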

With the patch applied, the mp5 test case is on iteration 100k+ without a
failure of the membership protocol.

patch attached to bugzilla.
Comment 1 Steven Dake 2007-09-23 16:06:40 EDT
Created attachment 203561
initializes my_new_memb in entrance to commit state instead of recovery state.
Comment 2 Kiersten (Kerri) Anderson 2007-09-24 11:12:06 EDT
Devel ACK, since QE is hitting this frequently in testing at this point.
A patch is available and we can spin the package to include in RC1.
Comment 6 errata-xmlrpc 2007-11-07 12:00:16 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0599.html
Comment 7 Steven Dake 2008-01-02 18:40:04 EST
*** Bug 246291 has been marked as a duplicate of this bug. ***
