Bug 302341

Summary: token lost in commit state results in rejection of necessary join messages
Product: Red Hat Enterprise Linux 5
Component: openais
Version: 5.1
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Steven Dake <sdake>
Assignee: Steven Dake <sdake>
CC: cluster-maint, kanderso, nstraz, rkenna, tao
Hardware: All
OS: Linux
Fixed In Version: RHBA-2007-0599
Doc Type: Bug Fix
Last Closed: 2007-11-07 17:00:16 UTC
Attachments:
Description: initializes my_new_memb in entrance to commit state instead of recovery state.
Flags: none

Description Steven Dake 2007-09-23 19:51:56 UTC
Description of problem:
When a token is lost in the COMMIT state, the processor in the commit state
could incorrectly reject join messages.  This results in a continuous loop of
singleton nodes forming, without a proper membership ever being formed.

Version-Release number of selected component (if applicable):
80.3-4

How reproducible:
Reproduces with the mp5 test suite 1 time in 15k iterations.  Also believed to
reproduce easily with QE revolver.

Steps to Reproduce:
1. Run the current mp5 and wait for a failure, or run revolver.
2. Analyze the failure to determine whether this bug is the cause.
3. The failure can be detected if there are a bunch of single-node
configurations forming over and over at every token timeout.
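
The detection rule in the last step can be sketched as a small check. This is
only an illustration: it assumes the membership of each configuration change
has already been extracted from one node's logs into a list of sets, and the
function name and the threshold of three repeats are hypothetical, not part of
mp5 or revolver.

```python
def looks_like_singleton_loop(memberships, min_repeats=3):
    """Return True if min_repeats consecutive configurations are singletons.

    memberships: iterable of sets of node identifiers, one per configuration
    change, in the order the changes were observed (assumed input format).
    """
    run = 0
    for members in memberships:
        # Extend the run on a single-node configuration, reset otherwise.
        run = run + 1 if len(members) == 1 else 0
        if run >= min_repeats:
            return True
    return False


# A node that keeps forming a ring containing only itself.
print(looks_like_singleton_loop([{"n1"}, {"n1"}, {"n1"}, {"n1"}]))
# A node that briefly stood alone and then rejoined the others.
print(looks_like_singleton_loop([{"n1"}, {"n1", "n2", "n3"}, {"n1"}]))
```

A single singleton configuration is normal during a membership change, so the
sketch only flags a sustained run of them.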

  
Actual results:
mp5 or revolver fails

Expected results:
mp5 or revolver should not fail

Additional info:
The totem specification is in error in this case.  There are four states which,
in order of time, go GATHER, COMMIT, RECOVERY, OPERATIONAL.  The protocol sets a
variable new_memb_list on entry to the recovery state.  This variable is used
to determine if messages should be rejected in the COMMIT state.  Since the
variable is used before it is set, its contents are invalid at that point.  This
usually doesn't cause a problem because the commit token loss case is very
difficult to reproduce.  Once the commit token is lost, the my_new_memb variable
usually contains enough membership information for the protocol to work
correctly.  Sometimes, however, it does not contain the processors from which it
should indeed accept join messages.
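
The ordering problem described above can be illustrated with a toy model. This
is a minimal sketch, not the openais code and not the attached patch: the
Processor class, the init_on_commit_entry switch, the node names and the
simplified acceptance rule (a join is accepted in the COMMIT state only if its
sender is in my_new_memb) are all assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    GATHER = auto()
    COMMIT = auto()
    RECOVERY = auto()
    OPERATIONAL = auto()


class Processor:
    """Toy model of one processor (hypothetical, heavily simplified)."""

    def __init__(self, init_on_commit_entry):
        self.init_on_commit_entry = init_on_commit_entry
        self.state = State.GATHER
        self.pending = set()      # membership agreed on during GATHER
        self.my_new_memb = set()  # may hold stale data from an older ring

    def enter_commit(self, agreed_members):
        self.state = State.COMMIT
        self.pending = set(agreed_members)
        if self.init_on_commit_entry:
            # Ordering from the attachment title: set on entry to COMMIT,
            # before the COMMIT-state check below can read it.
            self.my_new_memb = set(agreed_members)

    def enter_recovery(self):
        self.state = State.RECOVERY
        if not self.init_on_commit_entry:
            # Ordering described in the report: set only on entry to RECOVERY.
            self.my_new_memb = set(self.pending)

    def accepts_join(self, sender):
        # Assumed rule: in COMMIT, reject joins from unknown senders.
        if self.state is State.COMMIT:
            return sender in self.my_new_memb
        return True


def join_accepted_after_token_loss(init_on_commit_entry):
    p = Processor(init_on_commit_entry)
    p.my_new_memb = {"A"}            # stale contents: only this node
    p.enter_commit({"A", "B", "C"})
    # The commit token is lost here, so enter_recovery() is never called.
    return p.accepts_join("B")


print(join_accepted_after_token_loss(False))  # False: needed join rejected
print(join_accepted_after_token_loss(True))   # True: join accepted
```

In this model the stale my_new_memb happens to exclude node B, which is the
unlucky case the report describes; when the stale contents already include the
needed processors, both orderings behave the same.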

Test case is at iteration 100k+ without a failure of the membership protocol via mp5.

patch attached to bugzilla.

Comment 1 Steven Dake 2007-09-23 20:06:40 UTC
Created attachment 203561 [details]
initializes my_new_memb in entrance to commit state instead of recovery state.

Comment 2 Kiersten (Kerri) Anderson 2007-09-24 15:12:06 UTC
Devel ACK, since QE is hitting this frequently in testing at this point.
A patch is available and we can spin the package to include in RC1.

Comment 6 errata-xmlrpc 2007-11-07 17:00:16 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0599.html


Comment 7 Steven Dake 2008-01-02 23:40:04 UTC
*** Bug 246291 has been marked as a duplicate of this bug. ***