Red Hat Bugzilla – Bug 302341
token lost in commit state results in rejection of necessary join messages
Last modified: 2016-04-26 09:42:34 EDT
Description of problem:
When a token is lost in the COMMIT state, the processor in the commit state
can incorrectly reject join messages. This results in a continuous loop of
singleton nodes forming, without a proper membership ever forming.
Version-Release number of selected component (if applicable):
Reproduces with the mp5 test suite about 1 time in 15k iterations. Also
believed to reproduce easily with QE revolver.
Steps to Reproduce:
1. run current mp5
2. wait for failure
4. run revolver
5. analyze the failure to determine whether this bug is the cause
6. the failure can be recognized by single-node configurations forming
over and over, one at every token timeout.
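The check in step 6 could be automated roughly as follows. This is only a sketch: the function name, the input shape (a list of membership lists, one per configuration change seen by a single node) and the threshold of three repeats are all assumptions made for illustration, not something mp5 or revolver provides.

```python
def looks_like_singleton_loop(configs, min_repeats=3):
    """Return True if `configs` contains a run of single-node configurations.

    `configs` is a hypothetical list of membership lists, in the order one
    node saw them. `min_repeats` is an arbitrary threshold chosen for
    illustration.
    """
    run = 0
    for members in configs:
        # Extend the run on a singleton configuration, otherwise reset it.
        run = run + 1 if len(members) == 1 else 0
        if run >= min_repeats:
            return True
    return False


# A node that keeps forming a ring with only itself.
print(looks_like_singleton_loop([[1, 2, 3], [1], [1], [1]]))
# A single singleton configuration followed by a proper membership.
print(looks_like_singleton_loop([[1], [1, 2, 3]]))
```

A healthy cluster may still show an occasional singleton configuration during a partition, so only a sustained run is treated as this failure.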
Actual results:
mp5 or revolver fails
Expected results:
mp5 or revolver should not fail
The Totem specification is in error in this case. There are four states which,
in order of time, go GATHER, COMMIT, RECOVERY, OPERATIONAL. The protocol sets a
variable new_memb_list on entrance to the recovery state. This variable is used
to determine if messages should be rejected in the COMMIT state. Since the
variable is used before it is set, its contents are invalid at that point. This
usually doesn't cause a problem because the commit token loss case is very
difficult to reproduce. Once the commit token is lost, the my_new_memb variable
usually contains enough membership information for the protocol to work
correctly. Sometimes, however, it does not contain the processors from which it
should accept join messages.
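To make the ordering problem concrete, here is a toy model of it. It is a sketch under stated assumptions, not the openais code: the Processor class, its method names and the init_on_commit flag are invented for illustration, and the join check is reduced to a simple membership test on my_new_memb.

```python
from enum import Enum


class State(Enum):
    GATHER = 1
    COMMIT = 2
    RECOVERY = 3
    OPERATIONAL = 4


class Processor:
    """Toy processor; every name here is hypothetical."""

    def __init__(self, node_id, init_on_commit):
        self.node_id = node_id
        self.init_on_commit = init_on_commit
        self.state = State.GATHER
        # Stale value left over from before this membership round.
        self.my_new_memb = {node_id}

    def enter_commit(self, token_members):
        self.state = State.COMMIT
        if self.init_on_commit:
            # Patched behavior: set the variable before COMMIT uses it.
            self.my_new_memb = set(token_members)

    def enter_recovery(self, token_members):
        self.state = State.RECOVERY
        if not self.init_on_commit:
            # Behavior as specified: too late for the COMMIT-state check.
            self.my_new_memb = set(token_members)

    def accepts_join(self, sender):
        # Simplified assumption: in COMMIT, only joins from processors in
        # my_new_memb are accepted.
        if self.state is State.COMMIT:
            return sender in self.my_new_memb
        return True


members = {1, 2, 3}

as_specified = Processor(1, init_on_commit=False)
as_specified.enter_commit(members)
print(as_specified.accepts_join(2))  # join from node 2 is rejected

patched = Processor(1, init_on_commit=True)
patched.enter_commit(members)
print(patched.accepts_join(2))  # join from node 2 is accepted
```

In the first case the processor sits in COMMIT with a stale my_new_memb, so if the commit token is lost it rejects the very join messages it needs to form a new membership.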
With the patch, the mp5 test case has run 100k+ iterations without a failure of
the membership protocol.
Patch attached to bugzilla.
Created attachment 203561 [details]
Initializes my_new_memb on entrance to the commit state instead of the recovery state.
Devel ACK, since QE is hitting this frequently in testing at this point.
A patch is available and we can spin the package for inclusion in RC1.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
*** Bug 246291 has been marked as a duplicate of this bug. ***