Description of problem:
When a token is lost in the COMMIT state, the processor in the COMMIT state could incorrectly reject join messages. This results in a continuous loop of singleton nodes forming without a proper membership ever forming.

Version-Release number of selected component (if applicable):
80.3-4

How reproducible:
Reproduces with the mp5 test suite about 1 time in 15k iterations. Also believed to reproduce easily with QE revolver.

Steps to Reproduce:
1. run current mp5
2. wait for failure
3. or
4. run revolver
5. analyze the failure to determine whether it has this cause
6. the failure can be detected if there are a bunch of single node configurations over and over at every token timeout.

Actual results:
mp5 or revolver fails

Expected results:
mp5 or revolver should not fail

Additional info:
The totem specification is in error in this case. There are four states which, in order of time, go GATHER, COMMIT, RECOVERY, OPERATIONAL. The protocol sets a variable new_memb_list on entrance to the RECOVERY state. This variable is used to determine whether messages should be rejected in the COMMIT state. Since the variable is used before it is set, it is invalid at that point.

This usually doesn't cause a problem because the commit token loss case is very difficult to reproduce. Once the commit token is lost, the my_new_memb variable usually contains enough membership information for the protocol to work correctly. Sometimes, however, it does not contain the processors from which it should accept join messages.

With the patch, the test case is on iteration 100k+ without a failure of the membership protocol via mp5. Patch attached to bugzilla.
Created attachment 203561 [details] initializes my_new_memb in entrance to commit state instead of recovery state.
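For readers without the attachment at hand, the ordering change can be illustrated with a simplified model. This is only a sketch, not the patch itself: the struct layout, `MAX_PROCS`, and the functions `set_new_memb`, `enter_commit`, `enter_recovery`, and `commit_state_accepts_join` are hypothetical names chosen for illustration, and only the state names and `my_new_memb` come from this report. It assumes that a join message in COMMIT is accepted when its sender appears in `my_new_memb`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROCS 16 /* hypothetical limit, for illustration only */

/* The four membership states, in order of time. */
enum memb_state { MEMB_GATHER, MEMB_COMMIT, MEMB_RECOVERY, MEMB_OPERATIONAL };

struct instance {
    enum memb_state state;
    unsigned int my_new_memb[MAX_PROCS]; /* node ids of the membership being formed */
    size_t my_new_memb_entries;
};

/* Hypothetical helper: record the membership agreed on during GATHER. */
static void set_new_memb(struct instance *inst, const unsigned int *ids, size_t n)
{
    if (n > MAX_PROCS) {
        n = MAX_PROCS;
    }
    memcpy(inst->my_new_memb, ids, n * sizeof(ids[0]));
    inst->my_new_memb_entries = n;
}

/* Patched ordering: my_new_memb is populated on entrance to COMMIT,
 * so it is valid before the COMMIT-state join check reads it. */
static void enter_commit(struct instance *inst, const unsigned int *ids, size_t n)
{
    set_new_memb(inst, ids, n);
    inst->state = MEMB_COMMIT;
}

/* Before the fix, the list was first populated here, which is too late
 * for any check that runs while still in COMMIT. */
static void enter_recovery(struct instance *inst)
{
    inst->state = MEMB_RECOVERY;
}

/* Assumed check: in COMMIT, accept a join only from a listed processor. */
static bool commit_state_accepts_join(const struct instance *inst, unsigned int sender)
{
    if (inst->state != MEMB_COMMIT) {
        return false;
    }
    for (size_t i = 0; i < inst->my_new_memb_entries; i++) {
        if (inst->my_new_memb[i] == sender) {
            return true;
        }
    }
    return false;
}
```

In this model, if the `set_new_memb` call were moved back into `enter_recovery`, a processor that loses the commit token would evaluate `commit_state_accepts_join` against whatever stale contents the list happened to hold, which matches the intermittent rejection of valid join messages described above.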
Devel ACK, since QE is hitting this frequently in testing at this point. A patch is available and we can spin the package for inclusion in RC1.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0599.html
*** Bug 246291 has been marked as a duplicate of this bug. ***