Bug 544479

Summary: token timeout should be smaller than consensus timeout
Product: [Fedora] Fedora
Reporter: Steven Dake <sdake>
Component: cman
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: high
Version: rawhide
CC: agk, ccaulfie, cfeist, fdinitto, mbroz, swhiteho
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Clone Of:
: 544482
Last Closed: 2010-03-02 14:41:46 UTC

Description Steven Dake 2009-12-05 00:24:29 UTC
Description of problem:
The token timeout should be smaller than the consensus timeout. If this is not the case, it is possible for totem to "split-brain" itself under rare circumstances, especially with larger node counts, resulting in "disallowed node state" behavior.

During membership, nodes which have not reached consensus are added to a "failed" list when the consensus timer expires.

During the membership protocol, it is possible for a node to achieve consensus and enter the commit state. When this happens, there is a check that ignores new join messages whose ring sequence ids are older than the newly requested ring id. These join messages may contain information that is desirable to have in the new membership, and the other processors will not form consensus until the processor that is in commit has attempted to form consensus by exchanging join messages.

Unfortunately, the only events that take a processor out of the commit state are receiving its commit token again or the expiry of the token timeout. Under extremely rare circumstances, it is possible for a processor to reject the commit token because the token doesn't match its view of membership.

Before accepting a commit token, a processor verifies that the proposed membership matches its internal membership. If, at any time after the commit token is created and originated, one of the processors in the commit membership receives a join message (while it is still in the gather state), it will subsequently reject the commit token.
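
The rejection described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Processor` class, its method names, and the idea that a join message simply merges the sender's member list into the local view are hypothetical simplifications for illustration, not the actual totem code.

```python
class Processor:
    """Toy model of one node's membership view (hypothetical, not totem code)."""

    def __init__(self, name, view):
        self.name = name
        self.view = set(view)  # members this processor currently believes in

    def on_join(self, sender_view):
        # Assumption: a join message received while gathering merges the
        # sender's member list into the local view.
        self.view |= set(sender_view)

    def accepts_commit_token(self, proposed_members):
        # The token is accepted only if the proposal matches the local view.
        return set(proposed_members) == self.view


token_members = {"a", "b", "c"}
b = Processor("b", token_members)
accepted_before = b.accepts_commit_token(token_members)

# A late join message arrives while "b" is still in the gather state.
b.on_join({"a", "b", "c", "d"})
accepted_after = b.accepts_commit_token(token_members)

print(accepted_before, accepted_after)
```

In this model the token is accepted before the late join and rejected afterwards, which is the point at which the commit token gets stuck.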

When the token timeout period is 10 seconds (the Fedora default), a processor in the commit state rejects membership messages until the token timeout expires (because the commit token gets stuck at some processor rejecting it). Unfortunately, by that time some other processor's consensus timer, which is 4.8 seconds, has already expired. Since the processors that sat in the commit state for 10 seconds didn't participate in consensus gathering, they are determined "failed", and a failed processor confchg is delivered for every node that accepted the commit token. It then detects a new processor and forms a proper configuration.
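
The size of this timing window can be worked out with a short sketch. The 10 second and 4.8 second values come from this report; the function name and the simplification that both timers start at the same instant are assumptions made for illustration only.

```python
def falsely_failed_window_ms(token_ms: int, consensus_ms: int) -> int:
    """Return how long (in ms) a processor stuck in commit stays unresponsive
    after its peers' consensus timers have already expired.

    Simplifying assumption: both timers start at the same instant.
    A result of 0 means the token timeout fires first, so the stuck
    processor returns to gathering before anyone marks it failed.
    """
    return max(0, token_ms - consensus_ms)


# Values from the report: token timeout 10 s, consensus timeout 4.8 s.
window_default = falsely_failed_window_ms(10_000, 4_800)
# Hypothetical values where the consensus timeout is the larger one.
window_safe = falsely_failed_window_ms(10_000, 20_000)

print(window_default, window_safe)
```

Under these assumptions the reported defaults leave a window of roughly 5.2 seconds in which the committed processors can be declared failed, while a consensus timeout larger than the token timeout leaves none.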

In testing 32 nodes with default parameters I could trigger this case about 1 in 30 times.  To test, I used cman_tool join; fenced; fence_tool join on each of the 32 nodes, then killed 7 nodes in the cluster, and repeated.

Comment 1 Christine Caulfield 2009-12-07 09:49:53 UTC
commit 02a8b8872f59ac4933233aed31b3cfa39cda9db5
Author: Christine Caulfield <ccaulfie>
Date:   Mon Dec 7 09:46:05 2009 +0000

    cman: Make consensus twice token timeout
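
The rule named in the commit subject can be sketched as follows. This is not the cman patch itself: the function names are hypothetical, and the only facts taken from this report are that consensus is made twice the token timeout and that the token timeout should be the smaller of the two.

```python
def default_consensus_ms(token_ms: int) -> int:
    """Derive a consensus timeout as twice the token timeout."""
    if token_ms <= 0:
        raise ValueError("token timeout must be positive")
    return 2 * token_ms


def check_timeouts(token_ms: int, consensus_ms: int) -> None:
    """Raise ValueError unless the token timeout is smaller than consensus."""
    if token_ms >= consensus_ms:
        raise ValueError(
            f"token timeout ({token_ms} ms) must be smaller than "
            f"consensus timeout ({consensus_ms} ms)"
        )


token = 10_000  # 10 s, the Fedora default mentioned in the description
consensus = default_consensus_ms(token)
check_timeouts(token, consensus)  # passes: 10000 < 20000
print(consensus)
```

Deriving the consensus value from the token value, rather than configuring the two independently, keeps the ordering intact whenever the token timeout is changed.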