Bug 544680

Summary: originating 206 or more messages in recovery causes totem to block
Product: Red Hat Enterprise Linux 5 Reporter: Steven Dake <sdake>
Component: openais Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 5.4 CC: cluster-maint, dejohnso, djansa, edamato, jwest
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: openais-0.80.6-12.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-03-30 07:48:42 UTC Type: ---
Bug Depends On:    
Bug Blocks: 584559, 584560    
Attachments:
Description Flags
patch that increases retransmit queue array to 16384 none
patch to make range checking asserts use define that recently changed rather then magic numbers none

Description Steven Dake 2009-12-06 00:26:41 UTC
Description of problem:
The totem protocol stops making forward progress if 206 or more messages are originated in recovery.  This happens under heavy message traffic, when the recoveries requested by all nodes together total 206 or more messages.

There is a circular buffer which holds a copy of each message before it is delivered in the normal OPERATIONAL mode.  Once all nodes have received a copy of a message, the message is freed from the circular buffer.  A flow control algorithm ensures that the circular buffer doesn't become too full by stopping new multicast requests from being processed when the number of buffered messages becomes too large.
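
The buffer and its flow control can be pictured with a minimal sketch. This is not the openais implementation: the names `ring`, `ring_in_use`, `ring_put`, `ring_release_up_to`, `RING_SIZE` and `WINDOW` are hypothetical, the sizes are the 256 and 50 quoted elsewhere in this report, and the rule "refuse once fewer than `WINDOW` slots remain" is an assumption chosen to match the 206 message threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256u   /* buffer size stated in this report */
#define WINDOW     50u   /* window size quoted in comment 1 */

struct ring {
    unsigned int last_released;   /* highest seq every node has received */
    unsigned int high_seq;        /* highest seq stored so far */
    const void *slot[RING_SIZE];  /* message copies, indexed by seq % RING_SIZE */
};

/* Messages still held because some node may lack a copy. */
static unsigned int ring_in_use(const struct ring *r)
{
    return r->high_seq - r->last_released;
}

/* Flow control (assumed rule): refuse once RING_SIZE - WINDOW are held. */
static bool ring_put(struct ring *r, const void *msg)
{
    if (ring_in_use(r) >= RING_SIZE - WINDOW) {
        return false;
    }
    r->high_seq++;
    r->slot[r->high_seq % RING_SIZE] = msg;
    return true;
}

/* Free every entry up to seq once all nodes report having a copy. */
static void ring_release_up_to(struct ring *r, unsigned int seq)
{
    /* seq must lie between last_released and high_seq (wrap-safe check). */
    assert(seq - r->last_released <= ring_in_use(r));
    while (r->last_released != seq) {
        r->last_released++;
        r->slot[r->last_released % RING_SIZE] = NULL;
    }
}
```

In this sketch `ring_put` starts returning false after 206 messages are held and accepts new ones again only after `ring_release_up_to` has freed some entries.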

The size of this buffer is 256 entries, which is sufficient for the OPERATIONAL case.  Unfortunately a different circular buffer that operates in the same way is used during the RECOVERY state.  The RECOVERY state requires us to recover all messages that any node in the previous OPERATIONAL state may not have a copy of.  To do this, each node originates the messages from near the end of the previous configuration that it has copies of but that other nodes report missing.

The flow control algorithm begins to stop new message requests once the "recovery buffer" contains 206 messages.  The recovery state is then unable to recover all messages and essentially blocks, because it can't send the new messages needed to meet its obligation to complete retransmission of lost messages.

The solution is simply to increase the size of the recovery queue to a more reasonable value.  For technical reasons, the size of the regular queue must also match.
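
A rough illustration of the headroom gained, assuming the limit is simply queue size minus window size (the relation given in comment 1) and that no entries are released while recovery is under way. The helper names `flow_control_limit` and `recovery_fits` are hypothetical; 16384 is the size named in the title of the attached patch.

```c
#include <assert.h>
#include <stdbool.h>

#define WINDOW_SIZE      50u     /* window size quoted in comment 1 */
#define QUEUE_SIZE_OLD   256u    /* size stated in this report */
#define QUEUE_SIZE_NEW   16384u  /* size named in the attached patch title */

/* Messages that may be outstanding before flow control stops new sends. */
static unsigned int flow_control_limit(unsigned int queue_size,
                                       unsigned int window_size)
{
    assert(queue_size > window_size);
    return queue_size - window_size;
}

/* True if a recovery that must originate `backlog` messages stays under
 * the limit (simplification: nothing is released in the meantime). */
static bool recovery_fits(unsigned int backlog, unsigned int queue_size)
{
    return backlog < flow_control_limit(queue_size, WINDOW_SIZE);
}
```

Under these assumptions a backlog of 206 messages does not fit in the old queue, while the new limit works out to 16384 - 50 = 16334 messages.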

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. start 32 node cluster with cman_tool join; fenced; fence_tool join
2. kill 8 nodes in the cluster, repeat
3. fails 10% of the time.
  
Actual results:
fails; the last message in the output indicates the RECOVERY state

Expected results:
finishes recovery

Additional info:

Comment 1 Steven Dake 2009-12-06 00:29:55 UTC
Note: the magic number 206 is 256 (buffer size) - 50 (window size).

code is:
/*
 * don't overflow the RTR sort queue
 */
static void fcc_rtr_limit (
        struct totemsrp_instance *instance,
        struct orf_token *token,
        unsigned int *transmits_allowed)
{
        assert ((QUEUE_RTR_ITEMS_SIZE_MAX - *transmits_allowed - instance->totem_config->window_size) >= 0);
        if (sq_lt_compare (instance->last_released +
                QUEUE_RTR_ITEMS_SIZE_MAX - *transmits_allowed -
                instance->totem_config->window_size,
                token->seq)) {
                *transmits_allowed = 0;
        }
}
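
For experimenting with the check outside the openais tree, here is a standalone sketch of the same comparison. It is an approximation under stated assumptions: `seq_lt` is a hypothetical stand-in for `sq_lt_compare` (whose real definition is not shown in this report), `RTR_QUEUE_SIZE` stands in for `QUEUE_RTR_ITEMS_SIZE_MAX` at its pre-patch value of 256, and the struct fields are passed as plain arguments. Because `*transmits_allowed` is `unsigned int`, the subtraction in the quoted assert is probably evaluated in unsigned arithmetic, in which case `>= 0` can never fail; the sketch checks the sum before subtracting instead.

```c
#include <assert.h>
#include <stdint.h>

#define RTR_QUEUE_SIZE 256u  /* stand-in for QUEUE_RTR_ITEMS_SIZE_MAX (pre-patch) */

/* Hypothetical wrap-aware "a < b" for 32-bit sequence numbers. */
static int seq_lt(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b) > 0x7FFFFFFFu;
}

/* Zero the transmit allowance when token_seq is past the flow control bound. */
static void rtr_limit(uint32_t last_released, uint32_t token_seq,
                      uint32_t window_size, uint32_t *transmits_allowed)
{
    /* Checked before subtracting: unsigned arithmetic wraps, it never
     * goes negative. */
    assert(*transmits_allowed + window_size <= RTR_QUEUE_SIZE);

    uint32_t bound = last_released + RTR_QUEUE_SIZE
                   - *transmits_allowed - window_size;
    if (seq_lt(bound, token_seq)) {
        *transmits_allowed = 0;
    }
}
```

With a window of 50 and no transmits requested, the bound sits 206 sequence numbers past `last_released`, which is where the figure in comment 1 comes from.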

Comment 2 Steven Dake 2009-12-06 00:35:06 UTC
Created attachment 376382 [details]
patch that increases retransmit queue array to 16384

Comment 3 Steven Dake 2009-12-07 18:29:40 UTC
Created attachment 376742 [details]
patch to make range checking asserts use define that recently changed rather then magic numbers

Comment 6 errata-xmlrpc 2010-03-30 07:48:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0180.html