Bug 614219 - token timer is reset on each received retransmitted token resulting in membership meltdown in some conditions
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
QA Contact: Cluster QE
Depends On:
Blocks: 614222
Reported: 2010-07-13 17:34 EDT by Steven Dake
Modified: 2016-04-26 11:02 EDT
CC List: 1 user

See Also:
Fixed In Version: corosync-1.2.3-13.el6
Doc Type: Bug Fix
Doc Text:
An internal timer variable was reset on each token retransmission rather than only on original token transmission; this has been fixed in this updated package.
Story Points: ---
Clone Of:
Clones: 614222
Environment:
Last Closed: 2010-11-10 17:07:23 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
patch to fix problem (925 bytes, patch)
2010-07-13 17:42 EDT, Steven Dake

Description Steven Dake 2010-07-13 17:34:59 EDT
Description of problem:
The totem specification is clear:
When a retransmitted token is received, it should be dropped.

When a new token is received, it should reset the token timeout.

This keeps the timers related to token expiration firing at about the same time on all nodes.  If the timer is instead reset on each token retransmission, some nodes can remain in the operational state (because retransmitted tokens keep resetting their token-loss timeout) while other nodes have already detected a failure.  Token loss should be detected by every node that has not received the token, but a retransmitted token keeps resetting the token timeout and prevents that.
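
A minimal sketch of that behaviour follows.  This is illustrative only: the function names are made up for this example and it is not the attached patch or actual corosync code.

/* Illustrative sketch only -- not corosync code and not the attached patch.
 * It demonstrates the totem-spec behaviour described above: a retransmitted
 * token is dropped without touching the token-loss timer, and only an
 * original token resets the timeout. */
#include <stdbool.h>
#include <stdio.h>

static void reset_token_loss_timeout(void) { puts("token-loss timeout reset"); }
static void process_token(void)            { puts("token processed"); }

static void handle_token(bool is_retransmitted)
{
    if (is_retransmitted) {
        /* retransmitted token: drop it, leave the token-loss timer alone */
        puts("retransmitted token dropped");
        return;
    }
    /* original token: the only place the token-loss timeout is reset */
    reset_token_loss_timeout();
    process_token();
}

int main(void)
{
    handle_token(false);  /* original token resets the timeout */
    handle_token(true);   /* retransmitted token is dropped    */
    return 0;
}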

To see this in practice, consider a 4-node cluster with token=5000 (5 sec) and a retransmit rate of 1.5 seconds.  One of the nodes will still be in the operational state because the token loss events cascade: n1 waits 5 seconds, then n2 waits another 5 seconds, then n3 waits another 5 seconds.  By the time the cascade reaches the 3rd node, that node thinks everything is fine when in fact it has failed to receive its token within its allotted timeout, violating the proof of the algorithm.
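
The effect on a single node can also be sketched numerically.  Again this is only an illustration; the timing model below is a simplification of the reproducer, not corosync code.

/* Simplified timing model, not corosync code: original token traffic has
 * stopped, but a retransmitted token still arrives every 1.5 s.  With the
 * buggy behaviour (reset the token-loss timer on every retransmission) the
 * 5 s timeout is never reached; with the correct behaviour the node
 * declares token loss shortly after 5 s. */
#include <stdio.h>

static void simulate(int reset_on_retransmit)
{
    const double token_timeout = 5.0;       /* token=5000 from the reproducer */
    const double retransmit_interval = 1.5; /* retransmit rate                */
    double now = 0.0, timer_start = 0.0;
    int i;

    for (i = 1; i <= 10; i++) {
        now = i * retransmit_interval;       /* another retransmitted token */
        if (now - timer_start >= token_timeout) {
            printf("reset_on_retransmit=%d: token loss detected at %.1f s\n",
                   reset_on_retransmit, now);
            return;
        }
        if (reset_on_retransmit)
            timer_start = now;               /* buggy: timer never expires */
    }
    printf("reset_on_retransmit=%d: no token loss after %.1f s\n",
           reset_on_retransmit, now);
}

int main(void)
{
    simulate(1);   /* buggy behaviour   */
    simulate(0);   /* correct behaviour */
    return 0;
}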

Version-Release number of selected component (if applicable):
corosync-1.2.3-12.el6

How reproducible:
100%

Steps to Reproduce:
1. Start a 4-node corosync cluster.
2. Set token=5000, consensus=7000, join=60 (see the sample totem settings after this list).
3. Ctrl-Z one of the four nodes (Ctrl-C is different; it sends a special message to exit the node).
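
One way to apply the step-2 values is in the totem section of corosync.conf; this snippet assumes the standard /etc/corosync/corosync.conf layout and is not the exact file used in the reproducer.

# Sketch of the totem settings from step 2; token, consensus and join
# are in milliseconds.
totem {
        version: 2
        token: 5000
        consensus: 7000
        join: 60
}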
  
Actual results:
the membership protocol melts down and bad things happen (tm)

Expected results:
Token loss is detected by all nodes at roughly the same time.

Additional info:
1-liner patch
Comment 1 Steven Dake 2010-07-13 17:42:58 EDT
Created attachment 431603 [details]
patch to fix problem
Comment 3 Nate Straz 2010-08-13 14:19:05 EDT
Verified w/ corosync-1.2.3-17.el6 using the steps to reproduce.
Comment 4 releng-rhel@redhat.com 2010-11-10 17:07:23 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
Comment 5 Douglas Silas 2011-01-11 18:12:30 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
An internal timer variable was reset on each token retransmission rather than only on original token transmission; this has been fixed in this updated package.
