Bug 619565 - receipt of out of order regular message can result in token loss
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
QA Contact: Cluster QE
: OtherQA
Depends On:
Blocks: 619570
Reported: 2010-07-29 16:22 EDT by Steven Dake
Modified: 2016-04-26 11:08 EDT (History)

See Also:
Fixed In Version: corosync-1.2.3-18.el6
Doc Type: Bug Fix
Doc Text:
The receipt of out-of-order messages could have resulted in token loss.
Story Points: ---
Clone Of:
: 619570 (view as bug list)
Last Closed: 2010-11-15 08:53:55 EST

Attachments
patch to fix the problem (401 bytes, patch)
2010-07-29 16:27 EDT, Steven Dake

Description Steven Dake 2010-07-29 16:22:59 EDT
Description of problem:
On 06/17/2010 07:16 PM, Tim Beale wrote:
> Hi,
> I'm running corosync on a setup where corosync packets are getting delayed and
> lost. I'm seeing corosync enter recovery mode repeatedly, which is then causing
> other problems for us. (We're running trunk as at revision 2569 (8 Dec 09), so
> some of these flow-on problems may already be fixed.)
> Corosync entering recovery mode repeatedly doesn't look like it's fixed on the
> latest trunk though. The problem is corosync is canceling its token retransmit
> timeout prematurely in message_handler_mcast().
> Corosync in this setup is getting some mcast packets received out of order. So
> corosync receives a mcast message with a lower seq than the last token it sent
> out and stops its token retransmit timer. If the token it just sent is lost,
> then it doesn't retransmit the token. The token timeout occurs and corosync
> enters gather/commit/recovery.
> I think the message_handler_mcast() code should also check the seq of the mcast
> message before stopping the retransmit timer (see attached patch). You can only
> guarantee the last token sent was successfully received if another node sends a
> mcast message with a higher seq.
> Does anyone see any problems with this patch?

Missed your email - sorry for the long delay.

Thanks for pointing out the problem - you found a problem in the totem spec! Your logic is sound and shows a good understanding of how totem works.

A simpler solution altogether may be to not cancel the token retransmit timer on receipt of a regular message. I can see no good reason to cancel that timer, other than as a micro-optimization (and that comes at the expense of the comparisons needed to check the seqid and ringid, so it looks like a wash).
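
To make the difference between the two policies concrete, here is a toy model of the token retransmit timer. It is only a sketch: the struct, enum, and function names are hypothetical and are not taken from the corosync source, and the model assumes the timer is armed when the token is sent and legitimately cleared when the token comes back around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; these names are hypothetical, not corosync's. */
struct node_model {
    bool retransmit_timer_armed; /* armed when this node sends the token on */
};

enum cancel_policy {
    CANCEL_ON_ANY_MCAST,  /* behavior described in the report */
    CANCEL_ON_TOKEN_ONLY  /* alternative: regular messages leave the timer alone */
};

static void on_token_sent(struct node_model *n)
{
    assert(n != NULL);
    n->retransmit_timer_armed = true;
}

static void on_mcast_received(struct node_model *n, enum cancel_policy p)
{
    assert(n != NULL);
    if (p == CANCEL_ON_ANY_MCAST) {
        /* A delayed, out-of-order message disarms the timer here even
         * though it proves nothing about the token that was just sent. */
        n->retransmit_timer_armed = false;
    }
}

static void on_token_received(struct node_model *n)
{
    assert(n != NULL);
    /* Assumed: seeing the token again shows the earlier send succeeded. */
    n->retransmit_timer_armed = false;
}

/* Returns true if the node would still retransmit a lost token after a
 * stale regular message arrives. */
static bool survives_stale_mcast(enum cancel_policy p)
{
    struct node_model n = { false };
    on_token_sent(&n);
    on_mcast_received(&n, p);
    return n.retransmit_timer_armed;
}
```

Under `CANCEL_ON_ANY_MCAST` the stale message leaves the node with no retransmit pending, so a lost token goes unrepaired until the token timeout fires. Under `CANCEL_ON_TOKEN_ONLY` the timer stays armed and no seqid or ringid comparison is needed.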

The patch you submitted doesn't handle rollover of the token's seq (i.e., when it wraps at the boundary of the integer type).
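
As an illustration of the rollover concern, the sketch below contrasts a plain `>` comparison with a wraparound-aware one. It assumes 32-bit unsigned sequence numbers and that two live values are never more than 2^31 apart; `seq_after` and `seq_after_naive` are hypothetical helpers written for this note, not functions from corosync or from the attached patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when sequence number `a` is logically newer
 * than `b`. Assumes 32-bit counters that wrap and a window under 2^31. */
static bool seq_after(uint32_t a, uint32_t b)
{
    /* Unsigned subtraction wraps modulo 2^32, so the forward distance
     * from b to a stays small even when a has rolled over past zero. */
    uint32_t distance = a - b;
    return distance != 0 && distance < UINT32_C(0x80000000);
}

/* Plain comparison, shown for contrast: it gives the wrong answer once
 * the counter rolls over. */
static bool seq_after_naive(uint32_t a, uint32_t b)
{
    return a > b;
}
```

For example, with `b = 0xFFFFFFFE` and `a = 2` (four increments later, across the wrap), `seq_after(a, b)` is true while `seq_after_naive(a, b)` is false.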


Version-Release number of selected component (if applicable):

How reproducible:
not sure

Steps to Reproduce:
1. not sure
Actual results:
The token is lost if out-of-order or delayed packets from nodes prior to this node arrive shortly after this node originates a token and that token is then lost on the network.

Expected results:
no token loss in this situation

Additional info:
Comment 1 Steven Dake 2010-07-29 16:27:40 EDT
Created attachment 435391
patch to fix the problem
Comment 4 Steven Dake 2010-07-29 17:28:55 EDT
scratch build:

Comment 7 releng-rhel@redhat.com 2010-11-15 08:53:55 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
Comment 8 Douglas Silas 2011-01-11 18:11:10 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    New Contents:
The receipt of out-of-order messages could have resulted in token loss.
