Bug 619565 - receipt of out of order regular message can result in token loss
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Target Milestone: rc
Assignee: Steven Dake
QA Contact: Cluster QE
Keywords: OtherQA
Depends On:
Blocks: 619570
Reported: 2010-07-29 20:22 UTC by Steven Dake
Modified: 2016-04-26 15:08 UTC (History)

The receipt of out-of-order messages could have resulted in token loss.
Clone Of:
: 619570 (view as bug list)
Last Closed: 2010-11-15 13:53:55 UTC

Attachments
patch to fix the problem (401 bytes, patch)
2010-07-29 20:27 UTC, Steven Dake

Description Steven Dake 2010-07-29 20:22:59 UTC
Description of problem:
On 06/17/2010 07:16 PM, Tim Beale wrote:
> Hi,
> I'm running corosync on a setup where corosync packets are getting delayed and
> lost. I'm seeing corosync enter recovery mode repeatedly, which is then causing
> other problems for us. (We're running trunk as at revision 2569 (8 Dec 09), so
> some of these flow-on problems may already be fixed.)
> Corosync entering recovery mode repeatedly doesn't look like it's fixed on the
> latest trunk though. The problem is corosync is canceling its token retransmit
> timeout prematurely in message_handler_mcast().
> Corosync in this setup is getting some mcast packets received out of order. So
> corosync receives a mcast message with a lower seq than the last token it sent
> out and stops its token retransmit timer. If the token it just sent is lost,
> then it doesn't retransmit the token. The token timeout occurs and corosync
> enters gather/commit/recovery.
> I think the message_handler_mcast() code should also check the seq of the mcast
> message before stopping the retransmit timer (see attached patch). You can only
> guarantee the last token sent was successfully received if another node sends a
> mcast message with a higher seq.
> Does anyone see any problems with this patch?

Missed your email - sorry for the long delay.

Thanks for pointing out the problem - you found a problem in the totem spec!  Your logic is sound - showing a good understanding of how totem works.

A simpler solution altogether may be to not cancel the token retransmit timer on receipt of a regular message.  I can see no good reason to cancel that timer, other than as a micro-optimization (at the expense of the comparisons for checking the seqid and ringid - looks like a wash).

The patch you submitted doesn't handle rollover of the token sequence number (i.e., when it reaches the integer boundary and wraps around).
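
For illustration only, here is a minimal sketch of a sequence check that tolerates rollover. It is not the attached patch and not corosync's actual code: the names ring_state, last_token_seq, retransmit_timer_on, seq_after and on_mcast are all hypothetical, and a 32-bit unsigned sequence number is assumed. The idea is to compare the difference of the two sequence numbers modulo 2^32 rather than the raw values, so a message sent just after the wrap still counts as newer than a token sent just before it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring state; NOT corosync's real structures. */
struct ring_state {
    uint32_t last_token_seq;    /* seq carried by the token we last sent */
    bool retransmit_timer_on;   /* token retransmit timer is armed */
};

/*
 * True if a is strictly newer than b modulo 2^32.
 * Assumes the two values are less than 2^31 apart and that the
 * unsigned-to-signed conversion wraps (two's complement), which is
 * the behavior on common platforms.
 */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/*
 * Hypothetical mcast handler: only a message with a newer seq than
 * the token we sent shows that the token was received, so only then
 * is the retransmit timer cancelled. A stale message leaves it armed.
 */
static void on_mcast(struct ring_state *rs, uint32_t msg_seq)
{
    if (seq_after(msg_seq, rs->last_token_seq)) {
        rs->retransmit_timer_on = false;
    }
}
```

With last_token_seq at 0xFFFFFFFE, a delayed message with seq 0xFFFFFFF0 leaves the timer armed, while a message with seq 3 (sent after the wrap) cancels it. A plain msg_seq > last_token_seq test would get the second case wrong.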


Version-Release number of selected component (if applicable):

How reproducible:
not sure

Steps to Reproduce:
1. not sure
Actual results:
The token is lost if out-of-order or delayed packets from nodes prior to this node are received shortly after a token is originated and then lost on the network.

Expected results:
No token loss in this situation.

Additional info:

Comment 1 Steven Dake 2010-07-29 20:27:40 UTC
Created attachment 435391 [details]
patch to fix the problem

Comment 4 Steven Dake 2010-07-29 21:28:55 UTC
scratch build:


Comment 7 releng-rhel@redhat.com 2010-11-15 13:53:55 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Comment 8 Douglas Silas 2011-01-11 23:11:10 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    New Contents:
The receipt of out-of-order messages could have resulted in token loss.
