Bug 619570 - receipt of out of order regular message can result in token loss
Summary: receipt of out of order regular message can result in token loss
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais   
Version: 5.5
Hardware: All Linux
Target Milestone: rc
Assignee: Steven Dake
QA Contact: Cluster QE
Depends On: 619565
Reported: 2010-07-29 20:29 UTC by Steven Dake
Modified: 2016-04-26 16:22 UTC (History)

Fixed In Version: openais-0.80.6-25.el5
Doc Type: Bug Fix
Doc Text:
The receipt of out-of-order messages could have resulted in token loss.
Clone Of: 619565
Last Closed: 2011-01-13 23:57:39 UTC

Attachments
whitetank revision 2151 to fix problem (387 bytes, patch)
2010-07-29 20:41 UTC, Steven Dake

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0100 normal SHIPPED_LIVE openais bug fix update 2011-01-12 17:21:13 UTC

Description Steven Dake 2010-07-29 20:29:27 UTC
+++ This bug was initially created as a clone of Bug #619565 +++

Description of problem:
On 06/17/2010 07:16 PM, Tim Beale wrote:
> Hi,
> I'm running corosync on a setup where corosync packets are getting delayed and
> lost. I'm seeing corosync enter recovery mode repeatedly, which is then causing
> other problems for us. (We're running trunk as at revision 2569 (8 Dec 09), so
> some of these flow-on problems may already be fixed.)
> Corosync entering recovery mode repeatedly doesn't look like it's fixed on the
> latest trunk though. The problem is corosync is canceling its token retransmit
> timeout prematurely in message_handler_mcast().
> Corosync in this setup is getting some mcast packets received out of order. So
> corosync receives a mcast message with a lower seq than the last token it sent
> out and stops its token retransmit timer. If the token it just sent is lost,
> then it doesn't retransmit the token. The token timeout occurs and corosync
> enters gather/commit/recovery.
> I think the message_handler_mcast() code should also check the seq of the mcast
> message before stopping the retransmit timer (see attached patch). You can only
> guarantee the last token sent was successfully received if another node sends a
> mcast message with a higher seq.
> Does anyone see any problems with this patch?
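
The scenario in the quoted message can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct node_state`, `on_mcast_unconditional()` and `on_mcast_checked()` are hypothetical names, not corosync's real structures or the contents of the attached patch. The first handler models the reported behavior (any regular message stops the retransmit timer); the second models the proposed check, and deliberately ignores sequence-number wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified node state; not corosync's real data structures. */
struct node_state {
    uint32_t last_token_seq;      /* seq carried by the last token this node sent */
    bool retransmit_timer_active; /* true while the token retransmit timer is armed */
};

/* Reported behavior: receipt of any regular message stops the timer. */
static void on_mcast_unconditional(struct node_state *n, uint32_t msg_seq)
{
    (void)msg_seq; /* the message seq is not consulted at all */
    n->retransmit_timer_active = false;
}

/*
 * Proposed behavior: stop the timer only when the message seq is higher
 * than the seq of the last token sent, which shows that another node
 * received that token. Plain '>' is used here, so wraparound is NOT handled.
 */
static void on_mcast_checked(struct node_state *n, uint32_t msg_seq)
{
    if (msg_seq > n->last_token_seq) {
        n->retransmit_timer_active = false;
    }
}
```

With `last_token_seq` at 100, a delayed message with seq 90 disarms the timer in the first handler but leaves it armed in the second, so a lost token would still be retransmitted.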

Missed your email - sorry for long delay.

Thanks for pointing out the problem - you found a problem in the totem spec!  Your logic is sound and shows a good understanding of how totem works.

A simpler solution altogether may be to not cancel the token retransmit timer on receipt of a regular message.  I can see no good reason to cancel that timer, other than as a micro-optimization (at the expense of the comparisons for checking the seqid and ringid - looks like a wash).

The patch you submitted doesn't handle rollover of the token sequence number (i.e., when it reaches boundary conditions in the integer).
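
For illustration, a comparison that tolerates rollover can be written with modular arithmetic, in the style of serial number arithmetic. This is a minimal sketch, assuming a 32-bit unsigned sequence number; `seq_is_later()` is a hypothetical helper and is not taken from the attached patch or from the openais/corosync source.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if sequence number a is later than b in modulo-2^32
 * sequence space. The result is only meaningful while the two values
 * are less than 2^31 apart.
 */
static bool seq_is_later(uint32_t a, uint32_t b)
{
    /* Unsigned subtraction wraps, so the distance stays small across rollover. */
    return a != b && (uint32_t)(a - b) < UINT32_C(0x80000000);
}
```

A plain `a > b` would report that seq 5 is older than seq 0xFFFFFFFE, whereas `seq_is_later(5, 0xFFFFFFFE)` returns true because 5 is only seven steps ahead once the counter has wrapped.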


Version-Release number of selected component (if applicable):

How reproducible:
not sure

Steps to Reproduce:
1. not sure
Actual results:
The token is lost if out-of-order or delayed packets from nodes prior to this node are received shortly after a token is originated and that token is lost on the network.

Expected results:
no token loss in this situation

Additional info:

--- Additional comment from sdake@redhat.com on 2010-07-29 16:27:40 EDT ---

Created an attachment (id=435391)
patch to fix the problem

Comment 1 Steven Dake 2010-07-29 20:41:19 UTC
Created attachment 435399
whitetank revision 2151 to fix problem

Comment 5 Douglas Silas 2011-01-11 23:11:06 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    New Contents:
The receipt of out-of-order messages could have resulted in token loss.

Comment 7 errata-xmlrpc 2011-01-13 23:57:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

