Bug 619570 - receipt of out of order regular message can result in token loss
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.5
Hardware: All Linux
Priority: low
Severity: high
Target Milestone: rc
Assigned To: Steven Dake
QA Contact: Cluster QE
Depends On: 619565
Blocks:
Reported: 2010-07-29 16:29 EDT by Steven Dake
Modified: 2016-04-26 12:22 EDT

See Also:
Fixed In Version: openais-0.80.6-25.el5
Doc Type: Bug Fix
Doc Text:
The receipt of out-of-order messages could have resulted in token loss.
Clone Of: 619565
Last Closed: 2011-01-13 18:57:39 EST


Attachments
whitetank revision 2151 to fix problem (387 bytes, patch)
2010-07-29 16:41 EDT, Steven Dake


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2011:0100
Priority: normal
Status: SHIPPED_LIVE
Summary: openais bug fix update
Last Updated: 2011-01-12 12:21:13 EST

Description Steven Dake 2010-07-29 16:29:27 EDT
+++ This bug was initially created as a clone of Bug #619565 +++

Description of problem:
On 06/17/2010 07:16 PM, Tim Beale wrote:
> Hi,
> 
> I'm running corosync on a setup where corosync packets are getting delayed and
> lost. I'm seeing corosync enter recovery mode repeatedly, which is then causing
> other problems for us. (We're running trunk as at revision 2569 (8 Dec 09), so
> some of these flow-on problems may already be fixed.)
> 
> Corosync entering recovery mode repeatedly doesn't look like it's fixed on the
> latest trunk though. The problem is corosync is canceling its token retransmit
> timeout prematurely in message_handler_mcast().
> 
> Corosync in this setup is getting some mcast packets received out of order. So
> corosync receives a mcast message with a lower seq than the last token it sent
> out and stops its token retransmit timer. If the token it just sent is lost,
> then it doesn't retransmit the token. The token timeout occurs and corosync
> enters gather/commit/recovery.
> 
> I think the message_handler_mcast() code should also check the seq of the mcast
> message before stopping the retransmit timer (see attached patch). You can only
> guarantee the last token sent was successfully received if another node sends a
> mcast message with a higher seq.
> 
> Does anyone see any problems with this patch?
> 
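
To illustrate the check proposed in the message above, here is a minimal C sketch. The struct and function names are hypothetical and are not taken from the corosync or openais sources; they only model the decision made in message_handler_mcast(). The comparison deliberately uses a plain `>` and therefore ignores sequence-number wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-node state (not the real totem structures). */
struct node_state {
    uint32_t last_token_seq;      /* seq carried by the token this node last sent */
    bool retransmit_timer_active; /* true while the token may still be resent */
};

/* Behaviour described in the report: any regular message stops the timer. */
void on_mcast_unconditional(struct node_state *n, uint32_t msg_seq)
{
    assert(n != NULL);
    (void)msg_seq; /* the seq is not inspected at all */
    n->retransmit_timer_active = false;
}

/* Proposed check: only a message with a higher seq than the last token
 * shows that another node received that token. Assumption: no wraparound. */
void on_mcast_checked(struct node_state *n, uint32_t msg_seq)
{
    assert(n != NULL);
    if (msg_seq > n->last_token_seq) {
        n->retransmit_timer_active = false;
    }
}
```

With last_token_seq set to 100, a delayed message carrying seq 90 leaves the timer running under on_mcast_checked(), so a lost token would still be retransmitted; under on_mcast_unconditional() the timer is stopped and the token timeout follows.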

Missed your email - sorry for long delay.

Thanks for pointing out the problem - you found a problem in the totem spec!  Your logic is sound and shows a good understanding of how totem works.

A simpler solution altogether may be to not cancel the token retransmit timer on receipt of a regular message at all.  I can see no good reason to cancel that timer, other than as a micro-optimization (at the expense of the comparisons for checking the seqid and ringid, so it looks like a wash).

The patch you submitted doesn't handle rollover of the token sequence number (i.e., when it reaches the boundary conditions of the integer).
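
As an illustration of the rollover concern, the following C sketch shows one common way to compare sequence numbers that wrap around (serial number arithmetic on a 32-bit counter). It is an assumption for illustration only: the function name seq_after is hypothetical, the 32-bit width is assumed, and this is not necessarily how the attached patch or the totem code handles the case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if seq a is "after" seq b on a 32-bit counter that wraps.
 * Unsigned subtraction wraps modulo 2^32, so a small positive difference
 * means a is ahead of b even when a has already rolled over past zero. */
bool seq_after(uint32_t a, uint32_t b)
{
    uint32_t diff = a - b;
    return diff != 0 && diff < UINT32_C(0x80000000);
}

/* Small self-check exercising ordinary and boundary cases. */
void seq_after_selfcheck(void)
{
    assert(seq_after(101u, 100u));                  /* ordinary case */
    assert(!seq_after(90u, 100u));                  /* stale message */
    assert(!seq_after(7u, 7u));                     /* equal is not after */
    assert(seq_after(5u, UINT32_C(0xFFFFFFF0)));    /* 5 follows a rollover */
    assert(!seq_after(UINT32_C(0xFFFFFFF0), 5u));   /* pre-rollover is older */
}
```

A plain `a > b` would report that 5 is not after 0xFFFFFFF0, which is the boundary case in question; the wrapped difference treats 5 as 21 steps ahead.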

Regards
-steve

Version-Release number of selected component (if applicable):
corosync-1.2.3-17.el6

How reproducible:
not sure

Steps to Reproduce:
1. not sure
Actual results:
The token is lost if out-of-order or delayed packets from nodes prior to this node are received shortly after this node originates a token and that token is then lost on the network.

Expected results:
no token loss in this situation

Additional info:

--- Additional comment from sdake@redhat.com on 2010-07-29 16:27:40 EDT ---

Created an attachment (id=435391)
patch to fix the problem
Comment 1 Steven Dake 2010-07-29 16:41:19 EDT
Created attachment 435399
whitetank revision 2151 to fix problem
Comment 5 Douglas Silas 2011-01-11 18:11:06 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The receipt of out-of-order messages could have resulted in token loss.
Comment 7 errata-xmlrpc 2011-01-13 18:57:39 EST
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0100.html
