Bug 619565 - receipt of out of order regular message can result in token loss
Summary: receipt of out of order regular message can result in token loss
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 619570
 
Reported: 2010-07-29 20:22 UTC by Steven Dake
Modified: 2016-04-26 15:08 UTC (History)

Fixed In Version: corosync-1.2.3-18.el6
Doc Type: Bug Fix
Doc Text:
The receipt of out-of-order messages could have resulted in token loss.
Clone Of:
: 619570 (view as bug list)
Environment:
Last Closed: 2010-11-15 13:53:55 UTC
Target Upstream Version:
Embargoed:


Attachments
patch to fix the problem (401 bytes, patch)
2010-07-29 20:27 UTC, Steven Dake

Description Steven Dake 2010-07-29 20:22:59 UTC
Description of problem:
On 06/17/2010 07:16 PM, Tim Beale wrote:
> Hi,
> 
> I'm running corosync on a setup where corosync packets are getting delayed and
> lost. I'm seeing corosync enter recovery mode repeatedly, which is then causing
> other problems for us. (We're running trunk as at revision 2569 (8 Dec 09), so
> some of these flow-on problems may already be fixed.)
> 
> Corosync entering recovery mode repeatedly doesn't look like it's fixed on the
> latest trunk though. The problem is corosync is canceling its token retransmit
> timeout prematurely in message_handler_mcast().
> 
> Corosync in this setup is getting some mcast packets received out of order. So
> corosync receives a mcast message with a lower seq than the last token it sent
> out and stops its token retransmit timer. If the token it just sent is lost,
> then it doesn't retransmit the token. The token timeout occurs and corosync
> enters gather/commit/recovery.
> 
> I think the message_handler_mcast() code should also check the seq of the mcast
> message before stopping the retransmit timer (see attached patch). You can only
> guarantee the last token sent was successfully received if another node sends a
> mcast message with a higher seq.
> 
> Does anyone see any problems with this patch?
> 

Missed your email - sorry for long delay.

Thanks for pointing out the problem - you found a problem in the totem spec!  Your logic is sound - showing a good understanding of how totem works.

A simpler solution altogether may just be to not cancel the token retransmit timer on receipt of a regular message.  I can see no good reason to cancel that timer, other than as a micro-optimization (at the expense of the comparisons for checking the seqid and ringid - looks like a wash).

The patch you submitted doesn't handle rollover of the token (i.e., when it reaches boundary conditions in the integer).

Regards
-steve
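
The two points in the exchange above (cancel the retransmit timer only for a message with a higher seq, and handle integer rollover when comparing) can be illustrated with a minimal sketch. This is not the attached patch and not corosync's code: the names seq_after, ring_state, and on_regular_message are hypothetical, and a 32-bit unsigned sequence counter is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a corosync function): returns true when
 * sequence number a is logically newer than b, treating the 32-bit
 * counter as circular. A plain a > b gives the wrong answer once the
 * counter wraps past UINT32_MAX.
 */
static bool seq_after(uint32_t a, uint32_t b)
{
    uint32_t d = (uint32_t)(a - b);  /* distance modulo 2^32 */
    return d != 0 && d < 0x80000000u;
}

/* Hypothetical per-ring state, reduced to what this example needs. */
struct ring_state {
    uint32_t last_token_seq;        /* seq carried by the last token sent */
    bool retransmit_timer_active;   /* token retransmit timer is armed */
};

/*
 * Sketch of the check proposed in the quoted email: a regular message
 * proves the token was received only if its seq is higher than the
 * seq of the token this node last sent, so older (out-of-order)
 * messages leave the retransmit timer running.
 */
static void on_regular_message(struct ring_state *rs, uint32_t msg_seq)
{
    if (seq_after(msg_seq, rs->last_token_seq)) {
        rs->retransmit_timer_active = false;
    }
}
```

The simpler alternative suggested in the reply, never cancelling the retransmit timer on a regular message, would amount to dropping the body of on_regular_message entirely, which also removes the need for the comparison.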

Version-Release number of selected component (if applicable):
corosync-1.2.3-17.el6

How reproducible:
not sure

Steps to Reproduce:
1. not sure
2.
3.
  
Actual results:
The token is lost if out-of-order or delayed packets from nodes prior to this node are received shortly after a token is originated and that token is then lost on the network.

Expected results:
no token loss in this situation

Additional info:

Comment 1 Steven Dake 2010-07-29 20:27:40 UTC
Created attachment 435391
patch to fix the problem

Comment 4 Steven Dake 2010-07-29 21:28:55 UTC
scratch build:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2639667

Comment 7 releng-rhel@redhat.com 2010-11-15 13:53:55 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Comment 8 Douglas Silas 2011-01-11 23:11:10 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The receipt of out-of-order messages could have resulted in token loss.

