Description of problem:
On 06/17/2010 07:16 PM, Tim Beale wrote:
> I'm running corosync on a setup where corosync packets are getting delayed and
> lost. I'm seeing corosync enter recovery mode repeatedly, which is then causing
> other problems for us. (We're running trunk as at revision 2569 (8 Dec 09), so
> some of these flow-on problems may already be fixed.)
> Corosync entering recovery mode repeatedly doesn't look like it's fixed on the
> latest trunk though. The problem is corosync is canceling its token retransmit
> timeout prematurely in message_handler_mcast().
> Corosync in this setup is getting some mcast packets received out of order. So
> corosync receives a mcast message with a lower seq than the last token it sent
> out and stops its token retransmit timer. If the token it just sent is lost,
> then it doesn't retransmit the token. The token timeout occurs and corosync
> enters gather/commit/recovery.
> I think the message_handler_mcast() code should also check the seq of the mcast
> message before stopping the retransmit timer (see attached patch). You can only
> guarantee the last token sent was successfully received if another node sends a
> mcast message with a higher seq.
> Does anyone see any problems with this patch?
Missed your email - sorry for long delay.
Thanks for pointing out the problem - you found a flaw in the totem spec! Your logic is sound and shows a good understanding of how totem works.
A simpler solution altogether may be to not cancel the token retransmit timer at all on receipt of a regular message. I can see no good reason to cancel that timer, other than as a micro-optimization (at the expense of the comparisons needed to check the seqid and ringid - it looks like a wash).
The patch you submitted doesn't handle rollover of the token sequence number (i.e., when the counter reaches its integer boundary and wraps).
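To illustrate the rollover concern, here is a minimal sketch (not the actual corosync code; `seq_later_than`, `on_mcast`, and `cancel_token_retransmit_timeout` are hypothetical stand-ins) of a wrap-safe "later than" comparison in the style of RFC 1982 serial arithmetic, used to guard the timer cancel as the report proposes: only cancel the token retransmit timeout when an incoming mcast seq is strictly later than the seq of the last token sent.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wrap-safe comparison for 32-bit sequence numbers: a is "later" than b
 * if the modulo-2^32 distance (a - b) falls in the lower half of the
 * range.  This keeps working when the counter wraps past UINT32_MAX,
 * which a plain "mcast_seq > token_seq" test does not.
 */
static int seq_later_than(uint32_t a, uint32_t b)
{
	return a != b && (uint32_t)(a - b) < UINT32_C(0x80000000);
}

/* Illustrative stand-in for stopping the token retransmit timeout. */
static int timer_cancelled;
static void cancel_token_retransmit_timeout(void)
{
	timer_cancelled = 1;
}

/*
 * Guarded cancel as described in the report: a mcast message only
 * proves the last token was received if its seq is later than the seq
 * carried by that token, so an out-of-order older mcast must not stop
 * the timer.
 */
static void on_mcast(uint32_t mcast_seq, uint32_t last_token_seq)
{
	if (seq_later_than(mcast_seq, last_token_seq))
		cancel_token_retransmit_timeout();
}
```

With this guard, a delayed or reordered mcast carrying an older seq leaves the retransmit timer armed, so a lost token is still retransmitted instead of forcing the ring into gather/commit/recovery.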
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Not sure.
Actual results:
The token is lost when out-of-order or delayed packets from nodes prior to this node are received shortly after a token is originated and then lost on the network.
Expected results:
No token loss in this situation.
Created attachment 435391 [details]
patch to fix the problem
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
The receipt of out-of-order messages could have resulted in token loss.