Bug 619536 - make openais more resilient to delayed multicast packets
Summary: make openais more resilient to delayed multicast packets
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.5
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 619496
Blocks: 621264 624517 624518
 
Reported: 2010-07-29 18:52 UTC by Steven Dake
Modified: 2018-10-27 13:13 UTC (History)

Fixed In Version: openais-0.80.6-27.el5
Doc Type: Bug Fix
Doc Text:
OpenAIS has been enabled to work in network environments wherein multicast messages are slightly delayed when compared to token messages.
Clone Of: 619496
Environment:
Last Closed: 2011-01-13 23:57:14 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
whitetank patch to address problem (6.09 KB, patch)
2010-08-01 19:01 UTC, Steven Dake


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0100 0 normal SHIPPED_LIVE openais bug fix update 2011-01-12 17:21:13 UTC

Description Steven Dake 2010-07-29 18:52:36 UTC
+++ This bug was initially created as a clone of Bug #619496 +++

Description of problem:
Many network switches use a software component to "emulate multicast": the sender transmits a multicast packet to the switch, and the switch then forwards it to every member of the IGMP group.  This multicast path has extra latency compared to the unicast token (I've measured about 200 usec).  When a processor receives a token, it adds all unreceived messages to a retransmit list.  These retransmits consume extra network bandwidth when the regular multicast message is in fact not lost, just delayed.

Version-Release number of selected component (if applicable):
corosync-1.2.3-17.e6

How reproducible:
seems 100% using Cisco infrastructure in RH IT labs

Steps to Reproduce:
1. start two node corosync cluster with totem configured to output debug info
2. run cpgbench
3. see retransmits occur

We can tell multicast is delayed by adding a small delay before transmitting the token.  Another mechanism is to use traffic shaping netem as follows to delay the token:
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1ms
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 10.16.144.40/32 flowid 1:3

(note 10.16.144.40 is the target of the next token).
  
Actual results:
when multicast is delayed, totem retransmits messages unnecessarily

Expected results:
no messages should be transmitted unnecessarily

Additional info:

--- Additional comment from sdake on 2010-07-29 12:55:07 EDT ---

For those that don't see this problem in their switches, it is possible to emulate via netem by changing the ip address above to the multicast address (hence introducing a 1ms multicast transmit delay).

--- Additional comment from sdake on 2010-07-29 14:06:42 EDT ---

Created an attachment (id=435364)
patch that introduces the tuneable

Comment 1 Steven Dake 2010-08-01 19:01:31 UTC
Created attachment 435908 [details]
whitetank patch to address problem

Comment 2 Steven Dake 2010-08-01 19:02:34 UTC
Tested with cpgbench on two node cluster via scratch build:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2644470

Retransmits no longer occur with patch.

Comment 10 Steven Dake 2010-09-02 19:45:58 UTC
Scott had asked for more detail on the 200usec delay issue which is triggering multicast retransmits.

Corosync implements the Totem Single Ring Protocol:
http://www.cs.jhu.edu/~yairamir/tocs.ps

When Totem was first designed, switches were "dumb".  They essentially treated multicast as broadcast.  The switch hardware copied any multicast IP packets to every port on the switch.

Modern switches use a software component called an "IGMP Querier" to first receive a copy of a multicast packet, and then copy it to every port that is within the IGMP group.  This "two-hop" multicast apparently has 200usec delay compared to a "one-hop" UDP transmit.

As the Totem authors couldn't predict that modern switching hardware would evolve in this way, there is a gap in the specification.

Now a little detail about how totem works.

A membership protocol composes a logical ring based upon the membership determined by a membership algorithm.

In this example, there are 3 nodes, A, B, and C.  A has 4 messages to transmit, B has 6 messages to transmit, C has 2 messages to transmit.

A token acts as a global time stamp to provide total order messaging and recover lost messages.

Processor A originates a token and sends it to B with seq=0.  B has 6 messages to transmit, so it multicasts these messages (0-5) then forwards a token to C with seq=6.  C receives the unicast UDP token and the multicast messages with sequence numbers 0,1.  Seq 2-5 are delayed (sitting in the IGMP querier component).  In early switch designs, there may have been some reordering, but there was never delay compared to the unicast token.  Delay is permitted by UDP/IP, however it did not happen in early switch designs, so the Totem designers did not handle that scenario.

Continuing the example, when a token is received, totem flushes the multicast sockets (to receive any multicast messages that arrived prior to the token).  Then node C processes the token by adding the messages it thinks are missing (2-5) to the retransmit list.  Node C delivers 0, 1 to the application.  Node C then transmits its two pending messages (6,7) and forwards the token to node A with seq=8, retransmit=2,3,4,5.

Node A has no copies of any of these messages because they have been delayed in the IGMP querier, so it does no retransmission.  It does however transmit its 4 pending messages (8,9,10,11) and forward the token to B with seq=12, retransmit=2,3,4,5,6,7.  B has a copy of 0-5 (it originated them), so it retransmits the requested messages 2-5.  By this time, A and C have likely already received these messages from the IGMP querier, so the extra retransmissions are pointless and wasted.
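The walk-through above can be replayed with a small simulation. This is only a sketch of the original trigger as described in this comment, not the openais code: the function name `process_token` and the sets of held messages are hypothetical, and it assumes that A has received only 0,1 when the token reaches it and that B holds only its own messages 0-5.

```python
def process_token(token_seq, rtr_list, held):
    """Original trigger: handle one token visit at a node.

    token_seq -- next sequence number carried by the token
    rtr_list  -- retransmit requests already on the token
    held      -- sequence numbers this node currently has copies of
    Returns (retransmitted, new_rtr_list).
    """
    # Retransmit every requested message this node has a copy of.
    retransmitted = sorted(s for s in rtr_list if s in held)
    remaining = set(rtr_list) - set(retransmitted)
    # Anything below token_seq that has not arrived is treated as lost,
    # even if it is only delayed in the switch.
    missing = {s for s in range(token_seq) if s not in held}
    return retransmitted, sorted(remaining | missing)


# Node C: token seq=6, only messages 0,1 have arrived so far.
c_sent, c_rtr = process_token(6, [], {0, 1})
# Node A: token seq=8 (C sent 6,7), A has also only seen 0,1.
a_sent, a_rtr = process_token(8, c_rtr, {0, 1})
# Node B: token seq=12 (A sent 8-11), B originated 0-5.
b_sent, _ = process_token(12, a_rtr, {0, 1, 2, 3, 4, 5})

print(c_rtr)   # requests added by C
print(a_rtr)   # requests after A
print(b_sent)  # messages B retransmits
```

Under these assumptions B ends up resending 2-5 even though nothing was lost, which is the wasted bandwidth this bug describes.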

The patch in this bugzilla works by keeping a count of the number of times a message is missing.  If a message is missing 5 times (meaning the token rotated around the ring 5 times), it is safe to assume the IGMP querier or one of the target ports dropped the message because of overload, and the message is then added to the retransmission list.

The trigger for adding to the retransmit list in original totem is token receive + messages missing (in this case, messages sitting inside the IGMP querier are considered missing, even though they are just delayed).

The new trigger post patch is to add a message to the retransmit list only when it appears missing from a local node for 5 token rotations, which is plenty of time for those delayed multicast packets to be delivered to the port.
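A minimal sketch of the post-patch trigger, assuming a per-message miss counter that is checked on each token receipt. The names `MissTracker` and `MISS_THRESHOLD` are hypothetical and are not taken from the attached patch; the threshold of 5 comes from the description above. For simplicity the sketch rescans every sequence number below the token's seq on each visit.

```python
from collections import defaultdict

MISS_THRESHOLD = 5  # token rotations a message may be missing (hypothetical name)


class MissTracker:
    """Counts how many token visits each message has been missing for."""

    def __init__(self, threshold=MISS_THRESHOLD):
        self.threshold = threshold
        self.miss_counts = defaultdict(int)

    def on_token(self, token_seq, received):
        """Return the sequence numbers to add to the retransmit list."""
        to_request = []
        for seq in range(token_seq):
            if seq in received:
                # Message arrived (possibly late): forget its miss count.
                self.miss_counts.pop(seq, None)
                continue
            self.miss_counts[seq] += 1
            if self.miss_counts[seq] >= self.threshold:
                to_request.append(seq)
        return to_request


tracker = MissTracker()
# First token visit: 2-5 are still sitting in the IGMP querier.
first_visit = tracker.on_token(6, {0, 1})
# Later visits: 2, 4 and 5 have arrived, but 3 really was dropped.
later_visits = [tracker.on_token(6, {0, 1, 2, 4, 5}) for _ in range(4)]
print(first_visit, later_visits)
```

In this sketch the delayed messages never reach the retransmit list, while the message that is still missing on the fifth token visit is requested at that point.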

Comment 18 Douglas Silas 2011-01-11 23:11:39 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
OpenAIS has been enabled to work in network environments wherein multicast messages are slightly delayed when compared to token messages.

Comment 20 errata-xmlrpc 2011-01-13 23:57:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0100.html

