+++ This bug was initially created as a clone of Bug #619496 +++

Description of problem:
Many network switches use a software component to "emulate multicast": a node sends a multicast to the switch, and the switch then sends it to every member of the IGMP group. This multicast has extra latency compared to the unicast token (I've measured about 200 usec). When a processor receives a token, it adds all unreceived messages to a retransmit list. These retransmits consume extra network bandwidth when in fact the regular multicast message is not lost, just delayed.

Version-Release number of selected component (if applicable):
corosync-1.2.3-17.e6

How reproducible:
Seems 100% using Cisco infrastructure in RH IT labs

Steps to Reproduce:
1. Start a two-node corosync cluster with totem configured to output debug info
2. Run cpgbench
3. See retransmits occur

We can tell multicast is delayed by adding a small delay before transmitting the token. Another mechanism is to use netem traffic shaping as follows to delay the token:

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1ms
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 10.16.144.40/32 flowid 1:3

(note 10.16.144.40 is the target of the next token).

Actual results:
When multicast is delayed, totem retransmits messages unnecessarily

Expected results:
No messages should be retransmitted unnecessarily

Additional info:

--- Additional comment from sdake on 2010-07-29 12:55:07 EDT ---

For those that don't see this problem in their switches, it is possible to emulate it via netem by changing the IP address above to the multicast address (hence introducing a 1ms multicast transmit delay).

--- Additional comment from sdake on 2010-07-29 14:06:42 EDT ---

Created an attachment (id=435364)
patch that introduces the tuneable
Created attachment 435908
whitetank patch to address problem
Tested with cpgbench on a two-node cluster via scratch build:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2644470

Retransmits no longer occur with the patch.
Scott had asked for more detail on the 200usec delay issue which is triggering multicast retransmits.

Corosync implements the Totem Single Ring Protocol:
http://www.cs.jhu.edu/~yairamir/tocs.ps

When Totem was first designed, switches were "dumb". They essentially treated multicast as broadcast: the switch hardware copied any multicast IP packet to every port on the switch. Modern switches use a software component called an "IGMP Querier" to first receive a copy of a multicast packet, and then copy it to every port that is within the IGMP group. This "two-hop" multicast apparently has a 200usec delay compared to a "one-hop" UDP transmit. As the Totem authors couldn't predict that modern switching hardware would evolve in this way, there is a gap in the specification.

Now a little detail about how totem works. A membership protocol composes a logical ring based upon the membership determined by a membership algorithm. In this example, there are 3 nodes, A, B, and C. A has 4 messages to transmit, B has 6 messages to transmit, C has 2 messages to transmit. A token acts as a global time stamp to provide total order messaging and recover lost messages.

Processor A originates a token and sends it to B with seq=0. B has 6 messages to transmit, so it multicasts these messages (0-5) then forwards a token to C with seq=6. C receives the unicast UDP token and the multicast messages with sequence numbers 0,1. Messages 2-5 are delayed (sitting in the IGMP querier component). In early switch designs, there may have been some reordering but there was never delay compared to the unicast token. Delay is permitted by UDP/IP, however it did not happen in early switch designs, so the Totem designers did not handle that scenario.

Continuing the example, when a token is received, totem flushes the multicast sockets (to receive any multicast messages that arrived prior to the token). Then node C processes the token by adding the messages it thinks are missing (2-5) to the retransmit list.
Node C delivers 0, 1 to the application. Node C then transmits its two pending messages (6,7) and forwards the token to node A with seq=8, retransmit=2,3,4,5. Node A has no copies of any of these messages because they have been delayed in the IGMP querier, so it does no retransmission. It does however transmit its 4 pending messages (8,9,10,11) and forward the token to B with seq=12, retransmit=2,3,4,5,6,7. B has a copy of 0-5 (it originated them), so it retransmits the requested messages 2-5. By this time, A and C have likely already received these messages from the IGMP querier, so the extra retransmissions are pointless and wasted.

The patch in this bugzilla works by keeping a count of the number of times a message is missing. If a message is missing 5 times (meaning the token rotated around the ring 5 times), it is safe to assume the IGMP querier or one of the target ports dropped the message because of overload, and the message is then added to the retransmission list.

The trigger for adding to the retransmit list in original totem is token receive + messages missing (in this case, messages sitting inside the IGMP querier are considered missing, even though they are just delayed). The new trigger post patch is to add a message to the retransmit list only when it appears missing on the local node for 5 token rotations, which is plenty of time for those delayed multicast packets to be delivered to the port.
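The difference between the two triggers can be sketched as a toy model. This is only an illustration of the rule described in this comment, not the code from the attached patch: the class `Node`, the method names, and the `miss_threshold` parameter are all hypothetical, and the token's seq is treated as the next unused sequence number, as in the example (seq=6 after messages 0-5).

```python
from collections import defaultdict


class Node:
    """Toy model of one ring member deciding what to request for retransmit.

    Hypothetical sketch: miss_threshold=1 mimics the original trigger
    (request on the first token that shows a gap), miss_threshold=5
    mimics the patched trigger described above.
    """

    def __init__(self, miss_threshold=1):
        self.received = set()               # sequence numbers seen so far
        self.miss_count = defaultdict(int)  # token visits on which seq was missing
        self.miss_threshold = miss_threshold

    def receive(self, seq):
        """Record a multicast message, possibly arriving late."""
        self.received.add(seq)
        self.miss_count.pop(seq, None)

    def on_token(self, token_seq):
        """Return the sequence numbers to add to the retransmit list."""
        requests = []
        for seq in range(token_seq):
            if seq in self.received:
                continue
            self.miss_count[seq] += 1
            if self.miss_count[seq] >= self.miss_threshold:
                requests.append(seq)
        return requests


# Node C from the example: 0 and 1 arrived, 2-5 are still in the querier.
original_c = Node(miss_threshold=1)
patched_c = Node(miss_threshold=5)
for seq in (0, 1):
    original_c.receive(seq)
    patched_c.receive(seq)

print(original_c.on_token(6))  # [2, 3, 4, 5]
print(patched_c.on_token(6))   # []
```

With the higher threshold, a message that is merely delayed is received before its counter reaches 5 and is never requested, while a message that is really lost is still requested once the token has found it missing 5 times.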
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
OpenAIS has been enabled to work in network environments wherein multicast messages are slightly delayed when compared to token messages.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0100.html