Bug 619496

Summary: make corosync more resilient to delayed multicast packets
Product: Red Hat Enterprise Linux 6
Reporter: Steven Dake <sdake>
Component: corosync
Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 6.0
CC: cluster-maint, jkortus, jwest, ssaha, uwe.knop
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Fixed In Version: corosync-1.2.3-22.el6
Doc Type: Bug Fix
Doc Text:
OpenAIS has been enabled to work in network environments wherein multicast messages are slightly delayed when compared to token messages.
Clone Of:
: 619536
Last Closed: 2011-05-19 14:24:04 UTC
Bug Blocks: 619536, 638592    
Attachments:
patch that introduces the tuneable (flags: none)

Description Steven Dake 2010-07-29 16:54:11 UTC
Description of problem:
Many network switches "emulate multicast" in software: the sender transmits the multicast to the switch, and the switch then forwards it to every member of the IGMP group.  This path gives the multicast extra latency compared to the unicast token (I've measured about 200 usec).  When a processor receives a token, it adds every message it has not yet received to a retransmit list.  These retransmits consume extra network bandwidth even though the regular multicast message was never lost, only delayed.
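A minimal POSIX shell sketch of the retransmit-list behavior described above. It uses a simplified model that is an assumption, not corosync's actual code: the token carries the highest sequence number multicast so far, and on token receipt a processor requests every sequence number up to that value that it has not received. The function name `rtr_list` is hypothetical.

```sh
#!/bin/sh
# Hypothetical, simplified model of building a retransmit list on token receipt.
# rtr_list TOKEN_SEQ "RECEIVED_SEQS"
# Prints the sequence numbers (1..TOKEN_SEQ) that are missing from RECEIVED_SEQS.
# For simplicity this scans from 1; a real implementation would presumably
# start after the highest message already received in order.
rtr_list() {
    token_seq=$1
    received=" $2 "            # pad with spaces so whole numbers match
    missing=""
    seq=1
    while [ "$seq" -le "$token_seq" ]; do
        case $received in
            *" $seq "*) ;;                   # already received
            *) missing="$missing $seq" ;;    # not received yet: request it
        esac
        seq=$((seq + 1))
    done
    echo "${missing# }"        # drop the leading space
}

# Message 5 was multicast before the token was passed on, but the switch
# delays the multicast, so it has not arrived when the token does.
rtr_list 5 "1 2 3 4"      # prints "5": an unnecessary retransmit request
# Once message 5 arrives, nothing is missing.
rtr_list 5 "1 2 3 4 5"    # prints an empty line
```

In this model a delay of even a fraction of a token rotation is indistinguishable from a loss at the moment the token arrives, which is why the retransmits appear.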

Version-Release number of selected component (if applicable):
corosync-1.2.3-17.el6

How reproducible:
Seems to be 100% reproducible using the Cisco infrastructure in the RH IT labs.

Steps to Reproduce:
1. Start a two-node corosync cluster with totem configured to output debug information.
2. Run cpgbench.
3. Observe the retransmits in the debug output.

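We can tell that the multicast is delayed, rather than lost, by adding a small delay before transmitting the token: this gives the multicast time to arrive, so if the retransmits stop, the messages were only late.  Another way to delay the token is traffic shaping with netem, as follows: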
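# attach a prio qdisc (three bands) at the root of eth0
tc qdisc add dev eth0 root handle 1: prio
# add a 1 ms netem delay to the third band (class 1:3)
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1ms
# send IPv4 packets destined for 10.16.144.40 to the delayed band
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 10.16.144.40/32 flowid 1:3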

(Note: 10.16.144.40 is the node that receives the next token.)
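The tc commands can be wrapped in a small reusable script so the delay is easy to apply to a different destination and to remove afterwards. This is a sketch; the function names `run`, `delay_on`, `delay_off` and the `DRY_RUN` variable are hypothetical. Running it for real requires root and the netem qdisc.

```sh
#!/bin/sh
# Hypothetical helper around the tc commands above.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# delay_on DEV DST DELAY: delay IPv4 packets destined for DST by DELAY.
delay_on() {
    dev=$1 dst=$2 delay=$3
    # prio qdisc with its default three bands at the root
    run tc qdisc add dev "$dev" root handle 1: prio
    # netem on band 3 (class 1:3) adds the delay
    run tc qdisc add dev "$dev" parent 1:3 handle 30: netem delay "$delay"
    # steer packets for DST into band 3; unmatched packets fall back to
    # the prio qdisc's default priomap classification
    run tc filter add dev "$dev" protocol ip parent 1:0 prio 3 \
        u32 match ip dst "$dst/32" flowid 1:3
}

# delay_off DEV: deleting the root qdisc also removes its children and filters.
delay_off() {
    run tc qdisc del dev "$1" root
}
```

For example, `DRY_RUN=1 delay_on eth0 10.16.144.40 1ms` prints commands equivalent to the ones above. As described in Comment 1, passing the cluster's multicast group address instead of the token target delays multicast rather than the token. `delay_off eth0` restores normal traffic.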
  
Actual results:
When multicast is delayed, totem retransmits messages unnecessarily.

Expected results:
No messages should be retransmitted unnecessarily.

Additional info:

Comment 1 Steven Dake 2010-07-29 16:55:07 UTC
For those who don't see this problem with their switches, it can be emulated with netem by changing the IP address in the filter above to the multicast address, which introduces a 1 ms multicast transmit delay.

Comment 2 Steven Dake 2010-07-29 18:06:42 UTC
Created attachment 435364 [details]
patch that introduces the tuneable
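The attachment title only says that the patch introduces a tuneable. One way such a tuneable could suppress unnecessary retransmits is to request a retransmit only after a message has been found missing on several token receipts. The sketch below models a single message under that assumption; `first_request` and its THRESHOLD argument are hypothetical and are not taken from the patch. Later corosync.conf(5) man pages describe a `miss_count_const` option in similar terms; treating it as this patch's tuneable is an assumption.

```sh
#!/bin/sh
# Hypothetical model: a message is missing at token receipts 1..ARRIVES_AFTER
# and present afterwards. A retransmit is requested once the message has been
# seen missing THRESHOLD times. THRESHOLD=1 models requesting on the first miss.
# first_request THRESHOLD ARRIVES_AFTER
# Prints the token receipt that triggers the request, or "none".
first_request() {
    threshold=$1 arrives_after=$2
    misses=0
    token=1
    while [ "$token" -le "$arrives_after" ]; do
        misses=$((misses + 1))               # still missing at this token
        if [ "$misses" -ge "$threshold" ]; then
            echo "$token"
            return
        fi
        token=$((token + 1))
    done
    echo none                                # arrived before the threshold
}

first_request 1 1      # prints "1": a late message is requested anyway
first_request 5 1      # prints "none": the late message is not requested
first_request 5 100    # prints "5": a lost message is still requested, later
```

The trade-off is visible in the last call: a higher threshold tolerates late multicast, but a genuinely lost message is requested only after more token rotations.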

Comment 5 Douglas Silas 2011-01-11 23:11:46 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
OpenAIS has been enabled to work in network environments wherein multicast messages are slightly delayed when compared to token messages.

Comment 8 errata-xmlrpc 2011-05-19 14:24:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html