Red Hat Bugzilla – Bug 1281218
More redundant initial join logic to avoid becoming a fake coordinator
Last modified: 2015-12-10 21:46:45 EST
If the very first JGroups discovery packet is lost, the current GMS join logic never recovers from it. The node becomes a standalone coordinator and only merges with the cluster after several minutes.
This can happen if a new node resides in another network segment and a switch between the segments needs some time to establish a new multicast route. Currently, there is not enough time between the IGMP join (by MulticastSocket#joinGroup()) and the JGroups discovery packet, so the latter is lost in such a network environment. Because the number of nodes can be very large, configuring a static route in the switch is not practical.
Specifically, in the method org.jgroups.protocols.pbcast.ClientGmsImpl#joinInternal(), part of the gms.getDownProtocol().down(Event.FIND_INITIAL_MBRS_EVT) call is outside the retry loop governed by GMS.max_join_attempts and GMS.join_timeout.
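The effect of where the discovery call sits can be illustrated with a small standalone model. This is only a sketch and not the actual JGroups code: the class JoinRetrySketch, the Discovery interface, and LossyDiscovery are hypothetical names invented for illustration, and the join_timeout wait between attempts is left out. It assumes a network that drops the first discovery packet and compares discovery done once before the retry loop with discovery repeated on every attempt.

```java
import java.util.Collections;
import java.util.List;

public class JoinRetrySketch {

    /** Hypothetical stand-in for one discovery round (not a JGroups API). */
    interface Discovery {
        List<String> findInitialMembers();
    }

    /** Simulates a network that drops the first N discovery packets. */
    static final class LossyDiscovery implements Discovery {
        private int remainingLosses;
        int calls = 0;

        LossyDiscovery(int lostPackets) {
            this.remainingLosses = lostPackets;
        }

        @Override
        public List<String> findInitialMembers() {
            calls++;
            if (remainingLosses > 0) {
                remainingLosses--;
                return Collections.emptyList(); // packet lost, nobody answered
            }
            return Collections.singletonList("A"); // existing coordinator answered
        }
    }

    /** Discovery happens once, before the loop: a lost packet is never resent. */
    static String joinDiscoveryOutsideLoop(Discovery d, int maxJoinAttempts) {
        List<String> members = d.findInitialMembers();
        for (int attempt = 0; attempt < maxJoinAttempts; attempt++) {
            // Every attempt reuses the same (possibly empty) discovery result.
            if (!members.isEmpty()) {
                return "joined " + members.get(0);
            }
        }
        return "standalone coordinator";
    }

    /** Discovery is repeated on every attempt, so a lost packet is retried. */
    static String joinDiscoveryInsideLoop(Discovery d, int maxJoinAttempts) {
        for (int attempt = 0; attempt < maxJoinAttempts; attempt++) {
            List<String> members = d.findInitialMembers();
            if (!members.isEmpty()) {
                return "joined " + members.get(0);
            }
        }
        return "standalone coordinator";
    }

    public static void main(String[] args) {
        // Both runs lose exactly the first discovery packet.
        System.out.println(joinDiscoveryOutsideLoop(new LossyDiscovery(1), 3));
        System.out.println(joinDiscoveryInsideLoop(new LossyDiscovery(1), 3));
    }
}
```

In this model the first variant ends up as a standalone coordinator after a single lost packet, while the second one joins on its second attempt. A real fix would also have to wait between attempts (for example by join_timeout) so that the switch has time to establish the multicast route.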