The customer is testing a coordinator failover scenario. After they shut down the machine hosting the 1st coordinator, the 2nd coordinator was elected but was soon dropped from the cluster because of a massive GC pause. The 3rd coordinator was then elected and failed the same way, then the 4th, and so on. The customer is tuning GC parameters now, but is also looking for other improvements.

Our log analysis shows that the loop at line 220 of ClusterTopologyManagerImpl is very costly. Under their large cluster configuration, a single pass through this loop takes a very long time because it runs on a single thread.

core/src/main/java/org/infinispan/topology/ClusterTopologyManagerImpl.java
~~~
216 if ((isCoordinator && mergeView) || becameCoordinator) {
217    try {
218       Map<String, List<CacheTopology>> clusterCacheMap = recoverClusterStatus(newViewId);
219
220       for (Map.Entry<String, List<CacheTopology>> entry : clusterCacheMap.entrySet()) {
221          String cacheName = entry.getKey();
222          List<CacheTopology> topologyList = entry.getValue();
223          try {
224             updateCacheStatusAfterMerge(cacheName, transport.getMembers(), topologyList);
225          } catch (Exception e) {
226             log.failedToRecoverCacheState(cacheName, e);
227          }
228       }
229    } catch (InterruptedException e) {
230       log.tracef("Cluster state recovery interrupted because the coordinator is shutting down");
~~~

In addition, the object returned at line 218, clusterCacheMap, appears to be very large itself, causing excessive GC activity. We need a review by engineering.

Environment: JDG 6.3.1 cluster, 40 hosts with 4 instances each; 137 caches defined, each with 1600 segments.

Attachments:
clustered.xml
clustered.conf
host-ip-mapping.txt
failing-coordinators.txt (extracted log messages)
gc-and-server-logs.zip (log files from 2 servers)
  inst01/10.9.64.73 -- the 2nd coordinator
  inst01/10.9.64.143 -- the 3rd coordinator

Note that the 1st coordinator was killed at 2014-10-29 16:21:41.
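For context, a rough back-of-envelope estimate suggests why the recovered map is so heavy. This assumes recoverClusterStatus() collects one CacheTopology per running cache from every member (consistent with the GET_STATUS command seen in the traces below, but not confirmed):
~~~
// Numbers taken from the report above; the one-topology-per-cache-per-member
// assumption is ours, not confirmed against the implementation.
long caches    = 137;        // caches defined in clustered.xml
long members   = 40L * 4;    // 40 hosts x 4 instances = 160 nodes
long segments  = 1600;       // segments per cache
long numOwners = 2;          // from the DefaultConsistentHash trace below

long topologies = caches * members;                  // 21,920 CacheTopology objects
long ownerRefs  = topologies * segments * numOwners; // ~70 million owner references

System.out.printf("topologies=%d, ownerRefs=%d%n", topologies, ownerRefs);
~~~
Tens of millions of live references collected into one map on a single node would plausibly explain the GC pressure that keeps knocking each new coordinator out of the cluster.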
Created attachment 956913 [details] clustered.conf
William Burns <wburns> updated the status of jira ISPN-4979 to Coding In Progress
We obtained trace logs from several coordinators. The interval from "Attempting to ..." to "Initializing rebalance ..." took about 36 minutes and about 3 minutes, respectively:
~~~
2014-10-09 16:23:51.505 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Received new cluster view: 14, isCoordinator = true, becameCoordinator = true
2014-10-09 16:23:51.505 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Recovering running caches in the cluster
2014-10-09 16:23:51.507 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0, current
2014-10-09 17:00:12.834 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=24, currentCH=DefaultConsistentHash{numSegments=1600, numOwners=2, mem
...
~~~
~~~
2014-11-11 15:29:21.211 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Received new cluster view: 43, isCoordinator = true, becameCoordinator = true
2014-11-11 15:29:21.211 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Recovering running caches in the cluster
2014-11-11 15:29:21.213 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0,
2014-11-11 15:32:36.881 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=43, currentCH=DefaultConsistentHash{numSegments=1600, numOwner
...
~~~
Another impact is that the cluster member list gets dumped into the log file. For example, just 281 lines of this category of log message amounted to 135 MB. The customer is already trying to reduce this output.
In addition to reducing the size of the returned object, multi-threading the loop would be helpful. Is this possible?
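A minimal sketch of what multi-threading the loop at line 220 could look like (Java 7 style, matching JDG 6.3.x), assuming updateCacheStatusAfterMerge() can safely run concurrently for different caches; that assumption, the pool size, and the shutdown handling would all need review by engineering:
~~~
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

// Hypothetical replacement for the sequential per-cache loop. Whether
// updateCacheStatusAfterMerge() is thread-safe across caches is an
// unverified assumption.
int threads = Math.max(1, Math.min(clusterCacheMap.size(),
      Runtime.getRuntime().availableProcessors()));
ExecutorService pool = Executors.newFixedThreadPool(threads);
try {
   List<Future<Void>> futures = new ArrayList<Future<Void>>();
   for (Map.Entry<String, List<CacheTopology>> entry : clusterCacheMap.entrySet()) {
      final String cacheName = entry.getKey();
      final List<CacheTopology> topologyList = entry.getValue();
      futures.add(pool.submit(new Callable<Void>() {
         @Override
         public Void call() {
            try {
               updateCacheStatusAfterMerge(cacheName, transport.getMembers(), topologyList);
            } catch (Exception e) {
               // same per-cache handling as the original loop
               log.failedToRecoverCacheState(cacheName, e);
            }
            return null;
         }
      }));
   }
   for (Future<Void> f : futures) {
      // InterruptedException propagates to the enclosing catch at line 229
      f.get();
   }
} catch (ExecutionException e) {
   // should not happen: call() swallows per-cache exceptions
   throw new IllegalStateException(e);
} finally {
   pool.shutdown();
}
~~~
This keeps the original behavior of logging and continuing on a per-cache failure, and still lets InterruptedException reach the existing handler when the coordinator shuts down.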
Dan Berindei <dberinde> updated the status of jira ISPN-5019 to Coding In Progress