Bug 1163573 - Coordinator promotion keeps failing successively, with massive GCs in a long loop
Summary: Coordinator promotion keeps failing successively, with massive GCs in a long loop
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ER7
Target Release: 6.4.0
Assignee: William Burns
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-11-13 03:31 UTC by Osamu Nagano
Modified: 2019-04-16 14:24 UTC
CC List: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments


Links
Red Hat Issue Tracker ISPN-4979 (Critical, Resolved): CacheStatusResponse map uses too much memory (last updated 2016-04-21 16:37:52 UTC)
Red Hat Issue Tracker ISPN-5019 (Critical, Resolved): After coordinator change, cache topologies should be installed in parallel (last updated 2016-04-21 16:37:53 UTC)

Description Osamu Nagano 2014-11-13 03:31:30 UTC
The customer is testing a coordinator failover scenario.  After the machine hosting the 1st coordinator was shut down, the 2nd coordinator was elected but was soon dropped from the cluster because of a massive GC.  The 3rd coordinator was then elected and failed the same way, then the 4th, and so on.
They are now tuning GC parameters and are also looking for other improvements.  Our log analysis shows that the loop at line 220 of ClusterTopologyManagerImpl is very costly.  Under their large cluster configuration, this loop can run for a very long time in a single thread.

core/src/main/java/org/infinispan/topology/ClusterTopologyManagerImpl.java
    216          if ((isCoordinator && mergeView) || becameCoordinator) {
    217             try {
    218                Map<String, List<CacheTopology>> clusterCacheMap = recoverClusterStatus(newViewId);
    219 
    220                for (Map.Entry<String, List<CacheTopology>> entry : clusterCacheMap.entrySet()) {
    221                   String cacheName = entry.getKey();
    222                   List<CacheTopology> topologyList = entry.getValue();
    223                   try {
    224                      updateCacheStatusAfterMerge(cacheName, transport.getMembers(), topologyList);
    225                   } catch (Exception e) {
    226                      log.failedToRecoverCacheState(cacheName, e);
    227                   }
    228                }
    229             } catch (InterruptedException e) {
    230                log.tracef("Cluster state recovery interrupted because the coordinator is shutting down");

Also, the object returned at line 218, clusterCacheMap, appears to be very large in itself, causing excessive GCs.  We need a review by engineering.
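
For a sense of scale, here is a back-of-envelope sketch (our own assumption, not a measurement: every one of the 160 instances reports one CacheTopology per cache in its GET_STATUS response, and each topology's consistent hash keeps a segment-to-owners table of numSegments x numOwners address references):

~~~
// Rough estimate of the object graph referenced by clusterCacheMap under the
// reported configuration.  The "one topology per instance per cache" count is
// an assumption; only the order of magnitude matters here.
public class ClusterCacheMapEstimate {
    public static void main(String[] args) {
        long instances = 40L * 4;  // 40 hosts x 4 instances
        long caches    = 137;      // defined caches
        long segments  = 1600;     // segments per cache
        long owners    = 2;        // numOwners, as seen in the trace logs
        long ownerRefs = instances * caches * segments * owners;
        System.out.printf("~%,d segment-owner references held at once%n", ownerRefs); // ~70 million
    }
}
~~~

Even at a few dozen bytes per reference once the topology, hash and address objects are counted, tens of millions of live references would plausibly explain the massive GCs on the new coordinator.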


Environment:
JDG 6.3.1 cluster, 40 hosts with 4 instances each.
137 caches defined, each with 1600 segments.

Attachments:
clustered.xml
clustered.conf
host-ip-mapping.txt
failing-coordinators.txt (extracted log messages)
gc-and-server-logs.zip (log files from 2 servers)
    inst01/10.9.64.73  -- the 2nd coordinator
    inst01/10.9.64.143 -- the 3rd coordinator
    Note that the 1st coordinator was killed at 2014-10-29 16:21:41.

Comment 1 Osamu Nagano 2014-11-13 03:32:19 UTC
Created attachment 956913 [details]
clustered.conf

Comment 7 JBoss JIRA Server 2014-11-17 15:52:31 UTC
William Burns <wburns> updated the status of jira ISPN-4979 to Coding In Progress

Comment 8 Osamu Nagano 2014-11-20 06:08:18 UTC
We got trace log files from several coordinators.  From "Attempting to ..." to "Initializing rebalance ...", about 36 minutes and 3 minutes elapsed, respectively.

~~~
2014-10-09 16:23:51.505 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Received new cluster view: 14, isCoordinator = true, becameCoordinator = true
2014-10-09 16:23:51.505 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Recovering running caches in the cluster
2014-10-09 16:23:51.507 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0, current
2014-10-09 17:00:12.834 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=24, currentCH=DefaultConsistentHash{numSegments=1600, numOwners=2, mem
...
~~~

~~~
2014-11-11 15:29:21.211 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Received new cluster view: 43, isCoordinator = true, becameCoordinator = true
2014-11-11 15:29:21.211 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Recovering running caches in the cluster
2014-11-11 15:29:21.213 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0,
2014-11-11 15:32:36.881 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=43, currentCH=DefaultConsistentHash{numSegments=1600, numOwner
...
~~~
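
For reference, the two gaps quoted above can be recomputed from the timestamps in the excerpts (a small sketch; the timestamps are copied verbatim from the log lines):

~~~
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CoordinatorRecoveryGap {
    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
        // first excerpt: GET_STATUS on self -> first "Initializing rebalance policy"
        Duration first = Duration.between(
                LocalDateTime.parse("2014-10-09 16:23:51.507", f),
                LocalDateTime.parse("2014-10-09 17:00:12.834", f));
        // second excerpt: same two log lines
        Duration second = Duration.between(
                LocalDateTime.parse("2014-11-11 15:29:21.213", f),
                LocalDateTime.parse("2014-11-11 15:32:36.881", f));
        System.out.println("1st excerpt: " + first);  // ~36 minutes
        System.out.println("2nd excerpt: " + second); // ~3 minutes
    }
}
~~~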

Another impact is that the cluster member list is dumped into the log file.  For example, just 281 lines in this log category amounted to 135 MB.  The customer is already trying to reduce this output.

Comment 9 Osamu Nagano 2014-11-26 00:26:52 UTC
In addition to reducing the size of the returned object, multi-threading the loop would be helpful.  Is this possible?
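
For illustration only, a sketch of what multi-threading that loop might look like (the thread pool, the Callable wrapping, and the waiting and error handling are our assumptions, not the actual fix; updateCacheStatusAfterMerge, transport, log, and clusterCacheMap are the existing members shown in the description, and java.util plus java.util.concurrent imports are required):

~~~
// Sketch only: dispatch the per-cache recovery to a pool instead of running it
// sequentially in the view-handling thread (would replace the for-loop at
// lines 220-228 quoted in the description).
ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
Map<String, Future<Void>> tasks = new LinkedHashMap<>();
for (Map.Entry<String, List<CacheTopology>> entry : clusterCacheMap.entrySet()) {
   String cacheName = entry.getKey();
   List<CacheTopology> topologyList = entry.getValue();
   tasks.put(cacheName, pool.submit((Callable<Void>) () -> {
      updateCacheStatusAfterMerge(cacheName, transport.getMembers(), topologyList);
      return null;
   }));
}
for (Map.Entry<String, Future<Void>> task : tasks.entrySet()) {
   try {
      task.getValue().get();               // wait for every cache before recovery is considered done
   } catch (InterruptedException e) {
      Thread.currentThread().interrupt();  // coordinator is shutting down; stop waiting
      break;
   } catch (ExecutionException e) {
      log.failedToRecoverCacheState(task.getKey(), e);
   }
}
pool.shutdown();
~~~

Whether the time is dominated by CPU work inside updateCacheStatusAfterMerge or by the sheer size of the recovered state will determine how much this helps, so please treat it only as a starting point for discussion.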

Comment 11 JBoss JIRA Server 2015-03-18 10:11:48 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-5019 to Coding In Progress

