The customer is testing a coordinator failover scenario. After they shut down the machine hosting the 1st coordinator, the 2nd coordinator was elected but was soon dropped from the cluster because of a massive GC pause. The 3rd coordinator was then elected and failed the same way, then the 4th, and so on. The customer is tuning GC parameters now, but is also looking for other improvements.

Our log analysis shows that the loop at line 220 of ClusterTopologyManagerImpl is very costly. Under their large cluster configuration, a single pass through this loop takes a very long time because it runs on a single thread.

core/src/main/java/org/infinispan/topology/ClusterTopologyManagerImpl.java
~~~
216 if ((isCoordinator && mergeView) || becameCoordinator) {
217    try {
218       Map<String, List<CacheTopology>> clusterCacheMap = recoverClusterStatus(newViewId);
219
220       for (Map.Entry<String, List<CacheTopology>> entry : clusterCacheMap.entrySet()) {
221          String cacheName = entry.getKey();
222          List<CacheTopology> topologyList = entry.getValue();
223          try {
224             updateCacheStatusAfterMerge(cacheName, transport.getMembers(), topologyList);
225          } catch (Exception e) {
226             log.failedToRecoverCacheState(cacheName, e);
227          }
228       }
229    } catch (InterruptedException e) {
230       log.tracef("Cluster state recovery interrupted because the coordinator is shutting down");
~~~

In addition, the object returned at line 218, clusterCacheMap, appears to be very large itself, causing excessive GC activity. We need a review by engineering.

Environment: JDG 6.3.1 cluster, 40 hosts with 4 instances each; 137 caches defined, each with 1600 segments.

Attachments:
clustered.xml
clustered.conf
host-ip-mapping.txt
failing-coordinators.txt (extracted log messages)
gc-and-server-logs.zip (log files from 2 servers)
  inst01/10.9.64.73 -- the 2nd coordinator
  inst01/10.9.64.143 -- the 3rd coordinator

Note that the 1st coordinator was killed at 2014-10-29 16:21:41.
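For context, a rough back-of-envelope estimate suggests why the recovered map is so heavy. This assumes recoverClusterStatus() collects one CacheTopology per running cache from every member (consistent with the GET_STATUS command seen in the traces below, but not confirmed):
~~~
// Numbers taken from the report above; the one-topology-per-cache-per-member
// assumption is ours, not confirmed against the implementation.
long caches    = 137;        // caches defined in clustered.xml
long members   = 40L * 4;    // 40 hosts x 4 instances = 160 nodes
long segments  = 1600;       // segments per cache
long numOwners = 2;          // from the DefaultConsistentHash trace below

long topologies = caches * members;                  // 21,920 CacheTopology objects
long ownerRefs  = topologies * segments * numOwners; // ~70 million owner references

System.out.printf("topologies=%d, ownerRefs=%d%n", topologies, ownerRefs);
~~~
Tens of millions of live references collected into one map on a single node would plausibly explain the GC pressure that keeps knocking each new coordinator out of the cluster.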
Created attachment 956913 [details] clustered.conf
William Burns <wburns> updated the status of jira ISPN-4979 to Coding In Progress
We obtained trace logs from several coordinators. The interval from "Attempting to ..." to "Initializing rebalance ..." took about 36 minutes and about 3 minutes, respectively:
~~~
2014-10-09 16:23:51.505 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Received new cluster view: 14, isCoordinator = true, becameCoordinator = true
2014-10-09 16:23:51.505 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Recovering running caches in the cluster
2014-10-09 16:23:51.507 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0, current
2014-10-09 17:00:12.834 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=24, currentCH=DefaultConsistentHash{numSegments=1600, numOwners=2, mem
...
~~~
~~~
2014-11-11 15:29:21.211 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Received new cluster view: 43, isCoordinator = true, becameCoordinator = true
2014-11-11 15:29:21.211 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Recovering running caches in the cluster
2014-11-11 15:29:21.213 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0,
2014-11-11 15:32:36.881 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=43, currentCH=DefaultConsistentHash{numSegments=1600, numOwner
...
~~~
Another impact is that the cluster member list gets dumped into the log file. For example, just 281 lines of this category of log message amounted to 135 MB. The customer is already trying to reduce this output.
In addition to reducing the size of the returned object, multi-threading the loop would be helpful. Is this possible?
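A minimal sketch of what multi-threading the loop at line 220 could look like (Java 7 style, matching JDG 6.3.x), assuming updateCacheStatusAfterMerge() can safely run concurrently for different caches; that assumption, the pool size, and the shutdown handling would all need review by engineering:
~~~
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

// Hypothetical replacement for the sequential per-cache loop. Whether
// updateCacheStatusAfterMerge() is thread-safe across caches is an
// unverified assumption.
int threads = Math.max(1, Math.min(clusterCacheMap.size(),
      Runtime.getRuntime().availableProcessors()));
ExecutorService pool = Executors.newFixedThreadPool(threads);
try {
   List<Future<Void>> futures = new ArrayList<Future<Void>>();
   for (Map.Entry<String, List<CacheTopology>> entry : clusterCacheMap.entrySet()) {
      final String cacheName = entry.getKey();
      final List<CacheTopology> topologyList = entry.getValue();
      futures.add(pool.submit(new Callable<Void>() {
         @Override
         public Void call() {
            try {
               updateCacheStatusAfterMerge(cacheName, transport.getMembers(), topologyList);
            } catch (Exception e) {
               // same per-cache handling as the original loop
               log.failedToRecoverCacheState(cacheName, e);
            }
            return null;
         }
      }));
   }
   for (Future<Void> f : futures) {
      // InterruptedException propagates to the enclosing catch at line 229
      f.get();
   }
} catch (ExecutionException e) {
   // should not happen: call() swallows per-cache exceptions
   throw new IllegalStateException(e);
} finally {
   pool.shutdown();
}
~~~
This keeps the original behavior of logging and continuing on a per-cache failure, and still lets InterruptedException reach the existing handler when the coordinator shuts down.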
Dan Berindei <dberinde> updated the status of jira ISPN-5019 to Coding In Progress