Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1163573

Summary: Coordinator promotion keeps failing repeatedly, with massive GCs in a long loop
Product: [JBoss] JBoss Data Grid 6
Reporter: Osamu Nagano <onagano>
Component: Infinispan
Assignee: William Burns <wburns>
Status: CLOSED CURRENTRELEASE
QA Contact: Martin Gencur <mgencur>
Severity: high
Priority: high
Version: 6.3.1
CC: dberinde, dmehra, dstahl, jdg-bugs, rmacor
Target Milestone: ER7
Target Release: 6.4.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Previously in Red Hat JBoss Data Grid, for a large cluster with many caches, the CacheStatusResponse map on the new coordinator consumed an unexpectedly large amount of memory. As a result, JVM garbage collection came under extra load whenever a new coordinator was selected for the cluster. This issue is resolved in JBoss Data Grid 6.4 by allowing various instances to be cached in the serialization context, so that consistent hashes and repeated cache join information are reused instead of duplicated (a minimal sketch of this idea follows the header below).
Story Points: ---
Type: Bug
Regression: ---
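
The Doc Text above describes the fix as caching instances in the serialization context so that equal values are reused. A minimal sketch of that interning idea, assuming a simple per-context lookup map; the name CanonicalCache is hypothetical and is not the actual Infinispan API:

~~~
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration only; not the actual Infinispan serialization context.
public final class CanonicalCache<T> {
    private final ConcurrentMap<T, T> seen = new ConcurrentHashMap<>();

    // Returns the first instance equal to 'value'. When many members report
    // the same consistent hash or cache join info, they all end up sharing
    // one object instead of keeping one copy per response.
    public T canonicalize(T value) {
        T prior = seen.putIfAbsent(value, value);
        return prior != null ? prior : value;
    }
}
~~~

Deserializing each status response through such a cache keeps one instance per distinct value instead of one per responding node.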

Description Osamu Nagano 2014-11-13 03:31:30 UTC
The customer is testing a coordinator failover scenario.  After shutting down the machine hosting the 1st coordinator, the 2nd coordinator was chosen, but it was soon dropped from the cluster because of massive GC activity.  The 3rd coordinator was then chosen and failed the same way, then the 4th, and so on.
They are now tuning GC parameters and also looking for other improvements.  From our log analysis we found that the loop at line 220 of ClusterTopologyManagerImpl is very costly.  Under their large cluster configuration this loop can run for a very long time on a single thread.

core/src/main/java/org/infinispan/topology/ClusterTopologyManagerImpl.java
    216          if ((isCoordinator && mergeView) || becameCoordinator) {
    217             try {
    218                Map<String, List<CacheTopology>> clusterCacheMap = recoverClusterStatus(newViewId);
    219 
    220                for (Map.Entry<String, List<CacheTopology>> entry : clusterCacheMap.entrySet()) {
    221                   String cacheName = entry.getKey();
    222                   List<CacheTopology> topologyList = entry.getValue();
    223                   try {
    224                      updateCacheStatusAfterMerge(cacheName, transport.getMembers(), topologyList);
    225                   } catch (Exception e) {
    226                      log.failedToRecoverCacheState(cacheName, e);
    227                   }
    228                }
    229             } catch (InterruptedException e) {
    230                log.tracef("Cluster state recovery interrupted because the coordinator is shutting down");

And the object returned at line 218, clusterCacheMap, appears to be very large itself, causing excessive GCs.  We need a review by engineering.


Environment:
JDG 6.3.1 cluster: 40 hosts, each running 4 instances (160 nodes in total).
137 caches defined, each with 1600 segments.
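
A rough back-of-envelope estimate based on these numbers suggests why clusterCacheMap is so heavy. The per-reference cost and the assumption that every member returns an unshared consistent hash per cache are illustrative, not measured Infinispan values:

~~~
// Back-of-envelope estimate only; per-entry costs are illustrative
// assumptions, not measured Infinispan numbers.
public class StatusSizeEstimate {
    public static void main(String[] args) {
        long members  = 40L * 4;  // 40 hosts x 4 instances = 160 nodes
        long caches   = 137;
        long segments = 1600;
        long owners   = 2;        // numOwners=2, as seen in the trace logs

        // One status response per (member, cache); each consistent hash maps
        // every segment to its owner list.
        long responses = members * caches;               // 21,920
        long ownerRefs = responses * segments * owners;  // ~70 million

        System.out.printf("status responses collected: %,d%n", responses);
        System.out.printf("segment owner references if nothing is shared: %,d%n", ownerRefs);
        System.out.printf("~%,d MB at 8 bytes per reference%n",
                ownerRefs * 8 / (1024 * 1024));
    }
}
~~~

Even at 8 bytes per reference this is on the order of half a gigabyte of pure pointers, before counting the objects themselves, which lines up with the massive GC activity observed on each new coordinator.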

Attachments:
clustered.xml
clustered.conf
host-ip-mapping.txt
failing-coordinators.txt (extracted log messages)
gc-and-server-logs.zip (log files from 2 servers)
    inst01/10.9.64.73  -- the 2nd coordinator
    inst01/10.9.64.143 -- the 3rd coordinator
    Note that the 1st coordinator was killed at 2014-10-29 16:21:41.

Comment 1 Osamu Nagano 2014-11-13 03:32:19 UTC
Created attachment 956913 [details]
clustered.conf

Comment 7 JBoss JIRA Server 2014-11-17 15:52:31 UTC
William Burns <wburns> updated the status of jira ISPN-4979 to Coding In Progress

Comment 8 Osamu Nagano 2014-11-20 06:08:18 UTC
We got trace log files from several coordinators.  From "Attempting to ..." to "Initializing rebalance ...", about 36 minutes elapsed in the first excerpt below and about 3 minutes in the second.

~~~
2014-10-09 16:23:51.505 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Received new cluster view: 14, isCoordinator = true, becameCoordinator = true
2014-10-09 16:23:51.505 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Recovering running caches in the cluster
2014-10-09 16:23:51.507 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0, current
2014-10-09 17:00:12.834 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=24, currentCH=DefaultConsistentHash{numSegments=1600, numOwners=2, mem
...
~~~

~~~
2014-11-11 15:29:21.211 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Received new cluster view: 43, isCoordinator = true, becameCoordinator = true
2014-11-11 15:29:21.211 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Recovering running caches in the cluster
2014-11-11 15:29:21.213 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0,
2014-11-11 15:32:36.881 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=43, currentCH=DefaultConsistentHash{numSegments=1600, numOwner
...
~~~

Another impact is that the cluster member list gets dumped to the log file.  For example, a log file containing only 281 lines of this message category reached 135 MB.  The customer is already trying to reduce the output.

Comment 9 Osamu Nagano 2014-11-26 00:26:52 UTC
In addition to reducing the size of the returned object, multi-threading the loop would be helpful.  Is this possible?
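
For reference, one way the loop at line 220 could be multi-threaded is to submit each cache's update to a pool and wait for all of them. This is only a sketch under assumptions: the MergeWorker interface is a hypothetical stand-in for updateCacheStatusAfterMerge(), and the pool sizing and error handling are not the actual change tracked in the JIRAs above.

~~~
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the suggestion in this comment; not the actual fix.
public class ParallelMergeSketch {
    interface MergeWorker {  // hypothetical stand-in for updateCacheStatusAfterMerge()
        void updateCacheStatusAfterMerge(String cacheName, List<?> topologyList) throws Exception;
    }

    static void recoverAll(Map<String, List<?>> clusterCacheMap, MergeWorker worker) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1,
                Math.min(clusterCacheMap.size(),
                         Runtime.getRuntime().availableProcessors())));
        try {
            CompletableFuture<?>[] tasks = clusterCacheMap.entrySet().stream()
                    .map(e -> CompletableFuture.runAsync(() -> {
                        try {
                            worker.updateCacheStatusAfterMerge(e.getKey(), e.getValue());
                        } catch (Exception ex) {
                            // Same policy as the original loop: log the cache
                            // that failed and keep going with the others.
                            System.err.println("Failed to recover " + e.getKey() + ": " + ex);
                        }
                    }, pool))
                    .toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(tasks).join();  // wait for every cache to finish
        } finally {
            pool.shutdown();
        }
    }
}
~~~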

Comment 11 JBoss JIRA Server 2015-03-18 10:11:48 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-5019 to Coding In Progress