Bug 1163573
| Summary: | Coordinator promotion keeps failing in succession, with massive GCs in a long loop | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 6 | Reporter: | Osamu Nagano <onagano> |
| Component: | Infinispan | Assignee: | William Burns <wburns> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Martin Gencur <mgencur> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.3.1 | CC: | dberinde, dmehra, dstahl, jdg-bugs, rmacor |
| Target Milestone: | ER7 | | |
| Target Release: | 6.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Previously in Red Hat JBoss Data Grid, for a large cluster with many caches, the CacheStatusResponse map on the new coordinator consumed an unexpectedly large amount of memory. As a result, JVM garbage collection came under extra load when a new coordinator was selected for the cluster.

This issue is now resolved in JBoss Data Grid 6.4 by allowing various instances to be cached in the serialization context, so that identical consistent hashes and repeated cache join information are reused instead of duplicated.
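As a rough illustration of the instance caching described in the Doc Text, the hypothetical sketch below canonicalizes equal objects so that many GET_STATUS responses carrying identical consistent-hash or cache-join-info values end up sharing one instance. The InstanceCache name and intern method are assumptions for illustration, not Infinispan's actual serialization-context API.

~~~
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: canonicalize equal objects so that hundreds of
// responses carrying identical consistent-hash or join-info values share
// one instance instead of each holding its own deserialized copy.
// Relies on the cached type having correct equals()/hashCode().
public final class InstanceCache<T> {
    private final Map<T, T> canonical = new ConcurrentHashMap<>();

    // Returns the previously cached equal instance, or caches this one.
    public T intern(T value) {
        T existing = canonical.putIfAbsent(value, value);
        return existing == null ? value : existing;
    }
}
~~~

With interning of this kind in the deserialization path, the coordinator's CacheStatusResponse map references shared objects rather than a private copy per responding node, which is the kind of memory saving the Doc Text describes.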
Created attachment 956913: clustered.conf
William Burns <wburns> updated the status of jira ISPN-4979 to Coding In Progress

We got trace log files from several coordinators. In the two excerpts below, the interval from "Attempting to ..." to "Initializing rebalance ..." took about 36 minutes and 3 minutes, respectively.
~~~
2014-10-09 16:23:51.505 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Received new cluster view: 14, isCoordinator = true, becameCoordinator = true
2014-10-09 16:23:51.505 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Recovering running caches in the cluster
2014-10-09 16:23:51.507 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0, current
2014-10-09 17:00:12.834 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (undefined) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=24, currentCH=DefaultConsistentHash{numSegments=1600, numOwners=2, mem
...
~~~
~~~
2014-11-11 15:29:21.211 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Received new cluster view: 43, isCoordinator = true, becameCoordinator = true
2014-11-11 15:29:21.211 DEBUG [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Recovering running caches in the cluster
2014-11-11 15:29:21.213 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Attempting to execute command on self: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=isjpissc96p1088_inst01/clustered(s1), joinInfo=null, topologyId=0,
2014-11-11 15:32:36.881 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (infinispan 6-7596) Initializing rebalance policy for cache SYSTEM_CONFIG_001_B, pre-existing partitions are [CacheTopology{id=43, currentCH=DefaultConsistentHash{numSegments=1600, numOwner
...
~~~
Another impact is that the cluster member list is dumped to the log file. For example, a log file containing only 281 lines of this category reached 135 MB. The customer is already trying to reduce this output.
In addition to reducing the size of the returned object, multi-threading the loop would be helpful. Is this possible?

Dan Berindei <dberinde> updated the status of jira ISPN-5019 to Coding In Progress
The customer is testing a coordinator failover scenario. After they shut down the machine hosting the 1st coordinator, the 2nd coordinator was chosen, but it was soon dropped from the cluster because of a massive GC. The 3rd coordinator was then chosen, but it failed the same way, as did the 4th, and so on. They are tuning GC parameters now and are also looking for other improvements.

Our log analysis shows that the loop at line 220 of ClusterTopologyManagerImpl is very costly. Under the customer's large cluster configuration, this loop can run for a very long time on a single thread.

core/src/main/java/org/infinispan/topology/ClusterTopologyManagerImpl.java

~~~
216       if ((isCoordinator && mergeView) || becameCoordinator) {
217          try {
218             Map<String, List<CacheTopology>> clusterCacheMap = recoverClusterStatus(newViewId);
219
220             for (Map.Entry<String, List<CacheTopology>> entry : clusterCacheMap.entrySet()) {
221                String cacheName = entry.getKey();
222                List<CacheTopology> topologyList = entry.getValue();
223                try {
224                   updateCacheStatusAfterMerge(cacheName, transport.getMembers(), topologyList);
225                } catch (Exception e) {
226                   log.failedToRecoverCacheState(cacheName, e);
227                }
228             }
229          } catch (InterruptedException e) {
230             log.tracef("Cluster state recovery interrupted because the coordinator is shutting down");
~~~

In addition, the object returned at line 218, clusterCacheMap, appears to be very large itself, causing excessive GCs. We need a review by engineering.

Environment:

JDG 6.3.1 cluster: 40 hosts, each running 4 instances. 137 caches defined, each with 1600 segments.

Attachments:

- clustered.xml
- clustered.conf
- host-ip-mapping.txt
- failing-coordinators.txt (extracted log messages)
- gc-and-server-logs.zip (log files from 2 servers)
  - inst01/10.9.64.73 -- the 2nd coordinator
  - inst01/10.9.64.143 -- the 3rd coordinator

Note that the 1st coordinator was killed at 2014-10-29 16:21:41.
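On the multi-threading question above, the following is a minimal sketch of what distributing the loop across a fixed-size pool could look like. It borrows only the shape of the loop from the excerpt (a map of cache names to topology lists, one recovery call per cache); the class name ParallelRecoverySketch, the recoverOneCache callback, and the pool sizing are hypothetical illustrations, not Infinispan code.

~~~
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiConsumer;

// Hypothetical sketch of parallelizing the per-cache recovery loop shown
// above: one task per cache on a bounded pool, instead of a single thread
// iterating 137 caches serially.
public final class ParallelRecoverySketch {

    public static <T> void recoverInParallel(Map<String, List<T>> clusterCacheMap,
                                             BiConsumer<String, List<T>> recoverOneCache,
                                             int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletableFuture<?>[] tasks = clusterCacheMap.entrySet().stream()
                    .map(entry -> CompletableFuture.runAsync(() -> {
                        try {
                            // Stands in for updateCacheStatusAfterMerge(...).
                            recoverOneCache.accept(entry.getKey(), entry.getValue());
                        } catch (RuntimeException e) {
                            // Log and keep going, mirroring the per-cache
                            // catch block in the excerpt above.
                            System.err.println("Failed to recover cache "
                                    + entry.getKey() + ": " + e);
                        }
                    }, pool))
                    .toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(tasks).join();  // wait for all caches
        } finally {
            pool.shutdown();
        }
    }
}
~~~

Whether the real updateCacheStatusAfterMerge is safe to run concurrently for different caches is exactly the kind of question the engineering review requested above would need to answer.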