Bug 1372647

Summary: Async cross data center replication leads to OOM error under load
Product: [JBoss] JBoss Data Grid 6
Reporter: Martin Gencur <mgencur>
Component: Performance
Assignee: Tristan Tarrant <ttarrant>
Status: NEW
QA Contact: Martin Gencur <mgencur>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 6.6.1
CC: afield
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments: Logs and config files (flags: none)

Description Martin Gencur 2016-09-02 09:37:06 UTC
Created attachment 1197102
Logs and config files

During a write-heavy load test, JDG throws an OutOfMemoryError in the backup data center. This happens after roughly 10 minutes of heavy load (6 HotRod clients writing without any delay between requests; the overall load is about 1300 requests/s with 33 kB values, write-only).
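For reference, the client-side load was generated roughly along these lines on each driver node (a minimal sketch only, not the actual test driver; the server host name, HotRod port 11222, cache name and key naming are assumptions):

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

import java.util.concurrent.ThreadLocalRandom;

public class WriteLoadClient {
    public static void main(String[] args) {
        // Connect to one of the LON servers (host name and HotRod port are hypothetical)
        ConfigurationBuilder cb = new ConfigurationBuilder();
        cb.addServer().host("edg-perf02").port(11222);
        RemoteCacheManager rcm = new RemoteCacheManager(cb.build());
        RemoteCache<String, byte[]> cache = rcm.getCache("testCache");

        byte[] value = new byte[33 * 1024];   // 33 kB payload, as in the test
        ThreadLocalRandom rnd = ThreadLocalRandom.current();

        // Write as fast as possible, with no delay between requests,
        // over a fixed key space of ten thousand entries
        while (true) {
            cache.put("key-" + rnd.nextInt(10000), value);
        }
    }
}

Six such clients were run in parallel against the LON servers.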

Description of test scenario:
* two data centers with two JDG servers each: LON with nodes A and B, NYC with nodes C and D
* six HotRod clients writing data only to LON (33 kB values, writing as quickly as possible)
* ASYNC replication between the DCs (see the configuration sketch after this list)
* JGroups RELAY2 uses multiple site masters, set to 2 (all nodes are site masters)
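For context, the ASYNC backup from LON to NYC corresponds roughly to the following cross-site setup, shown here with Infinispan's programmatic API rather than the actual JDG server XML (a sketch only; the cache mode and names are assumptions):

import org.infinispan.configuration.cache.BackupConfiguration.BackupStrategy;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;

public class CrossSiteConfigSketch {
    public static void main(String[] args) {
        // The local site name must match the site announced by JGroups RELAY2 (LON here)
        GlobalConfiguration global = new GlobalConfigurationBuilder()
                .site().localSite("LON")
                .transport().defaultTransport()
                .build();

        // Every write to this cache is also forwarded to the NYC site, asynchronously
        Configuration cacheConfig = new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.DIST_SYNC)  // cache mode is an assumption
                .sites().addBackup()
                    .site("NYC")
                    .strategy(BackupStrategy.ASYNC)           // ASYNC cross-DC replication
                .build();

        // global and cacheConfig would then be passed to a DefaultCacheManager;
        // in the actual test the equivalent settings live in the JDG server configuration
    }
}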

The logs from individual nodes show the following pattern:
1) node C (in receiving data center NYC): [GC (Allocation Failure) [PSYoungGen: 1048576K->56409K(1223168K)] 1133628K->141469K(4019712K), 0.1291044 secs] [Times: user=0.46 sys=0.01, real=0.13 secs]
2) node C: java.lang.OutOfMemoryError: Java heap space
3) node A (in sending data center LON): WARN  [org.jgroups.protocols.TCP] (HotRodServerWorker-3) Discarding message because TCP send_queue is full and hasn't been releasing for 300 ms
4) node A: WARN  [org.jgroups.protocols.TCP] (ConnectionMap.Acceptor [172.18.1.4:7610]) JGRP000006: failed accepting connection from peer: java.net.SocketTimeoutException: Read timed out
5) node A: ERROR [org.jgroups.protocols.relay.RELAY2] (HotRodServerWorker-3) node0/LON: no route to NYC: dropping message

We also created a heap dump on the OOM error on node C, and the interesting part is the following:

Class Name                                                   | Shallow Heap | Retained Heap
--------------------------------------------------------------------------------------------
org.jgroups.util.Table @ 0x7002337c0                         |          112 | 3,043,681,288
org.infinispan.container.DefaultDataContainer @ 0x700190b50  |           56 |   331,759,576
--------------------------------------------------------------------------------------------

Note: the overall heap is 4 GB. We keep writing to only ten thousand entries of 33 kB each, i.e. about 330 MB overall (which corresponds to the data container value above).

Attaching logs and config files. Nodes edg-perf02 and edg-perf03 are nodes A and B from the description above; edg-perf04 and edg-perf05 are nodes C and D; edg-perf06 and edg-perf07 are the nodes running the HotRod clients (but only edg-perf06 writes data).