Bug 1372647 - Async cross data center replication leads to OOM error under load
Summary: Async cross data center replication leads to OOM error under load
Keywords:
Status: NEW
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Performance
Version: 6.6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Tristan Tarrant
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-02 09:37 UTC by Martin Gencur
Modified: 2016-09-19 21:11 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug


Attachments (Terms of Use)
Logs and config files (2.91 MB, application/zip)
2016-09-02 09:37 UTC, Martin Gencur

Description Martin Gencur 2016-09-02 09:37:06 UTC
Created attachment 1197102 [details]
Logs and config files

During a write-heavy load test, JDG throws an OutOfMemoryError in the backup data center. This happens after roughly 10 minutes of heavy load (six HotRod clients writing with no delay between requests; the overall load is about 1300 requests/s with 33 kB values, write-only).

Description of test scenario:
* two data centers, each running two JDG servers: LON with nodes A and B, NYC with nodes C and D
* six HotRod clients writing data only to LON (33 kB values, writing as quickly as possible)
* ASYNC replication between the DCs
* JGroups RELAY2 configured with multiple site masters (max_site_masters=2, i.e. all nodes in each two-node site are site masters)
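For reference, the scenario above could be configured roughly as follows. This is a hedged sketch, not the attached configuration: the site names and the max_site_masters value come from the description, while the cache name and other attributes are illustrative.

```xml
<!-- Infinispan cache in LON with an ASYNC backup to NYC (sketch) -->
<distributed-cache name="testCache">
    <backups>
        <backup site="NYC" strategy="ASYNC"/>
    </backups>
</distributed-cache>

<!-- JGroups RELAY2 stack fragment with two site masters per site (sketch) -->
<relay.RELAY2 site="LON" max_site_masters="2"/>
```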

The logs from individual nodes show the following pattern:
1) node C (in receiving data center NYC): [GC (Allocation Failure) [PSYoungGen: 1048576K->56409K(1223168K)] 1133628K->141469K(4019712K), 0.1291044 secs] [Times: user=0.46 sys=0.01, real=0.13 secs]
2) node C: java.lang.OutOfMemoryError: Java heap space
3) node A (in sending data center LON): WARN  [org.jgroups.protocols.TCP] (HotRodServerWorker-3) Discarding message because TCP send_queue is full and hasn't been releasing for 300 ms
4) node A: WARN  [org.jgroups.protocols.TCP] (ConnectionMap.Acceptor [172.18.1.4:7610]) JGRP000006: failed accepting connection from peer: java.net.SocketTimeoutException: Read timed out
5) node A: ERROR [org.jgroups.protocols.relay.RELAY2] (HotRodServerWorker-3) node0/LON: no route to NYC: dropping message
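The failure chain above can be illustrated with a toy model (plain Python, not JGroups code; all names and the capacity are made up for illustration): the sender's TCP send_queue is bounded, so once the receiver stops draining it, new messages are discarded (log lines 3 and 5), while an unbounded receive-side buffer would instead grow until the heap fills (log line 2).

```python
from collections import deque

SEND_QUEUE_CAPACITY = 100  # hypothetical bound, mirroring a bounded send_queue
send_queue = deque()
discarded = 0

# Simulate a producer that is never drained by the consumer.
for _ in range(1000):
    msg = bytearray(33_000)                    # ~33 kB value, as in the load test
    if len(send_queue) < SEND_QUEUE_CAPACITY:
        send_queue.append(msg)                 # accepted for sending
    else:
        discarded += 1                         # queue full, nothing draining: drop

print(discarded)  # 900 of the 1000 messages are dropped once the queue fills
```

The bounded queue sheds load on the sender; the OOM happens on the receiver, where buffered messages accumulate without such a bound.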

We also captured a heap dump on the OOM error on node C; the interesting part is the following:

Class Name                                                  | Shallow Heap | Retained Heap
------------------------------------------------------------|--------------|--------------
org.jgroups.util.Table @ 0x7002337c0                        |          112 | 3,043,681,288
org.infinispan.container.DefaultDataContainer @ 0x700190b50 |           56 |   331,759,576

Note: the overall heap is 4 GB. We keep rewriting the same ten thousand entries, 33 kB each, which gives about 330 MB overall (this matches the data container value above).
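A back-of-the-envelope check using the figures from the report (the rates and sizes are taken from the description; the interpretation as undelivered backlog is an assumption) shows why the heap fills so fast: the ~3 GB retained by org.jgroups.util.Table corresponds to barely more than a minute of replication traffic if the backup site stops consuming.

```python
# Figures from the bug report (assumed exact for this estimate).
value_bytes = 33_000            # ~33 kB per value
writes_per_sec = 1300           # overall client load
throughput = value_bytes * writes_per_sec        # bytes/s of replicated data

retained_by_table = 3_043_681_288                # heap dump: org.jgroups.util.Table
seconds_of_backlog = retained_by_table / throughput

entries = 10_000
live_data = entries * value_bytes                # matches DefaultDataContainer (~330 MB)

print(round(throughput / 1e6), round(seconds_of_backlog), live_data)
# ~43 MB/s of traffic; ~71 s of unconsumed backlog fills 3 GB; 330 MB live data
```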

Attaching logs and config files. Nodes edg-perf02 and edg-perf03 are nodes A and B from the description above; edg-perf04 and edg-perf05 are nodes C and D; edg-perf06 and edg-perf07 host the HotRod clients (but only edg-perf06 writes data).

