Bug 745870 (EDG-106) - Data stress performance: Long delays in putting data to a cluster in use by other clients between test iterations
Summary: Data stress performance: Long delays in putting data to a cluster in use by other clients between test iterations
Keywords:
Status: CLOSED NEXTRELEASE
Alias: EDG-106
Product: JBoss Data Grid 5
Classification: JBoss
Component: Infinispan
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: EAP 5.1.0 EDG TP
Assignee: Default User
QA Contact:
URL: http://jira.jboss.org/jira/browse/EDG...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-22 22:39 UTC by Richard Achmatowicz
Modified: 2011-10-11 17:09 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-10 13:01:07 UTC
Type: Bug


Attachments
vmstat-normal.gif (135.66 KB, image/gif) - 2011-03-23 21:10 UTC, Richard Achmatowicz
vmstat-gcs.gif (143.30 KB, image/gif) - 2011-03-23 21:10 UTC, Richard Achmatowicz
FullGCs.gif (180.78 KB, image/gif) - 2011-03-23 21:10 UTC, Richard Achmatowicz
heapPerf17.txt (1.47 KB, text/plain) - 2011-03-23 21:10 UTC, Richard Achmatowicz
GCs-last.gif (168.97 KB, image/gif) - 2011-03-24 18:50 UTC, Richard Achmatowicz


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker EDG-106 0 Major Closed Data stress performance: Long delays in putting data to a cluster in use by other clients between test iterations 2015-09-02 09:19:16 UTC

Description Richard Achmatowicz 2011-03-22 22:39:36 UTC
project_key: EDG

I've been observing *extremely* long delays when putting data to a cluster which is in use by other clients.

This situation arises during data stress tests: 
- data grid is initially loaded with data
- on first iteration, clients interact with data grid
- before moving to next iteration, a data loader (an ordinary 'new' client) loads <key,value> pairs into a range not used by the existing clients

These data loading phases can take up to an hour and a half to load a set of key/value pairs that at other times takes only a few minutes.

For example, in one data stress test, on iteration 1, it took 1 hr 35 mins to load 100,000 records:
2011/03/22 16:45:12:459 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Server response within limits, increasing load.
2011/03/22 16:45:12:462 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Desired load reached. New session count: 800
2011/03/22 16:45:12:463 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Starting CacheManager...
2011/03/22 16:45:12:465 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 0.
2011/03/22 16:45:45:877 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 16:50:25:192 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 16:53:36:550 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 16:53:59:098 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 16:53:59:099 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 1.
2011/03/22 16:54:30:603 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 16:57:58:570 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 17:02:26:288 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 17:03:10:246 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 17:03:10:246 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 2.
2011/03/22 17:03:52:980 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 17:09:27:692 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 17:14:48:885 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 17:15:23:933 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 17:15:23:933 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 3.
2011/03/22 17:15:54:999 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 17:21:36:581 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 17:27:27:922 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 17:28:14:176 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 17:28:14:176 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 4.
2011/03/22 17:29:01:898 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 17:36:58:897 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 17:45:01:607 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 17:45:56:887 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 17:45:56:887 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 5.
2011/03/22 17:46:51:239 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 17:53:29:939 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 17:54:39:642 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 17:54:47:169 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 17:54:47:169 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 6.
2011/03/22 17:54:54:588 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 18:01:01:613 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 18:06:01:556 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 18:06:02:859 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 18:06:02:859 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 7.
2011/03/22 18:06:04:014 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 18:06:14:840 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 18:06:27:372 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 18:06:28:515 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 18:06:28:515 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 8.
2011/03/22 18:06:29:761 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 45000 records...
2011/03/22 18:06:40:511 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 50000 records...
2011/03/22 18:06:51:166 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 55000 records...
2011/03/22 18:06:52:263 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 55556 records.
2011/03/22 18:06:52:263 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Stopping client...
2011/03/22 18:06:52:264 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Data loading done.

On iteration 2, it took 4 minutes to load the same 100,000 records:
2011/03/22 18:07:52:509 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Desired load reached. New session count: 800
2011/03/22 18:07:52:509 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Starting CacheManager...
2011/03/22 18:07:52:512 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 0.
2011/03/22 18:08:02:641 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:08:13:509 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:08:17:159 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:08:17:159 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 1.
2011/03/22 18:08:26:487 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:08:37:314 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:08:40:835 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:08:40:835 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 2.
2011/03/22 18:08:50:147 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:09:00:644 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:09:04:214 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:09:04:214 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 3.
2011/03/22 18:09:13:536 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:09:24:266 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:09:27:911 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:09:27:911 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 4.
2011/03/22 18:09:37:557 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:09:48:041 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:09:51:679 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:09:51:679 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 5.
2011/03/22 18:10:01:368 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:10:12:666 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:10:16:135 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:10:16:135 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 6.
2011/03/22 18:10:25:725 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:10:36:228 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:10:39:853 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:10:39:853 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 7.
2011/03/22 18:10:49:127 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:11:00:720 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:11:04:579 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:11:04:579 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Processing LoadDriver 8.
2011/03/22 18:11:14:461 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 60000 records...
2011/03/22 18:11:25:852 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added 65000 records...
2011/03/22 18:11:29:357 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Added up to 66667 records.
2011/03/22 18:11:29:357 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Stopping client...
2011/03/22 18:11:29:357 EDT [INFO ][Thread-2] HOST perf01.mw.lab.eng.bos.redhat.com:rootProcess:E - Data loading done.

I see this happening on many of the data stress tests when moving from one iteration to another.
It could be a GC issue, but I wanted to note it here for the record.

Comment 1 Richard Achmatowicz 2011-03-22 22:41:50 UTC
The output above is from the job: http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-data-size4-hotrod/


Comment 3 Richard Achmatowicz 2011-03-23 11:55:47 UTC
This issue is preventing us from completing full rounds of the data stress tests.
We are investigating the implementation of a parallel data loader which uses multiple threads to load data into the grid after each iteration.
I'm also going to start debugging this issue specifically today.


Comment 4 Richard Achmatowicz 2011-03-23 12:10:45 UTC
Some additional data concerning this particular job, suggested by Michal. The SmartFrog configuration looks like:
   initialSessionCount 800 ;
   datasetCount 400000 ;
   datasetSize 4294 ;
   datasetCountIncrement 100000 ;
which means that on each iteration, 100,000 keys with a payload of 4294 bytes each are loaded, or roughly 410 MB of data.

The data grid is made up of 4 x 4096 MB JVM heaps.

The loading takes place under constant load from 800 clients, giving roughly 32 MB/s of constant total throughput (800 clients * 10 reqs/sec * 4294 B/req/client).

 

Comment 5 Richard Achmatowicz 2011-03-23 12:15:54 UTC
The loading configuration:
perf01 -> stress test controller + data loader
perf02-perf10 -> load drivers containing 800 clients
perf17-perf20 -> cluster nodes (on machines using large pages)

So, one load driver thread (doing only puts) is competing with 800 client threads (doing gets followed by puts).
The clients remain active during the data loading in order to maintain client stress on the cluster.


Comment 6 Galder Zamarreño 2011-03-23 17:14:04 UTC
Is this limited to Hot Rod stress tests only?

What is the Hot Rod client configuration used?

Does each client create its own RemoteCacheManager? If so, each client might be creating a connection to each server. So, if you have 5000 clients and 10 nodes, that'd be 50,000 connections. A RemoteCacheManager can indeed be shared by multiple clients, but the Hot Rod client configuration would need to be tweaked.

The configurable options are documented at http://docs.jboss.org/infinispan/4.2/apidocs/org/infinispan/client/hotrod/RemoteCacheManager.html
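
For illustration, a minimal sketch of what sharing a single RemoteCacheManager per client JVM could look like. The server list and pool property names are assumptions to be checked against the RemoteCacheManager javadoc linked above; this is not the actual test configuration:

{code}
import java.util.Properties;

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class SharedHotRodClient {

    // One RemoteCacheManager per JVM, shared by all client threads,
    // instead of one manager (and one connection set) per client.
    private static final RemoteCacheManager CACHE_MANAGER = createManager();

    private static RemoteCacheManager createManager() {
        Properties props = new Properties();
        // Hypothetical server list for the perf17-perf20 cluster nodes.
        props.put("infinispan.client.hotrod.server_list",
                  "perf17:11222;perf18:11222;perf19:11222;perf20:11222");
        // Bound the connection pool so N client threads do not mean N connections per server
        // (property names as per the javadoc above; verify for the client version in use).
        props.put("maxActive", "20");
        props.put("maxTotal", "80");
        return new RemoteCacheManager(props);
    }

    public static RemoteCache<String, byte[]> getCache() {
        return CACHE_MANAGER.getCache();
    }
}
{code}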

Comment 7 Richard Achmatowicz 2011-03-23 17:22:59 UTC
Galder, in an attempt to deal with this issue, Michal has parallelized the data loader component, so that now numThreads clients are set up to perform puts and initialise the data, all from the same process. We are still testing this out, but it seems to have had a big impact on this problem. So this may be a testing framework issue.
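
As an illustration only (the actual loader lives in the SmartFrog framework and may differ), a parallel loader along these lines splits the key range across a fixed pool of threads, each doing plain puts against a shared remote cache; the class and method names here are hypothetical:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.infinispan.client.hotrod.RemoteCache;

public class ParallelDataLoader {

    // Split [startKey, endKey) into numThreads contiguous chunks and load them concurrently.
    public static void load(final RemoteCache<String, byte[]> cache,
                            final int startKey, final int endKey,
                            final int valueSize, final int numThreads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        final int chunk = (endKey - startKey + numThreads - 1) / numThreads;
        for (int t = 0; t < numThreads; t++) {
            final int from = startKey + t * chunk;
            final int to = Math.min(from + chunk, endKey);
            pool.submit(new Runnable() {
                public void run() {
                    byte[] value = new byte[valueSize];   // fixed-size payload, e.g. 4294 bytes
                    for (int key = from; key < to; key++) {
                        cache.put("key-" + key, value);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS);
    }
}
{code}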

I'll report back with more info once I get a few larger runs tested.



Comment 8 Richard Achmatowicz 2011-03-23 21:08:35 UTC
OK. The changes which Michal made to the data loader sped things up a bit, but I was still hitting a wall around iteration 9 of 12. 
The problem is this:
- by using vmstat on the controller, a couple of drivers and all cluster nodes, you can see that the controller and driver CPU usage is normal up to iteration 6 or 7 (see vmstat-normal.gif)
- at about iteration 7 or so, the controller and drivers become more and more idle, whereas the cluster nodes use progressively more CPU (see vmstat-gcs.gif)
- by using jstat -gcinfo to look at garbage collection on all nodes, you can see that there is very little garbage collection on the controller and the drivers, but many, many full GCs on most of the cluster nodes - about one full GC every 5 seconds or so (see FullGCs.gif)

Using jmap to get a snapshot of the heap on one of the nodes that is garbage collecting shows that the tenured generation is basically full (see heapPerf17.txt) and so has to be GC'd on a frequent basis.


 

Comment 9 Richard Achmatowicz 2011-03-23 21:10:21 UTC
Attachment: Added: vmstat-normal.gif
Attachment: Added: vmstat-gcs.gif
Attachment: Added: FullGCs.gif
Attachment: Added: heapPerf17.txt


Comment 10 Richard Achmatowicz 2011-03-23 21:20:05 UTC
Because the heap is a fixed size (4096 MB) and -XX:NewRatio=3 is used, the new generation is going to be 25% of the heap and the tenured generation 75%.
You can see from the heap stats that the new generation has been sized at 1024 MB (921 + 102) and the tenured generation at 3072 MB - the maximum in both cases. So this seems to be a GC tuning issue.
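
For reference, ignoring survivor-space details, the split implied by a fixed NewRatio on a fixed heap works out as:

new = heap / (NewRatio + 1) = 4096 / (3 + 1) = 1024 MB
tenured = heap - new = 4096 - 1024 = 3072 MB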

It seems the only options available are to make tenured a little larger (by shrinking newGen) and trying to delay the promotion of objects from newGen to tenuredGen as long as possible, so they may get GC'd in a minor collection instead of a full collection.

Going to give some options a try.


Comment 11 Richard Achmatowicz 2011-03-23 22:49:05 UTC
On the other hand, this may just be an unavoidable consequence of non-uniform distribution. I just ran another job and one node of four had double the full GCs and a consistent tenured occupancy of 90% and above, while the other three nodes were at around the 60% mark. In trying to stress the overall data grid capacity to 80%, with uneven distribution some heaps are going to fill up and block progress before that saturation point is reached. I suspect this is what is happening.

Comment 12 Richard Achmatowicz 2011-03-24 18:50:14 UTC
I tried running some more tests today and found that loading 100,000 records between iterations was taking 25 minutes, too long.
To reduce this time, I had two options:
(i) load fewer keys and use larger value sizes
(ii) scale the test down uniformly

The first option could be used to cut the number of keys from a top value of 1,400,000 down to some smaller number. But this would take a great load off GC (fewer pieces of garbage to collect), so I wasn't too keen on that.

The second option would be to cut the heap/perm gen size in half and scale down accordingly, going from:

   datasetCount 400000 ;
   datasetSize 4294 ;
   datasetCountIncrement 100000 ;
   maxIterations 13 ;
   jvmOpts "-server -Xmx4096m -XX:MaxPermSize=512m -XX:+UseLargePages -XX:NewRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC";

to

   datasetCount 200000 ;
   datasetSize 2147 ;
   datasetCountIncrement 50000 ;
   maxIterations 13 ;
   jvmOpts "-server -Xmx2048m -XX:MaxPermSize=256m -XX:+UseLargePages -XX:NewRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC";

This still stresses the heap from 20% to 80% as before, but now with relatively more physical RAM backing the virtual memory in use.
Making this change cut the load time from 25 minutes per iteration to 2 minutes per iteration. This solved the load time problem.
Why the dramatic difference in load time? I can't say just yet, although the vmstat reports do not show swapping activity with the larger heap.

However, the test was still getting stuck on the last two iterations, because two of the four processes had their heaps saturated and were performing full GCs one after the other (see GCs-last.gif). So, rather than struggle against a known problem, I've reduced the target heap utilisation range from 20-80% to 20-70%.





 

Comment 13 Richard Achmatowicz 2011-03-24 18:50:43 UTC
Attachment: Added: GCs-last.gif


Comment 14 Galder Zamarreño 2011-03-28 08:41:45 UTC
So Richard, to sum up, you think that the hang-up issue between iterations is due to uneven distribution and some nodes utilising too much memory, correct? I take from your comments above that the driver which sends the operations has no particular issues here?

Also, I'm not sure that you answered my questions above:
1. Do you see these pauses between loads in both Memcached and Hot Rod runs, or only in the Hot Rod case?
2. What is the current setup in terms of RemoteCacheManager usage? Are you moving towards a shared RCM for all clients in the same JVM?



Comment 15 Radoslav Husar 2011-03-28 14:32:50 UTC
> Are you moving towards to a shared RCM for all clients in same JVM?


We should probably discuss that team-wide for all the existing tests, or set up all tests with both settings. The client stress tests are about stressing with lots of clients, so sharing might be counter-productive there. Maybe only consider it for the data stress tests?

In SmartFrog world, set the client I created to use the shared RCM: 

{code}
// using shared instance of cache manager
HOTROD_SHARED extends ClientSettings {
...
{code}

Comment 16 Richard Achmatowicz 2011-03-31 19:58:02 UTC
There seem to be two issues here:
(i) tests running slowly because of the time needed to load data between iterations
(ii) tests hanging because heaps fill up and trigger continuous full GCs, with some heaps reaching 100% considerably faster than others

To address (i):

I ran a 3-iteration comparative test to check the data loading times. The initial load takes place with no client activity present; the iterations have 2000 clients interacting during the data load (which is where the problem arises in the data stress tests).

cluster size 4, 2000 clients, initial load 200k, iteration load 100k:

||client||initial load||1st iteration||2nd iteration||
|HotRod|1 min|22 min|23 min|
|HotRod(shared)|1 min|3 min|5 min|
|Memcached|1 min|5 min|7 min|
|REST|2 min|19 min|21 min|

So, it appears that the choice of client significantly affects the data loading time.



Comment 17 Richard Achmatowicz 2011-04-01 17:06:49 UTC
Even with the data loading time improved by choosing the HotRod shared / memcached client, the loading times were still very bad when I moved to 4000 clients:

cluster size 4, 4000 clients, initial load 200k, iteration load 100k:

||client||initial load||1st iteration||2nd iteration||
|HotRod (shared)|1 min|45 min| didn't bother|

By using jstat -gcutil, I could see that 23 minutes of the 45-minute load time were spent in GC, so useful throughput was 22/45, or roughly only 50%. This was using the following JVM settings:

java -server -Xmx4096m -XX:MaxPermSize=512m -XX:+UseLargePages -XX:NewRatio=3 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
 



Comment 18 Richard Achmatowicz 2011-04-01 17:24:56 UTC
By running some experiments on very small iterations with a fixed number of clients, I found the following relative results:

||NewRatio||Tenured % used||YGC||FGC||GC Time||
|1|64|1704|36|12 sec|
|2|69|2370|34|17 sec|
|3|78|3442|32|24 sec|
|4|83|4496|32|30 sec|

By changing the NewRatio from 3 to 1 (so that the heap space used for old objects is roughly the same size as that used for new objects)
there was a big improvement in load times:

cluster size 4, 4000 clients, initial load 200k, iteration load 100k, NewRatio=3
||Iteration||YGC||FGC||GC Time||Iteration Time||
|1st iteration|237602|400|23 min|45 min|
|2nd iteration|n/a|n/a|n/a|n/a|

cluster size 4, 4000 clients, initial load 200k, iteration load 100k, NewRatio=1
||Iteration||YGC||FGC||GC Time||Iteration Time||
|1st iteration|63719|176|6 min|11 min|
|2nd iteration|108316|256|10 min|13 min|




Comment 19 Richard Achmatowicz 2011-04-01 17:28:32 UTC
I'm changing the defaults for the data stress tests to use NewRatio=1 to see if the tests now complete with larger numbers of clients. 

Comment 20 Galder Zamarreño 2011-05-09 14:43:09 UTC
Richard, what's the status on this?

Comment 21 Richard Achmatowicz 2011-05-09 15:06:18 UTC
Galder

We traced the long load times down to two factors:
(i) delay in increasing the thread count on each iteration - with large thread counts, it was taking a long time to create the new threads on all the load drivers
(ii) delay in loading data into the grid while under client stress

The first issue was addressed by parallelizing the SmartFrog controller code responsible for increasing the thread counts on each of the driver nodes. This was previously performed serially (create new threads on load driver 1, then create new threads on load driver 2, etc.).

The second issue was addressed by adding a flag, stopClientsOnLoad, which causes all client-side drivers to stop interacting with the grid while the data in the grid is being updated, restarting them after the data is loaded and before the next iteration begins.
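
For illustration, the kind of gate a stopClientsOnLoad flag implies could look like the following minimal sketch (the real SmartFrog implementation may coordinate the drivers differently; the class name is hypothetical):

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ClientGate {

    // Clients hold the read lock while issuing requests; the data loader takes the
    // write lock to quiesce them for the duration of a load phase.
    private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock(true);

    public void clientRequest(Runnable request) {
        gate.readLock().lock();
        try {
            request.run();
        } finally {
            gate.readLock().unlock();
        }
    }

    public void loadData(Runnable loadPhase) {
        gate.writeLock().lock();       // blocks until in-flight client requests drain
        try {
            loadPhase.run();
        } finally {
            gate.writeLock().unlock(); // clients resume before the next iteration starts
        }
    }
}
{code}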

GC tuning helped to some extent here but the two measures above made the greatest difference. So, we have data stress test runs which now do not suffer from the original issue.




Comment 22 Galder Zamarreño 2011-05-10 08:51:13 UTC
Right, so is there any reason to keep this JIRA open then? It doesn't look like it :)

Comment 23 Richard Achmatowicz 2011-05-10 13:01:07 UTC
Issue was investigated and resolved.

