| Summary: | Data stress performance: Long delays in putting data to a cluster in use by other clients between test iterations |
|---|---|
| Product: | [JBoss] JBoss Data Grid 5 |
| Reporter: | Richard Achmatowicz <rachmato> |
| Component: | Infinispan |
| Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NEXTRELEASE |
| QA Contact: | |
| Severity: | high |
| Docs Contact: | |
| Priority: | high |
| Version: | unspecified |
| CC: | galder.zamarreno, mlinhard, rachmato, rhusar |
| Target Milestone: | --- |
| Target Release: | EAP 5.1.0 EDG TP |
| Hardware: | Unspecified |
| OS: | Unspecified |
| URL: | http://jira.jboss.org/jira/browse/EDG-106 |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2011-05-10 13:01:07 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Attachments: | |
Description
Richard Achmatowicz
2011-03-22 22:39:36 UTC
The output above is from the job: http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-stress-data-size4-hotrod/

This issue is preventing us from completing full rounds of the data stress tests. We are investigating the implementation of a parallel data loader which uses multiple threads to load data into the grid after each iteration. I'm also going to start debugging this issue specifically today.

Some additional data concerning this particular job, suggested by Michal. The SmartFrog configuration looks like:

initialSessionCount 800 ;
datasetCount 400000 ;
datasetSize 4294 ;
datasetCountIncrement 100000 ;

which means that on each iteration 100,000 keys with a payload of 4294 bytes are loaded, or roughly 410 MB of data. The data grid is made up of 4 x 4096 MB JVM heaps. The loading takes place under constant load from 800 clients, giving roughly 32 MB/s of constant total throughput (800 clients * 10 reqs/sec * 4294 bytes/req/client).

The loading configuration:
- perf01 -> stress test controller + data loader
- perf02-perf10 -> load drivers containing 800 clients
- perf17-perf20 -> cluster nodes (on machines using large pages)

So one load driver thread (doing only puts) is competing with 800 client threads (doing gets followed by puts). The clients remain active during the data loading in order to maintain client stress on the cluster.

Is this limited to the Hot Rod stress tests only? What is the Hot Rod client configuration used? Does each client create its own RemoteCacheManager? If so, each client might be creating a connection to each server; with 5000 clients and 10 nodes, that would be 50,000 connections. A RemoteCacheManager can indeed be shared by multiple clients, but the Hot Rod client configuration would need to be tweaked. The options that can be configured are listed in http://docs.jboss.org/infinispan/4.2/apidocs/org/infinispan/client/hotrod/RemoteCacheManager.html

Galder, in an attempt to deal with this issue, Michal has parallelized the data loader component, so that now numThreads clients are set up to perform puts and initialise the data, all from the same process. We are still testing this out, but it seems to have had a big impact on the problem, so this may be a testing framework issue. I'll report back with more info once I get a few larger runs tested.

OK. The changes which Michal made to the data loader sped things up a bit, but I was still hitting a wall around iteration 9 of 12. The problem is this:
- Using vmstat on the controller, a couple of drivers and all cluster nodes, you can see that controller and driver CPU usage is normal up to iteration 6 or 7 (see vmstat-normal.gif).
- At about iteration 7, the controller and drivers become more and more idle, whereas the cluster nodes use progressively more CPU (see vmstat-gcs.gif).
- Using jstat -gcinfo to look at garbage collection on all nodes, you can see that there is very little garbage collection on the controller and the drivers, but many, many full GCs on most of the cluster nodes - about one full GC every 5 seconds or so (see FullGCs.gif).

Using jmap to get a snapshot of the heap on one of the nodes which is garbage collecting shows that the tenured generation is basically full (see heapperf18.txt), and so has to be GC'd on a frequent basis.

Attachment: Added: vmstat-normal.gif
Attachment: Added: vmstat-gcs.gif
Attachment: Added: FullGCs.gif
Attachment: Added: heapPerf17.txt

Because the heap is of a fixed size (4096 MB) and -XX:NewRatio=3, the new generation is going to be 25% of the heap and the tenured generation 75%.
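As a quick back-of-the-envelope check (not part of the original report), the sketch below just spells out the generation sizing implied by a fixed 4096 MB heap and a given NewRatio; it ignores the survivor-space subdivision controlled by SurvivorRatio.

{code}
// Generation sizing implied by -Xmx4096m and -XX:NewRatio=N.
// NewRatio is the ratio of tenured (old) space to new (young) space,
// so young = heap / (NewRatio + 1) and tenured = heap - young.
public class GenSizing {
    public static void main(String[] args) {
        int heapMb = 4096;
        for (int newRatio : new int[] {1, 2, 3, 4}) {
            int youngMb = heapMb / (newRatio + 1);
            int tenuredMb = heapMb - youngMb;
            System.out.printf("NewRatio=%d -> young=%d MB, tenured=%d MB%n",
                    newRatio, youngMb, tenuredMb);
        }
    }
}
{code}

For NewRatio=3 this gives a 1024 MB young generation and a 3072 MB tenured generation, matching the heap statistics quoted next; with NewRatio=1 the two generations are roughly equal in size, which is the setting tried later in this report.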
You can see from the heap stats that the new generation has been sized to 1024 MB (921 + 102) and that the tenured generation is sized at 3072 MB - the maximum in both cases. So this seems to be a GC tuning issue. It seems the only options available are to make tenured a little larger (by shrinking the new generation) and to try to delay the promotion of objects from the new generation to tenured for as long as possible, so they may get GC'd in a minor collection instead of a full collection. Going to give some options a try.

On the other hand, this may just be an unavoidable consequence of non-uniform distribution. I just ran another job and one node of four had double the full GCs, with a consistent tenured occupancy of 90% and above, while the other three nodes were at around the 60% mark. In trying to stress the overall data grid capacity to 80%, with uneven distribution some heaps are going to fill up and block progress before that saturation takes place. I suspect this is what is happening.

I tried running some more tests today and found that loading 100,000 records between iterations was taking 25 minutes - too long. To reduce this time, I had two options: (i) load fewer keys and use larger value sizes, or (ii) scale the test down uniformly. The first option could be used to cut the number of keys from a top value of 1,400,000 down to some smaller number, but this would take a great load off GC (fewer pieces of garbage to collect), so I wasn't too keen on that. The second option was to cut the heap/perm gen size in half and scale everything down accordingly, going from:

datasetCount 400000 ;
datasetSize 4294 ;
datasetCountIncrement 100000 ;
maxIterations 13 ;
jvmOpts "-server -Xmx4096m -XX:MaxPermSize=512m -XX:+UseLargePages -XX:NewRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC";

to

datasetCount 200000 ;
datasetSize 2147 ;
datasetCountIncrement 50000 ;
maxIterations 13 ;
jvmOpts "-server -Xmx2048m -XX:MaxPermSize=256m -XX:+UseLargePages -XX:NewRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC";

This still stresses the heap from 20% to 80% as before, but there is now relatively more physical RAM available to back the VM memory in use. Making this change cut the load time from 25 minutes per iteration to 2 minutes per iteration, which solved the load time problem. Why the dramatic difference in load time? I can't say just yet, although the vmstat reports do not show swapping activity with the larger heap.

However, the test was still getting stuck on the last two iterations, because two of the four processes had their heaps saturated and were performing full GCs one after the other (see GCs-last.gif). So, rather than struggle when there is a known problem, I've reduced the target heap utilisation range so that it now runs from 20% to 70% rather than 20% to 80%.

Attachment: Added: GCs-last.gif

So Richard, to sum up, you think that the hang-up issue between iterations is due to uneven distribution and some nodes utilising too much memory, correct? I take from your comments above that the driver which sends the operations has no particular issues here? Also, I'm not sure that you answered my questions above:
1. Do you see these pauses between loads in both Memcached and Hot Rod runs, or only in the Hot Rod case?
2. What is the current set-up in terms of RemoteCacheManager usage? Are you moving towards a shared RCM for all clients in the same JVM?

> Are you moving towards a shared RCM for all clients in the same JVM?
We should probably discuss that team-wide for all the existing tests, or set up all tests with both settings. The client stress tests are about stressing with lots of clients, so sharing might be counter-productive there. Maybe only consider it for the data stress test?
In the SmartFrog configuration, I set the client I created to use the shared RCM:
{code}
// using shared instance of cache manager
HOTROD_SHARED extends ClientSettings {
    ...
}
{code}
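For reference, here is a minimal Java-side sketch of the shared-RCM idea discussed above; it is not the actual SmartFrog test component. One RemoteCacheManager is created per JVM and handed to every simulated client thread, so connections are pooled per server rather than per client. The server list and port shown are illustrative assumptions, and the connection pool would still need tuning as Galder suggests.

{code}
import java.util.Properties;
import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class SharedRcmClients {

    // One RemoteCacheManager for the whole JVM: it is thread-safe and keeps a
    // connection pool per server, instead of one connection set per client.
    private static final RemoteCacheManager SHARED_RCM = createManager();

    private static RemoteCacheManager createManager() {
        Properties props = new Properties();
        // illustrative server list -- the real test would point at perf17-perf20
        props.put("infinispan.client.hotrod.server_list",
                  "perf17:11222;perf18:11222;perf19:11222;perf20:11222");
        return new RemoteCacheManager(props);
    }

    // Each simulated client does a get followed by a put, all against the
    // single shared manager rather than a RemoteCacheManager of its own.
    static class Client implements Runnable {
        private final int id;

        Client(int id) {
            this.id = id;
        }

        public void run() {
            RemoteCache<String, byte[]> cache = SHARED_RCM.getCache();
            String key = "key-" + id;
            cache.get(key);
            cache.put(key, new byte[4294]); // 4294-byte payload as in the test
        }
    }
}
{code}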
There seem to be two issues here: (i) tests running slowly because of the length of time taken to load data between iterations, and (ii) tests hanging due to heaps filling up and triggering continuous full GCs, where some heaps reach 100% considerably faster than others.

To address (i), I ran a 3-iteration comparative test to check the data loading times. The initial load takes place with no client activity present; the iterations have 2000 clients interacting during the data load (which is where the problem arises in the data stress tests).

cluster size 4, 2000 clients, initial load 200k, iteration load 100k:
||client||initial load||1st iteration||2nd iteration||
|HotRod|1 min|22 min|23 min|
|HotRod (shared)|1 min|3 min|5 min|
|Memcached|1 min|5 min|7 min|
|REST|2 min|19 min|21 min|

So it appears that the choice of client significantly affects the data loading time. Even with the data loading time improved by choosing the HotRod shared or Memcached client, the loading times were still very bad when I moved to 4000 clients:

cluster size 4, 4000 clients, initial load 200k, iteration load 100k:
||client||initial load||1st iteration||2nd iteration||
|HotRod (shared)|1 min|45 min|didn't bother|

By using jstat -gcutil, I could see that 23 minutes of the 45-minute load time was taken up in GC, so throughput was 22/45*100, or roughly only 50%. This was using the following JVM settings:

java -server -Xmx4096m -XX:MaxPermSize=512m -XX:+UseLargePages -XX:NewRatio=3 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

By running some experiments on very small iterations with a fixed number of clients, I found the following relative results:

||NewRatio||Tenured % used||YGC||FGC||GC Time||
|1|64|1704|36|12 sec|
|2|69|2370|34|17 sec|
|3|78|3442|32|24 sec|
|4|83|4496|32|30 sec|

By changing the NewRatio from 3 to 1 (so that the heap space used for old objects is roughly the same size as that used for new objects) there was a big improvement in load times:

cluster size 4, 4000 clients, initial load 200k, iteration load 100k, NewRatio=3:
||Iteration||YGC||FGC||GC Time||Iteration Time||
|1st iteration|237602|400|23 min|45 min|
|2nd iteration|n/a|n/a|n/a|n/a|

cluster size 4, 4000 clients, initial load 200k, iteration load 100k, NewRatio=1:
||Iteration||YGC||FGC||GC Time||Iteration Time||
|1st iteration|63719|176|6 min|11 min|
|2nd iteration|108316|256|10 min|13 min|

I'm changing the defaults for the data stress tests to use NewRatio=1 to see if the tests now complete with larger numbers of clients.

Richard, what's the status on this? Galder

We traced the long load times down to two factors: (i) delay in increasing the thread count on each iteration - with large thread counts, it was taking a long time to create the new threads on all the load drivers - and (ii) delay in loading data into the grid while under client stress.

The first issue was addressed by parallelizing the SmartFrog controller code responsible for increasing the thread counts on each of the driver nodes; this was previously performed serially (create new threads on load driver 1, then create new threads on load driver 2, etc.). The second issue was addressed by adding a flag stopClientsOnLoad which causes all client-side drivers to stop interacting with the grid while the data in the grid is being updated, restarting them after the data is loaded and before the next iteration begins. GC tuning helped to some extent here, but the two measures above made the greatest difference. So we now have data stress test runs which do not suffer from the original issue.
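The SmartFrog loader and driver components themselves are not shown in this report, so the sketch below is an illustration only of the shape of the two fixes in plain Java: the key range for an iteration is split across a pool of loader threads, and client drivers are paused via a stopClientsOnLoad-style flag while the load runs. The class name, executor sizing and cache wiring here are assumptions, not the actual test framework code.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class ParallelDataLoader {

    // Mirrors the stopClientsOnLoad flag: client drivers check it and pause
    // their get/put loops while an iteration's data is being loaded.
    static final AtomicBoolean stopClientsOnLoad = new AtomicBoolean(false);

    // Load keys [firstKey, firstKey + count) using numThreads loader threads,
    // instead of a single serial loader thread.
    static void loadIteration(final RemoteCacheManager rcm, final long firstKey,
                              final long count, final int numThreads,
                              final int valueSize) throws Exception {
        stopClientsOnLoad.set(true);          // quiesce the client drivers
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
            long chunk = (count + numThreads - 1) / numThreads;
            for (int t = 0; t < numThreads; t++) {
                final long from = firstKey + t * chunk;
                final long to = Math.min(firstKey + count, from + chunk);
                tasks.add(new Callable<Void>() {
                    public Void call() {
                        RemoteCache<String, byte[]> cache = rcm.getCache();
                        for (long k = from; k < to; k++) {
                            cache.put("key-" + k, new byte[valueSize]);
                        }
                        return null;
                    }
                });
            }
            pool.invokeAll(tasks);            // block until every chunk is loaded
        } finally {
            pool.shutdown();
            stopClientsOnLoad.set(false);     // let the clients resume
        }
    }
}
{code}

The parallelised thread-count increase on the SmartFrog controller follows the same spirit: fanning a previously serial per-driver loop out over a thread pool.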
Right, so is there any reason to keep this JIRA open then? It doesn't look like it :) The issue was investigated and resolved.