| Summary: | performance regression AS 7.0.0.Beta3 + ISPN 5.0.0.CR4 -> AS 7.1.0.Alpha1-SNAPSHOT + ISPN 5.0.0.CR8 | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 6 | Reporter: | Michal Linhard <mlinhard> |
| Component: | Infinispan | Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.0.0 | CC: | galder.zamarreno, jdg-bugs, mlinhard, nobody, paul.ferraro, prabhat.jha, vladimir.blagojevic |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| URL: | http://jira.jboss.org/jira/browse/EDG-114 | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-07-27 15:36:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description
Michal Linhard
2011-07-20 13:54:30 UTC
Beta3 had a JGroups binding problem that caused it to bind to the internal network without reporting it, so another run emulates this binding. The performance is better but still poor (3x worse): https://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size4/51/artifact/report/log.txt

Actually, the performance is more than 5x worse with respect to max throughput.

I connected via JMX to each running EDG6 build instance and checked the settings (a sketch of this kind of JMX check follows the JGroups diff below).

The JGroups settings differ in these:
{code}
                      new  old
NAKACK.XmitTableSize    0    3
UDP.MaxBundleTimeout   30   20
{code}
(probably nothing significant)
The Hot Rod server settings are the same.
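For reference, this kind of runtime settings check can be done with the standard JDK JMX client. The sketch below is illustrative only: the service URL, ObjectName and attribute are placeholders, not the actual ones inspected in these runs.
{code}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxSettingsCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL -- the real host/port depend on how the node exposes JMX.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://perf17:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Placeholder ObjectName and attribute -- illustrative only.
            ObjectName udp = new ObjectName("jgroups:type=protocol,cluster=\"default\",protocol=UDP");
            Object value = mbsc.getAttribute(udp, "MaxBundleTimeout");
            System.out.println("UDP.MaxBundleTimeout = " + value);
        } finally {
            connector.close();
        }
    }
}
{code}
Reading the same attribute from both running builds and diffing the output is the kind of comparison shown above.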
Reformulated for the build containing ISPN 5.0.0.CR8.

I profiled perf17 for each build.
Snapshot for the old build: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size4/58/artifact/report/jprofiler-snapshot.jps
Snapshot for the new build: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size4/57/artifact/report/jprofiler-snapshot.jps

I patched Infinispan so that when creating DefaultCacheManager it dumps its config to stdout (a sketch of this approach follows the property diff below). This is the first iteration of the results; there might still be some issues with this approach, so I have to debug further.

Attachment: Added: config-new-build.txt
Attachment: Added: config-old-build.txt
Attachment: Added: diff-CR4.txt
Attachment: Added: diff-CR8.txt

The only noticeable thing that has changed since CR4 is ISPN-1217, but no virtual nodes are configured here, or at least not more than 1. I'm having a quick look at the snapshots.

Attaching updated configs.
Attachment: Added: config-new-build2.txt
Attachment: Added: config-old-build2.txt

The configuration of the two instances wasn't the same; I'm going to prepare new Hudson runs with updated/synced configs. These config properties were different (the value set in the new-build tests is in brackets):
{code}
infinispan.global.transport.strictPeerToPeer = true (new false)
infinispan.configuration.locking.isolationLevel = READ_COMMITTED (new REPEATABLE_READ)
infinispan.configuration.transaction.useSynchronization = false (new true)
infinispan.configuration.eviction.maxEntries = -1 (new 10000)
infinispan.configuration.clustering.stateRetrieval.timeout = 10000 (new 240000)
infinispan.configuration.clustering.hash.rehashWait = 120000 (new 60000)
infinispan.configuration.storeAsBinary.enabled = false (new true)
infinispan.configuration.lazyDeserialization.enabled = false (new true)
{code}
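For reference, here is a minimal sketch of the kind of patch that can produce such a dump, using plain JavaBean introspection over a configuration object. The class and property names are illustrative; this is not the exact code in the patch-5.0.0.CR*.txt attachments below.
{code}
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// Hypothetical helper: dump every readable bean property of a configuration object
// (e.g. the cache Configuration passed to DefaultCacheManager) so two builds can be diffed.
public final class ConfigDumper {
    public static void dump(String prefix, Object config) throws Exception {
        for (PropertyDescriptor pd : Introspector.getBeanInfo(config.getClass()).getPropertyDescriptors()) {
            if (pd.getReadMethod() == null) {
                continue; // skip write-only properties
            }
            Object value = pd.getReadMethod().invoke(config); // read the current value
            System.out.println(prefix + "." + pd.getName() + " = " + value);
        }
    }
}
{code}
Calling something like this for both the global and the default cache configuration in each build and diffing the stdout gives output comparable to the config-*.txt / diff-*.txt attachments.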
Attachment: Added: config-old-build2.txt

The patch*.txt files contain the patches that were used to print the configs.
Attachment: Added: patch-5.0.0.CR4.txt
Attachment: Added: patch-5.0.0.CR8.txt

I checked the JGroupsTransport.asyncExecutor settings (they differ between the two containers):

AS 7.1.0.Alpha1-SNAPSHOT: org.jboss.threads.JBossThreadPoolExecutor, corePoolSize 100, keepAliveTime 1000000000, maximumPoolSize 100
AS 7.0.0.Beta3: java.util.concurrent.ThreadPoolExecutor, corePoolSize 25, keepAliveTime 0, maximumPoolSize 25

(I'm not sure, however, whether this has any serious impact, or whether asyncExecutor is even in action with DIST_SYNC mode.)

JProfiler snapshots from the one-node tests:
NEW BUILD: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size1/5/artifact/report/jprofiler-snapshot.jps
OLD BUILD: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size1/6/artifact/report/jprofiler-snapshot.jps

The multinode tests (clusterSize > numOwners) have a problem of non-determinism (on each test run the keys are distributed differently), which leads to a possibly unfair performance comparison (even more so when profiling just one node).

This is not a problem of the Hot Rod server only: memcached also degraded between the said versions.
old build: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-memcached-size4/2/artifact/report/log.txt
new build: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-memcached-size4/4/artifact/report/log.txt

jstack dumps:
new build: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size4/60/artifact/report/jstack_dumps/
old build: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size4/61/artifact/report/jstack_dumps/

storeAsBinary + lazyDeserialization being enabled in the new-build configuration has serious performance effects due to ISPN-1267 (actually there's no way to configure this in AS7, probably for a good reason; see AS7-1385 for the discussion). This manifested when I created a patched AS7 build where I hardwired storeAsBinary to false. See the results of that build: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size4/62 (build here: http://www.qa.jboss.com/~mlinhard/edg6-SNAPSHOT-rev0068-ispn-5.0.0.CR8-as7.1.0.Alpha1-SNAPSHOT-110725-2203-patched.zip)

Link: Added: This issue depends on ISPN-1267

JProfiler snapshot of the patched build run: http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size1/7/artifact/report/jprofiler-snapshot.jps
Attachment: Added: as7-patch.txt

Michal, can't we run the same configuration described in https://issues.jboss.org/browse/JBPAPP-6865?focusedCommentId=12615897&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12615897 ? That would give us a lot of clues as to whether we introduced regressions somewhere in our code or whether we should look for clues in the configuration settings.
Btw, the interceptor chains are the same now:

7.1.0.Alpha1-SNAPSHOT
---------------------
>> org.infinispan.interceptors.InvocationContextInterceptor
>> org.infinispan.interceptors.CacheMgmtInterceptor
>> org.infinispan.interceptors.DistTxInterceptor
>> org.infinispan.interceptors.NotificationInterceptor
>> org.infinispan.interceptors.DistLockingInterceptor
>> org.infinispan.interceptors.DistributionInterceptor
>> org.infinispan.interceptors.CallInterceptor

7.0.0.Beta3
-----------
>> org.infinispan.interceptors.InvocationContextInterceptor
>> org.infinispan.interceptors.CacheMgmtInterceptor
>> org.infinispan.interceptors.DistTxInterceptor
>> org.infinispan.interceptors.NotificationInterceptor
>> org.infinispan.interceptors.DistLockingInterceptor
>> org.infinispan.interceptors.DistributionInterceptor
>> org.infinispan.interceptors.CallInterceptor

Michal, don't forget to check the Hot Rod client configurations to see whether there are any differences, and try to provide a couple of snapshots of the client side with the old setup and the new one.

Yeah, actually I forgot to mention this: all the test runs are done with the same client-side infrastructure. The config is the same, the SmartFrog code is the same, and the Infinispan versions are the same (5.0.0.CR8).

Ok, so assuming the client side is identical in all tests, it must be something on the server side. I don't think the AS has an impact here since the sockets are managed by ourselves. I tried to compare the server-side snapshots of the patched and old configs but saw no noticeable difference. It's also hard to compare them because the number of invocations is different. A way to narrow down the issue might be to compare a single-server setup with the old and patched setups, because we'd be comparing the same number of invocations in both cases. Besides, it could allow us to narrow down the scope of the issue, i.e. whether it's core itself or something clustering related. Btw, here's the list of issues resolved between CR5 and CR8: http://goo.gl/WpvQw

I'm looking into the patched JProfiler snapshot more closely now. I still think we should compare standalone behaviour with this test because that might give us some clues. For example Michal, comparing the two configs, I see different consistent hash classes being used. The old config used DefaultConsistentHash and the patched version uses TopologyAwareConsistentHash, and this is not covered in the differences you mentioned above. Btw, we should upgrade Hudson to the latest JProfiler version available to make sure we're getting the right data, and to avoid me seeing funny things as shown earlier with the MVI interceptor calls.

Correction: the patched version shows DefaultConsistentHash being used too, so the old and patched versions use the same consistent hash class.

So, I've gone through the put/get operation stack traces in the snapshots and they look almost identical, at least at the Infinispan level. The only difference I see is the retrieval of the cache from the cache manager. Whereas in the old codebase getting the cache simply meant accessing the DefaultCacheManager (pretty fast in the old snapshot: 0.9% of time, 9.7 secs for 297,000 invocations), in the new version it delegates to the central AS-managed cache manager via org.jboss.as.clustering.infinispan.DefaultEmbeddedCacheManager.getCache, which takes around 4.5% of time (30 secs for 328,011 invocations).
I'd find it hard to believe this is what's costing half of the performance, but this could easily be proved by taking Infinispan 5.0.0.CR8 and running it in AS 7.0.0.Beta3. If the performance is the same as with CR4, then the difference is in AS and not in Infinispan.

Hmm, so far I haven't tried that: ISPN 5.0.0.CR8 inside of AS 7.0.0.Beta3. I'm on it...

AS 7.0.0.Beta3 + 5.0.0.CR8 performs very similarly to the old config (AS 7.0.0.Beta3 + 5.0.0.CR4): http://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size4/65/artifact/report/log.txt So we've narrowed the problem down to AS7 code, EDG6 Alpha integration code, or configuration issues.

Very cool! I was looking into the traces Michal posted and I could not find any regressions in Infinispan!

Great work Galder. We both owe you one!

I compared the runtime state of the executors again, namely ASYNC_NOTIFICATION_EXECUTOR and ASYNC_TRANSPORT_EXECUTOR; the EVICTION_SCHEDULED_EXECUTOR and ASYNC_REPLICATION_QUEUE_EXECUTOR aren't used in our runs.
{code}
7.0.0.Beta3
-----------
ASYNC_NOTIFICATION_EXECUTOR
java.util.concurrent.ThreadPoolExecutor
corePoolSize 1
keepAliveTime 0
maximumPoolSize 1
ASYNC_TRANSPORT_EXECUTOR
java.util.concurrent.ThreadPoolExecutor
corePoolSize 25
keepAliveTime 0
maximumPoolSize 25
7.1.0.Alpha1-SNAPSHOT
---------------------
ASYNC_NOTIFICATION_EXECUTOR
org.jboss.threads.JBossThreadPoolExecutor
corePoolSize 100
keepAliveTime 1000000000
maximumPoolSize 100
ASYNC_TRANSPORT_EXECUTOR
org.jboss.threads.JBossThreadPoolExecutor
corePoolSize 100
keepAliveTime 1000000000
maximumPoolSize 100
{code}
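To put the numbers above in java.util.concurrent terms, the two transport pools roughly correspond to the following constructions. This is a sketch for illustration only: the real instances are created by the respective containers, JBossThreadPoolExecutor is approximated here by a plain ThreadPoolExecutor, and the keep-alive unit is assumed to be nanoseconds as reported over JMX.
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TransportPoolComparison {
    public static void main(String[] args) {
        // AS 7.0.0.Beta3 async transport pool: 25 fixed threads, no keep-alive.
        ThreadPoolExecutor oldPool = new ThreadPoolExecutor(
                25, 25, 0L, TimeUnit.NANOSECONDS, new LinkedBlockingQueue<Runnable>());

        // AS 7.1.0.Alpha1-SNAPSHOT async transport pool: 100 fixed threads,
        // keepAliveTime reported as 1000000000 (assumed nanoseconds, i.e. ~1 second).
        ThreadPoolExecutor newPool = new ThreadPoolExecutor(
                100, 100, 1000000000L, TimeUnit.NANOSECONDS, new LinkedBlockingQueue<Runnable>());

        System.out.println("old pool threads: " + oldPool.getMaximumPoolSize());
        System.out.println("new pool threads: " + newPool.getMaximumPoolSize());
        oldPool.shutdown();
        newPool.shutdown();
    }
}
{code}
Since corePoolSize equals maximumPoolSize in both cases, each pool behaves as a fixed-size pool; the main difference is simply 25 vs 100 threads.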
Michal, thanks for trying out the suggested combo, it's a pretty big step forward because it means we can eliminate Infinispan from the equation. There are at least a couple of listeners active in the Hot Rod server: one that calculates cluster ids for the version and another for detecting crashed members. Both are linked to view changes, so I don't think they should be having an impact here. Paul Ferraro would be the person to ask about whether any other listeners are attached. Btw, I'll attach a couple of screenshots for him to look at wrt getCache() differences. At first glance, the async transport executor should not be a factor here because the caches are configured with DIST_SYNC. Tbh, the pools are just bigger in the new AS instance, so I don't think that's an issue. Btw, any further profiling data should include at least org.jboss.as.* profiling. Gathering profiling data for org.jboss.* might be too much.

Attachment: Added: Old-getCache-profiling.png
Attachment: Added: Patched-getCache-profiling.png

Btw, we use different JGroups versions in the two builds: AS 7.0.0.Beta3 uses 2.12.0.Final, AS 7.1.0.Alpha1-SNAPSHOT uses 2.12.1.Final.

I was just thinking about that :) - it could potentially be a factor. Something worth trying is putting the same JGroups version + configuration in 7.0.0.Beta3 (+ 5.0.0.CR8) and seeing if the performance is still the expected one.

The JGroups runtime executor config isn't synced either:
{code}
7.1.0.Alpha1-SNAPSHOT
---------------------
default executor:
org.jboss.threads.JBossThreadPoolExecutor
corePoolSize 100
keepAliveTime 1000000000
maximumPoolSize 100
oob executor:
org.jboss.threads.JBossThreadPoolExecutor
corePoolSize 100
keepAliveTime 1000000000
maximumPoolSize 100
timer executor:
org.jboss.as.threads.ScheduledThreadPoolService$ExecutorImpl
corePoolSize 1
keepAliveTime 0
maximumPoolSize 2147483647
7.0.0.Beta3
-----------
default executor:
org.jgroups.util.ThreadManagerThreadPoolExecutor
corePoolSize 20
keepAliveTime 5000000000
maximumPoolSize 200
oob executor:
org.jgroups.util.ThreadManagerThreadPoolExecutor
corePoolSize 20
keepAliveTime 1000000000
maximumPoolSize 200
timer executor:
org.jgroups.util.ThreadManagerThreadPoolExecutor
corePoolSize 4
keepAliveTime 5000000000
maximumPoolSize 10
{code}
So the next step is to run with synced executor settings, and then sync the JGroups ones...

It won't be possible to sync the thread pool configs perfectly: the threads subsystem sets corePoolSize = maximumPoolSize, so I'll set the correct maximum and corePoolSize will stay unsynced.

Are your JGroups stacks identical for each run? Check for the existence of FLUSH in both stacks. One noteworthy change since 7.0.0.Final was the addition of this cache decoration: https://github.com/jbossas/jboss-as/blob/master/clustering/infinispan/src/main/java/org/jboss/as/clustering/infinispan/ClassLoaderAwareCache.java which adds a new interceptor to the top of the interceptor stack as well as cache listener decorations. This was added on July 14, so I don't know if your snapshot predates it. It will certainly have some performance implications (due to per-invocation TCCL switching, see the sketch at the end of this thread), though I would be surprised if it was solely responsible for the 0.5x performance drop.

The JGroups stacks are synced; FLUSH is in both.
old one: https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress-r17561/standalone/configuration/jgroups-datagrid-udp.xml
new one: https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/standalone/configuration/standalone.xml

The synced executor settings produced the same performance for 7.1.0.Alpha1-SNAPSHOT: https://hudson.qa.jboss.com/hudson/view/EDG6/job/edg-60-stress-client-hotrod-size4/66/artifact/report/log.txt

For future reference, this is what the stack looked like before the sync-up: http://pastebin.test.redhat.com/56781 Those JGroups thread pools were costing about half of the performance. It's rev 18388 of https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/standalone/configuration/standalone.xml (should the pastebin expire).

Closing, as this turned out to be a config issue.

Link: Added: This issue is related to JBPAPP-6918
Docs QE Status: Removed: NEW
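For reference, the per-invocation TCCL switching mentioned in the ClassLoaderAwareCache comment above generally follows the pattern below. This is a generic sketch of the technique only, not the actual AS7 interceptor code.
{code}
import java.util.concurrent.Callable;

// Generic sketch of per-invocation thread-context-classloader (TCCL) switching.
// Every wrapped invocation pays for two ClassLoader swaps, which is the overhead discussed above.
public final class TcclSwitcher {
    public static <T> T callWithTccl(ClassLoader loader, Callable<T> action) throws Exception {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);            // switch to the cache's classloader
        try {
            return action.call();                        // run the actual cache operation
        } finally {
            thread.setContextClassLoader(previous);      // always restore the caller's TCCL
        }
    }
}
{code}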