Bug 1129223 - Clustering tests throw TimeoutException errors
Summary: Clustering tests throw TimeoutException errors
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Portal Platform 6
Classification: JBoss
Component: Portal
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: CR01
: 6.2.0
Assignee: Lucas Ponce
QA Contact: Tomas Kyjovsky
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-08-12 10:46 UTC by Lucas Ponce
Modified: 2015-02-19 16:33 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-19 16:33:51 UTC
Type: Bug


Attachments (Terms of Use)
JMeter test for 250 users (10.84 KB, text/plain)
2014-08-12 10:48 UTC, Lucas Ponce
no flags Details
Scalability script used with JMeter (320 bytes, text/plain)
2014-08-20 12:03 UTC, Lucas Ponce
no flags Details
stack traces from 2-node anonymous scalability test (10.38 KB, application/zip)
2014-08-20 19:55 UTC, Tomas Kyjovsky
no flags Details
FailoverPortal (7.16 KB, application/zip)
2014-08-20 20:14 UTC, Tomas Kyjovsky
no flags Details
stack traces with detected deadlocks (17.49 KB, application/zip)
2014-08-20 22:55 UTC, Tomas Kyjovsky
no flags Details
perf11 machine server.log with JGroups in TRACE level (10.53 MB, application/octet-stream)
2014-08-27 02:18 UTC, Lucas Ponce
no flags Details
perf12 machine server.log with JGroups in TRACE level (10.94 MB, application/octet-stream)
2014-08-27 02:22 UTC, Lucas Ponce
no flags Details
Perf11 threaddump (3.00 MB, application/octet-stream)
2014-08-27 10:22 UTC, Lucas Ponce
no flags Details

Description Lucas Ponce 2014-08-12 10:46:52 UTC
Description of problem:

Running a clustering test with 2 nodes throws multiple TimeoutException errors in cascade.

Version-Release number of selected component (if applicable):

JBoss Portal 6.2.ER04.

How reproducible:

Always.

Steps to Reproduce:
1. Set up a 2 nodes cluster using default configuration + increment of max http connections.
2. Running a simple load test using JMeter (attached).


Actual results:

Multiple TimeoutException errors with cascade of other errors

Expected results:

No errors.

Additional info:

This issue is related to:

https://jira.exoplatform.org/browse/JCR-2292

and 

https://community.jboss.org/message/868977#868977

a first test with JCR 1.6.1 and increase on jgroups pools seems to solve the issue.

Comment 1 Lucas Ponce 2014-08-12 10:48:33 UTC
Created attachment 926021 [details]
JMeter test for 250 users

JMeter test for cluster configured in localhost, with 8080 port for node1 and 8180 port for node2.
No need to configure a loadbalancer in front to reproduce test, as JMeter distributes requests on both nodes.

Comment 3 Lucas Ponce 2014-08-13 18:05:45 UTC
Update from tests ran locally:

- Build a JBoss Portal 6.2.ER4 version with eXo JCR 1.6.1 upgrade.

- Setting these parameters before local tests on each node:

  · standalone-ha.xml: 

<connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http" max-connections="2500"/>

  · gatein-udp.xml:

thread_pool.max_threads="250"
oob_thread_pool.max_threads="250"

Results:

- Timeout mitigated for same 250 users tests.
- Consistency on JMeter results (several tests):

#1: Make Report = 394518 in   900s =  438,4/s Avg:   282 Min:     3 Max: 149262 Err:     0 (0,00%)
#2: Make Report = 481386 in   900s =  534,8/s Avg:   132 Min:     3 Max: 142365 Err:     0 (0,00%)
#3: Make Report = 490973 in   908s =  541,0/s Avg:    96 Min:     3 Max: 121066 Err:     0 (0,00%)
#4: Make Report = 462153 in   900s =  513,6/s Avg:   122 Min:     3 Max: 120416 Err:     0 (0,00%)
#5: Make Report = 493668 in   900s =  548,5/s Avg:   116 Min:     3 Max: 166606 Err:     0 (0,00%)

- Test #1 is considered as warmup.
- These tests are local, sharing resources with JMeter.

Comment 4 Lucas Ponce 2014-08-14 15:57:43 UTC
Update from tests ran locally:

- Build a JBoss Portal 6.2.ER4 version with eXo JCR 1.6.1 upgrade.

  · standalone.conf:

-Xms2048m -Xmx2048m

  · standalone-ha.xml: 

<connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http" max-connections="2500"/>

  · gatein-udp.xml:

thread_pool.max_threads="500"
thread_pool.rejection_policy="Discard"

oob_thread_pool.max_threads="500"
oob_thread_pool.rejection_policy="Discard"

Results:

- Timeout mitigated for same 1000 users tests (anonymous test).
- Consistency on JMeter results (several tests):

#1: Make Report = 355620 in  1073s =  331,5/s Avg:  2309 Min:     3 Max: 562485 Err:     0 (0,00%)
#2: Make Report = 425038 in   924s =  460,0/s Avg:  1399 Min:     3 Max: 433826 Err:     0 (0,00%)
#3: Make Report = 411449 in  1039s =  396,2/s Avg:  1774 Min:     3 Max: 690993 Err:     0 (0,00%)
#4: Make Report = 396284 in   996s =  398,1/s Avg:  1337 Min:     2 Max: 394994 Err:     0 (0,00%)
#5: Make Report = 431482 in   995s =  433,7/s Avg:  1534 Min:     3 Max: 519495 Err:     0 (0,00%)

# Total tests: 3127476
# Long requests: 14432 (0.46146 % - more than 10 seconds)

- Test #1 is considered as warmup.
- These tests are local, sharing resources with JMeter.

Comment 5 Lucas Ponce 2014-08-20 12:03:24 UTC
Created attachment 928778 [details]
Scalability script used with JMeter

Comment 6 Tomas Kyjovsky 2014-08-20 19:55:36 UTC
Created attachment 928934 [details]
stack traces from 2-node anonymous scalability test

Attaching stack traces from a 2-node scalability test.
- Node1 (perf11 machine): server.log has 170 MB; 99 % of the log contains SuspectExceptions.
- Node2 (perf12 machine): server.log has 100 MB; first 30-40 % of the log is locking timeout exceptions; the rest contains JCR exceptions.

Comment 7 Tomas Kyjovsky 2014-08-20 20:14:13 UTC
Created attachment 928938 [details]
FailoverPortal

Important note: The anonymous scalability tests in MW lab are confingured to put load on /portal/FailoverPortal/FailoverPage instead of /portal/classic/home. Attaching the FailoverPortal definition and failover portlet.

(The failover portlet which is on the page is mainly used for failover testing. It increments a replicated variable which is checked by the test after the failover.)

Comment 8 Tomas Kyjovsky 2014-08-20 22:55:56 UTC
Created attachment 928977 [details]
stack traces with detected deadlocks

When deadlock detection is enabled many deadlocks are reported on one of the portal nodes by infinispan:
- node1 log: 300 MB; first 50% are deadlock exceptions the rest are locking timeouts
- node2 log: 500 MB; 2x remote exception; the rest is locking/tx exceptions

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/epp6_portal_scalability_anonymous_2nodes/127/

Comment 10 Tomas Kyjovsky 2014-08-21 17:43:01 UTC
Additional info: Deadlocks were detected with both optimistic and pessimistic locking mode.

(In reply to Tomas Kyjovsky from comment #8)
> Created attachment 928977 [details]
> stack traces with detected deadlocks
> 
> When deadlock detection is enabled many deadlocks are reported on one of the
> portal nodes by infinispan:
> - node1 log: 300 MB; first 50% are deadlock exceptions the rest are locking
> timeouts
> - node2 log: 500 MB; 2x remote exception; the rest is locking/tx exceptions
> 
> https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/
> epp6_portal_scalability_anonymous_2nodes/127/

Comment 13 Lucas Ponce 2014-08-27 02:18:11 UTC
Created attachment 931222 [details]
perf11 machine server.log with JGroups in TRACE level

Comment 14 Lucas Ponce 2014-08-27 02:22:38 UTC
Created attachment 931223 [details]
perf12 machine server.log with JGroups in TRACE level

Comment 15 Lucas Ponce 2014-08-27 10:22:34 UTC
Created attachment 931375 [details]
Perf11 threaddump

Comment 16 Lucas Ponce 2014-08-27 11:38:01 UTC
Related:

Similar issue:

https://www.mail-archive.com/infinispan-dev%40lists.jboss.org/msg08128.html

Comment 17 Lucas Ponce 2014-08-27 20:47:49 UTC
https://issues.jboss.org/browse/ISPN-4672

Comment 18 Lucas Ponce 2014-08-28 21:58:10 UTC
Changes applied in last tests:

JBOSS_PORTAL_CONF=$JBOSS_PORTAL/gatein/gatein.ear/portal.war/WEB-INF/conf

- $JBOSS_PORTAL_CONF/infinispan/cluster/config.xml

  Change sync configuration per async:

  <async useReplQueue="true" />

- $JBOSS_PORTAL_CONF/common/common-configuration.xml
  
  Increase defaultWakeUpInterval in GenericExoCacheCreator object:

  <field name="defaultWakeUpInterval"><long>30000</long></field>

- $JBOSS_PORTAL_CONF/portal/portal-configuration.xml

  Increase lifespam value in GenericExoCacheConfig object:

  <field name="lifespan"><long>${gatein.cache.page.expiration:3600000}</long></field>

Results:

- Reducin hits on cache for checking eviction entries, results are under expected limits.

Comment 19 Dan Berindei 2014-08-29 19:13:39 UTC
I thing the problem here is closer to https://issues.jboss.org/browse/ISPN-2392

Infinispan 5.2.10 in replicated mode was still acquiring the key locks on all the nodes in parallel, so it was possible to get deadlocks. But the primary owner didn't do any RPCs while holding the locks, so ISPN-4672 doesn't apply.

I would also recommend setting <transaction useSynchronization="false"/>. With the default setting, errors during the commit phase are suppressed in the TM, and that's usually not a good thing.

Another thing I noticed in your configuration is that the queue for the OOB thread pool is enabled. We recommend disabling the OOB queue, it can make things worse when a commit or state transfer command is queued after a lot of prepare commands.

Comment 20 Tomas Kyjovsky 2014-10-10 09:22:41 UTC
I can still see some TimeoutExceptions during "logged users" scalability tests for CR1 and CR2, however it's only sporadic occurence compared to ER4. I don't consider this a blocker especially because the performance improved but I think should investigate a bit more after the GA.

Comment 21 Lucas Ponce 2015-02-19 16:33:51 UTC
In the next ER we have a downgrade of JCR component to 1.15.x version.
So clustering issues shouldn't appear with that version.


Note You need to log in before you can comment on or make changes to this bug.