Description of problem:
Running a clustering test with 2 nodes throws multiple TimeoutException errors in cascade.

Version-Release number of selected component (if applicable):
JBoss Portal 6.2.ER04

How reproducible:
Always

Steps to Reproduce:
1. Set up a 2-node cluster using the default configuration plus an increased maximum number of HTTP connections.
2. Run a simple load test using JMeter (attached).

Actual results:
Multiple TimeoutException errors with a cascade of other errors.

Expected results:
No errors.

Additional info:
This issue is related to:
https://jira.exoplatform.org/browse/JCR-2292
https://community.jboss.org/message/868977#868977
A first test with JCR 1.6.1 and increased JGroups thread pools seems to solve the issue.
Created attachment 926021 [details]
JMeter test for 250 users

JMeter test for a cluster configured on localhost, with port 8080 for node1 and port 8180 for node2. There is no need to configure a load balancer in front to reproduce the test, as JMeter distributes requests across both nodes.
Update from tests run locally:
- Built a JBoss Portal 6.2.ER4 version with the eXo JCR 1.6.1 upgrade.
- Set these parameters on each node before the local tests:
  · standalone-ha.xml:
    <connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http" max-connections="2500"/>
  · gatein-udp.xml:
    thread_pool.max_threads="250"
    oob_thread_pool.max_threads="250"

Results:
- Timeouts mitigated for the same 250-user tests.
- Consistent JMeter results (several tests):
  #1: Make Report = 394518 in 900s = 438,4/s Avg: 282 Min: 3 Max: 149262 Err: 0 (0,00%)
  #2: Make Report = 481386 in 900s = 534,8/s Avg: 132 Min: 3 Max: 142365 Err: 0 (0,00%)
  #3: Make Report = 490973 in 908s = 541,0/s Avg: 96 Min: 3 Max: 121066 Err: 0 (0,00%)
  #4: Make Report = 462153 in 900s = 513,6/s Avg: 122 Min: 3 Max: 120416 Err: 0 (0,00%)
  #5: Make Report = 493668 in 900s = 548,5/s Avg: 116 Min: 3 Max: 166606 Err: 0 (0,00%)
- Test #1 is considered warmup.
- These tests are local, sharing resources with JMeter.
Update from tests run locally:
- Built a JBoss Portal 6.2.ER4 version with the eXo JCR 1.6.1 upgrade.
  · standalone.conf:
    -Xms2048m -Xmx2048m
  · standalone-ha.xml:
    <connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http" max-connections="2500"/>
  · gatein-udp.xml:
    thread_pool.max_threads="500"
    thread_pool.rejection_policy="Discard"
    oob_thread_pool.max_threads="500"
    oob_thread_pool.rejection_policy="Discard"

Results:
- Timeouts mitigated for the same 1000-user tests (anonymous test).
- Consistent JMeter results (several tests):
  #1: Make Report = 355620 in 1073s = 331,5/s Avg: 2309 Min: 3 Max: 562485 Err: 0 (0,00%)
  #2: Make Report = 425038 in 924s = 460,0/s Avg: 1399 Min: 3 Max: 433826 Err: 0 (0,00%)
  #3: Make Report = 411449 in 1039s = 396,2/s Avg: 1774 Min: 3 Max: 690993 Err: 0 (0,00%)
  #4: Make Report = 396284 in 996s = 398,1/s Avg: 1337 Min: 2 Max: 394994 Err: 0 (0,00%)
  #5: Make Report = 431482 in 995s = 433,7/s Avg: 1534 Min: 3 Max: 519495 Err: 0 (0,00%)
  # Total tests: 3127476
  # Long requests: 14432 (0.46146 % - more than 10 seconds)
- Test #1 is considered warmup.
- These tests are local, sharing resources with JMeter.
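For context, a sketch of where those dotted thread-pool attributes typically sit in a JGroups 3.x UDP stack (as used by gatein-udp.xml). This is illustrative only: the min_threads values and the trimmed protocol list are assumptions, not taken from this report.

```xml
<config xmlns="urn:org:jgroups">
    <!-- The thread_pool.* and oob_thread_pool.* settings are attributes
         of the transport protocol (UDP here), not separate elements. -->
    <UDP
        thread_pool.enabled="true"
        thread_pool.min_threads="20"
        thread_pool.max_threads="500"
        thread_pool.rejection_policy="Discard"

        oob_thread_pool.enabled="true"
        oob_thread_pool.min_threads="20"
        oob_thread_pool.max_threads="500"
        oob_thread_pool.rejection_policy="Discard"/>
    <!-- rest of the protocol stack (PING, FD_SOCK, pbcast.NAKACK, ...) unchanged -->
</config>
```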
Created attachment 928778 [details] Scalability script used with JMeter
Created attachment 928934 [details] stack traces from 2-node anonymous scalability test Attaching stack traces from a 2-node scalability test. - Node1 (perf11 machine): server.log has 170 MB; 99 % of the log contains SuspectExceptions. - Node2 (perf12 machine): server.log has 100 MB; first 30-40 % of the log is locking timeout exceptions; the rest contains JCR exceptions.
Created attachment 928938 [details]
FailoverPortal

Important note: the anonymous scalability tests in the MW lab are configured to put load on /portal/FailoverPortal/FailoverPage instead of /portal/classic/home. Attaching the FailoverPortal definition and the failover portlet. (The failover portlet on the page is mainly used for failover testing: it increments a replicated variable which is checked by the test after the failover.)
Created attachment 928977 [details]
stack traces with detected deadlocks

When deadlock detection is enabled, many deadlocks are reported on one of the portal nodes by Infinispan:
- node1 log: 300 MB; the first 50 % are deadlock exceptions, the rest are locking timeouts
- node2 log: 500 MB; 2x remote exceptions; the rest is locking/tx exceptions

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/epp6_portal_scalability_anonymous_2nodes/127/
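For reference, deadlock detection is switched on per cache in the Infinispan 5.x XML configuration; a minimal sketch (the cache name and spinDuration value are illustrative assumptions):

```xml
<namedCache name="portal-cache">
    <!-- When enabled, Infinispan probes for lock cycles instead of waiting
         for the full lock acquisition timeout; spinDuration is in ms. -->
    <deadlockDetection enabled="true" spinDuration="1000"/>
</namedCache>
```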
Additional info: Deadlocks were detected with both optimistic and pessimistic locking modes.

(In reply to Tomas Kyjovsky from comment #8)
> Created attachment 928977 [details]
> stack traces with detected deadlocks
>
> When deadlock detection is enabled, many deadlocks are reported on one of the
> portal nodes by Infinispan:
> - node1 log: 300 MB; the first 50 % are deadlock exceptions, the rest are
> locking timeouts
> - node2 log: 500 MB; 2x remote exceptions; the rest is locking/tx exceptions
>
> https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/epp6_portal_scalability_anonymous_2nodes/127/
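For reference, the locking mode is selected on the transaction element of the Infinispan 5.x cache configuration; a hedged sketch of the two variants that were tested (the cache name is an illustrative assumption):

```xml
<namedCache name="portal-cache">
    <!-- lockingMode accepts OPTIMISTIC (the default) or PESSIMISTIC;
         pessimistic mode takes cluster-wide locks eagerly at write time. -->
    <transaction transactionMode="TRANSACTIONAL" lockingMode="PESSIMISTIC"/>
</namedCache>
```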
Created attachment 931222 [details] perf11 machine server.log with JGroups in TRACE level
Created attachment 931223 [details] perf12 machine server.log with JGroups in TRACE level
Created attachment 931375 [details] Perf11 threaddump
Related, similar issue: https://www.mail-archive.com/infinispan-dev%40lists.jboss.org/msg08128.html
https://issues.jboss.org/browse/ISPN-4672
Changes applied in the last tests:

JBOSS_PORTAL_CONF=$JBOSS_PORTAL/gatein/gatein.ear/portal.war/WEB-INF/conf

- $JBOSS_PORTAL_CONF/infinispan/cluster/config.xml
  Replace the sync configuration with async:
  <async useReplQueue="true" />
- $JBOSS_PORTAL_CONF/common/common-configuration.xml
  Increase defaultWakeUpInterval in the GenericExoCacheCreator object:
  <field name="defaultWakeUpInterval"><long>30000</long></field>
- $JBOSS_PORTAL_CONF/portal/portal-configuration.xml
  Increase the lifespan value in the GenericExoCacheConfig object:
  <field name="lifespan"><long>${gatein.cache.page.expiration:3600000}</long></field>

Results:
- Reducing cache hits for checking eviction entries; results are under the expected limits.
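For context, a sketch of what the async replication element can look like in its surrounding clustering block in Infinispan 5.x (the replQueueInterval and replQueueMaxElements values are illustrative assumptions, not values from this test):

```xml
<clustering mode="replication">
    <!-- Replacing <sync/> with an async replication queue: writes are
         batched and flushed every replQueueInterval ms, or sooner once
         replQueueMaxElements updates have accumulated. -->
    <async useReplQueue="true"
           replQueueInterval="100"
           replQueueMaxElements="1000"/>
</clustering>
```

The trade-off is that asynchronous replication no longer blocks the caller on remote acknowledgements, which mitigates the timeouts but weakens consistency guarantees between the nodes.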
I think the problem here is closer to https://issues.jboss.org/browse/ISPN-2392. Infinispan 5.2.10 in replicated mode was still acquiring the key locks on all the nodes in parallel, so it was possible to get deadlocks. But the primary owner didn't do any RPCs while holding the locks, so ISPN-4672 doesn't apply.

I would also recommend setting <transaction useSynchronization="false"/>. With the default setting, errors during the commit phase are suppressed in the TM, and that's usually not a good thing.

Another thing I noticed in your configuration is that the queue for the OOB thread pool is enabled. We recommend disabling the OOB queue; it can make things worse when a commit or state transfer command is queued after a lot of prepare commands.
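The two recommendations above can be sketched as configuration fragments (hedged: the surrounding elements are trimmed, and the max_threads value shown is taken from the earlier tests, not prescribed here):

```xml
<!-- Infinispan cache configuration (e.g. config.xml): register with the TM
     as a full XA resource rather than a Synchronization, so errors in the
     commit phase are reported to the caller instead of being suppressed. -->
<transaction transactionMode="TRANSACTIONAL" useSynchronization="false"/>

<!-- gatein-udp.xml (JGroups 3.x transport attributes): disable the OOB
     queue so commit and state transfer commands are not stuck behind a
     backlog of queued prepare commands. -->
<UDP oob_thread_pool.queue_enabled="false"
     oob_thread_pool.max_threads="500"/>
```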
I can still see some TimeoutExceptions during the "logged users" scalability tests for CR1 and CR2; however, they are only sporadic occurrences compared to ER4. I don't consider this a blocker, especially because the performance improved, but I think we should investigate a bit more after the GA.
In the next ER we are downgrading the JCR component to the 1.15.x version, so these clustering issues shouldn't appear with that version.