Description of problem:
Running a clustering test with 2 nodes throws multiple TimeoutException errors in cascade.

Version-Release number of selected component (if applicable):
JBoss Portal 6.2.ER04

How reproducible:
Always

Steps to Reproduce:
1. Set up a 2-node cluster using the default configuration plus an increased maximum number of HTTP connections.
2. Run a simple load test using JMeter (attached).

Actual results:
Multiple TimeoutException errors with a cascade of other errors.

Expected results:
No errors.

Additional info:
This issue is related to:
https://jira.exoplatform.org/browse/JCR-2292
https://community.jboss.org/message/868977#868977
A first test with JCR 1.6.1 and increased JGroups thread pools seems to solve the issue.
Created attachment 926021 [details]
JMeter test for 250 users

JMeter test for a cluster configured on localhost, with port 8080 for node1 and port 8180 for node2. There is no need to configure a load balancer in front to reproduce the test, as JMeter distributes requests across both nodes.
Update from tests run locally:
- Built a JBoss Portal 6.2.ER4 version with the eXo JCR 1.6.1 upgrade.
- Set these parameters on each node before the local tests:
  · standalone-ha.xml:
    <connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http" max-connections="2500"/>
  · gatein-udp.xml:
    thread_pool.max_threads="250"
    oob_thread_pool.max_threads="250"

Results:
- Timeouts mitigated for the same 250-user tests.
- Consistent JMeter results (several tests):
  #1: Make Report = 394518 in 900s = 438,4/s Avg: 282 Min: 3 Max: 149262 Err: 0 (0,00%)
  #2: Make Report = 481386 in 900s = 534,8/s Avg: 132 Min: 3 Max: 142365 Err: 0 (0,00%)
  #3: Make Report = 490973 in 908s = 541,0/s Avg: 96 Min: 3 Max: 121066 Err: 0 (0,00%)
  #4: Make Report = 462153 in 900s = 513,6/s Avg: 122 Min: 3 Max: 120416 Err: 0 (0,00%)
  #5: Make Report = 493668 in 900s = 548,5/s Avg: 116 Min: 3 Max: 166606 Err: 0 (0,00%)
- Test #1 is considered warmup.
- These tests are local, sharing resources with JMeter.
Update from tests run locally:
- Built a JBoss Portal 6.2.ER4 version with the eXo JCR 1.6.1 upgrade.
  · standalone.conf:
    -Xms2048m -Xmx2048m
  · standalone-ha.xml:
    <connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http" max-connections="2500"/>
  · gatein-udp.xml:
    thread_pool.max_threads="500"
    thread_pool.rejection_policy="Discard"
    oob_thread_pool.max_threads="500"
    oob_thread_pool.rejection_policy="Discard"

Results:
- Timeouts mitigated for the same 1000-user tests (anonymous test).
- Consistent JMeter results (several tests):
  #1: Make Report = 355620 in 1073s = 331,5/s Avg: 2309 Min: 3 Max: 562485 Err: 0 (0,00%)
  #2: Make Report = 425038 in 924s = 460,0/s Avg: 1399 Min: 3 Max: 433826 Err: 0 (0,00%)
  #3: Make Report = 411449 in 1039s = 396,2/s Avg: 1774 Min: 3 Max: 690993 Err: 0 (0,00%)
  #4: Make Report = 396284 in 996s = 398,1/s Avg: 1337 Min: 2 Max: 394994 Err: 0 (0,00%)
  #5: Make Report = 431482 in 995s = 433,7/s Avg: 1534 Min: 3 Max: 519495 Err: 0 (0,00%)
  # Total tests: 3127476
  # Long requests: 14432 (0.46146 % - more than 10 seconds)
- Test #1 is considered warmup.
- These tests are local, sharing resources with JMeter.
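For context, a sketch of where those dotted thread-pool attributes typically sit in a JGroups 3.x UDP stack (as used by gatein-udp.xml). This is illustrative only: the min_threads values and the trimmed protocol list are assumptions, not taken from this report.

```xml
<config xmlns="urn:org:jgroups">
    <!-- The thread_pool.* and oob_thread_pool.* settings are attributes
         of the transport protocol (UDP here), not separate elements. -->
    <UDP
        thread_pool.enabled="true"
        thread_pool.min_threads="20"
        thread_pool.max_threads="500"
        thread_pool.rejection_policy="Discard"

        oob_thread_pool.enabled="true"
        oob_thread_pool.min_threads="20"
        oob_thread_pool.max_threads="500"
        oob_thread_pool.rejection_policy="Discard"/>
    <!-- rest of the protocol stack (PING, FD_SOCK, pbcast.NAKACK, ...) unchanged -->
</config>
```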
Created attachment 928778 [details] Scalability script used with JMeter
Created attachment 928934 [details] stack traces from 2-node anonymous scalability test Attaching stack traces from a 2-node scalability test. - Node1 (perf11 machine): server.log has 170 MB; 99 % of the log contains SuspectExceptions. - Node2 (perf12 machine): server.log has 100 MB; first 30-40 % of the log is locking timeout exceptions; the rest contains JCR exceptions.
Created attachment 928938 [details]
FailoverPortal

Important note: the anonymous scalability tests in the MW lab are configured to put load on /portal/FailoverPortal/FailoverPage instead of /portal/classic/home. Attaching the FailoverPortal definition and the failover portlet. (The failover portlet on the page is mainly used for failover testing: it increments a replicated variable which is checked by the test after the failover.)
Created attachment 928977 [details]
stack traces with detected deadlocks

When deadlock detection is enabled, many deadlocks are reported on one of the portal nodes by Infinispan:
- node1 log: 300 MB; the first 50 % are deadlock exceptions, the rest are locking timeouts
- node2 log: 500 MB; 2x remote exceptions; the rest is locking/tx exceptions

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/epp6_portal_scalability_anonymous_2nodes/127/
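For reference, deadlock detection is switched on per cache in the Infinispan 5.x XML configuration; a minimal sketch (the cache name and spinDuration value are illustrative assumptions):

```xml
<namedCache name="portal-cache">
    <!-- When enabled, Infinispan probes for lock cycles instead of waiting
         for the full lock acquisition timeout; spinDuration is in ms. -->
    <deadlockDetection enabled="true" spinDuration="1000"/>
</namedCache>
```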
Additional info: Deadlocks were detected with both optimistic and pessimistic locking modes.

(In reply to Tomas Kyjovsky from comment #8)
> Created attachment 928977 [details]
> stack traces with detected deadlocks
>
> When deadlock detection is enabled, many deadlocks are reported on one of the
> portal nodes by Infinispan:
> - node1 log: 300 MB; the first 50 % are deadlock exceptions, the rest are
> locking timeouts
> - node2 log: 500 MB; 2x remote exceptions; the rest is locking/tx exceptions
>
> https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/epp6_portal_scalability_anonymous_2nodes/127/
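For reference, the locking mode is selected on the transaction element of the Infinispan 5.x cache configuration; a hedged sketch of the two variants that were tested (the cache name is an illustrative assumption):

```xml
<namedCache name="portal-cache">
    <!-- lockingMode accepts OPTIMISTIC (the default) or PESSIMISTIC;
         pessimistic mode takes cluster-wide locks eagerly at write time. -->
    <transaction transactionMode="TRANSACTIONAL" lockingMode="PESSIMISTIC"/>
</namedCache>
```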
Created attachment 931222 [details] perf11 machine server.log with JGroups in TRACE level
Created attachment 931223 [details] perf12 machine server.log with JGroups in TRACE level
Created attachment 931375 [details] Perf11 threaddump
Related, similar issue: https://www.mail-archive.com/infinispan-dev%40lists.jboss.org/msg08128.html
https://issues.jboss.org/browse/ISPN-4672
Changes applied in the last tests:

JBOSS_PORTAL_CONF=$JBOSS_PORTAL/gatein/gatein.ear/portal.war/WEB-INF/conf

- $JBOSS_PORTAL_CONF/infinispan/cluster/config.xml
  Replace the sync configuration with async:
  <async useReplQueue="true" />
- $JBOSS_PORTAL_CONF/common/common-configuration.xml
  Increase defaultWakeUpInterval in the GenericExoCacheCreator object:
  <field name="defaultWakeUpInterval"><long>30000</long></field>
- $JBOSS_PORTAL_CONF/portal/portal-configuration.xml
  Increase the lifespan value in the GenericExoCacheConfig object:
  <field name="lifespan"><long>${gatein.cache.page.expiration:3600000}</long></field>

Results:
- Reducing cache hits for checking eviction entries; results are under the expected limits.
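For context, a sketch of what the async replication element can look like in its surrounding clustering block in Infinispan 5.x (the replQueueInterval and replQueueMaxElements values are illustrative assumptions, not values from this test):

```xml
<clustering mode="replication">
    <!-- Replacing <sync/> with an async replication queue: writes are
         batched and flushed every replQueueInterval ms, or sooner once
         replQueueMaxElements updates have accumulated. -->
    <async useReplQueue="true"
           replQueueInterval="100"
           replQueueMaxElements="1000"/>
</clustering>
```

The trade-off is that asynchronous replication no longer blocks the caller on remote acknowledgements, which mitigates the timeouts but weakens consistency guarantees between the nodes.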
I think the problem here is closer to https://issues.jboss.org/browse/ISPN-2392. Infinispan 5.2.10 in replicated mode was still acquiring the key locks on all the nodes in parallel, so it was possible to get deadlocks. But the primary owner didn't do any RPCs while holding the locks, so ISPN-4672 doesn't apply.

I would also recommend setting <transaction useSynchronization="false"/>. With the default setting, errors during the commit phase are suppressed in the TM, and that's usually not a good thing.

Another thing I noticed in your configuration is that the queue for the OOB thread pool is enabled. We recommend disabling the OOB queue; it can make things worse when a commit or state transfer command is queued after a lot of prepare commands.
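The two recommendations above can be sketched as configuration fragments (hedged: the surrounding elements are trimmed, and the max_threads value shown is taken from the earlier tests, not prescribed here):

```xml
<!-- Infinispan cache configuration (e.g. config.xml): register with the TM
     as a full XA resource rather than a Synchronization, so errors in the
     commit phase are reported to the caller instead of being suppressed. -->
<transaction transactionMode="TRANSACTIONAL" useSynchronization="false"/>

<!-- gatein-udp.xml (JGroups 3.x transport attributes): disable the OOB
     queue so commit and state transfer commands are not stuck behind a
     backlog of queued prepare commands. -->
<UDP oob_thread_pool.queue_enabled="false"
     oob_thread_pool.max_threads="500"/>
```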
I can still see some TimeoutExceptions during the "logged users" scalability tests for CR1 and CR2; however, they are only sporadic occurrences compared to ER4. I don't consider this a blocker, especially because the performance improved, but I think we should investigate a bit more after the GA.
In the next ER we are downgrading the JCR component to the 1.15.x version, so these clustering issues shouldn't appear with that version.