Bug 1377744 - Fatal performance drop and client getting 503 after various errors occur (high load, distributed cache)
Summary: Fatal performance drop and client getting 503 after various errors occur (hig...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Clustering
Version: 6.4.8
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: ---
Assignee: jboss-set
QA Contact: Michal Vinkler
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-09-20 13:58 UTC by Michal Vinkler
Modified: 2019-03-01 12:28 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-01 12:28:23 UTC
Type: Bug


Attachments (Terms of Use)
Performance graph (152.47 KB, image/png)
2016-09-20 13:58 UTC, Michal Vinkler
no flags Details

Description Michal Vinkler 2016-09-20 13:58:53 UTC
Created attachment 1202903 [details]
Performance graph

We can see a performance drop and clients getting 503 as a response in our stress tests, which test performance of the cluster under increasing load (number of concurrent clients). The issue was hit when using an invalidation cache + a shared cache store.

Scenario description:
HTTP traffic accessing clustered web application that has replicated sessions (uses a mod_cluster load balancer). Delay between sending a new request after receiving a response is 100 ms (for each client). Session size is 17 KB.

Configuration:
 - 4-node EAP cluster with an invalidation cache + a shared cache store (remote JDG cluster)
 - 4-node JDG cluster with distributed cache
 - 4 nodes generating load (number of clients increases from 400 to 6000) 
 - cache mode: SYNC (for both invalidation and distributed caches, also "write-behind" element is set for "remote-store" element accordingly)

Links to configuration files:
SYNC scenario:
EAP http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/artifact/report/config/jboss-perf18/standalone-ha.xml
JDG http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/artifact/report/config/jboss-perf22/clustered.xml

EAP cluster responded normally until the moment when a load of 5600 concurrent clients was reached. After that, these messages were logged by all the EAP servers:


[JBossINF] [0m[0m09:41:43,498 INFO  [org.apache.tomcat.util] (ajp-perf19/10.16.90.56:8009-Acceptor-0) JBWEB003008: Maximum number of threads (4096) created for connector with address perf19/10.16.90.56 and port 8009
[JBossINF] [0m[33m09:41:44,346 WARN  [org.jboss.as.clustering.web.infinispan] (OOB-20286,shared=tcp) JBAS010325: Possible concurrency problem: Replicated version id 1772 is less than or equal to in-memory version for session 8LjrXm+TGZ-a4f-cPOUIvASX
[JBossINF] [0m[33m09:42:09,949 WARN  [org.jgroups.protocols.UNICAST2] (OOB-21385,shared=tcp) perf19/web: (requester=perf21/web) message perf21/web::14456540 not found in retransmission table of perf21/web:
[JBossINF] [14456345 | 14456345 | 14457490] (811 elements, 334 missing)

Log:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/console-perf19/

From client point of view, jvmRoute started changing around that time:
2016/09/19 09:41:31:166 EDT [INFO ][Runner - 906] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Failover detected, JvmRoute changed. perf20 -> perf18

And later, clients started getting 503. 

Log:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/console-perf29/

Also, approx. one minute later, perf20 started logging this error:

[JBossINF] [0m[31m09:43:43,622 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host]] (ajp-perf20/10.16.90.58:8009-2946) JBWEB000233: Exception sending request initialized lifecycle event to listener instance of class org.jboss.weld.servlet.WeldListener: java.lang.RuntimeException: JBAS018060: Exception acquiring ownership of ycRzHga1gII1klBJ1vBW5By1
[JBossINF] 	at org.jboss.as.web.session.ClusteredSession.access(ClusteredSession.java:494) [jboss-as-web-7.5.8.Final-redhat-2-bz-1370648.jar:7.5.8.Final-redhat-SNAPSHOT]
[JBossINF] 	at org.apache.catalina.connector.Request.doGetSession(Request.java:2638) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.connector.Request.getSession(Request.java:2382) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:791) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.getSession(LazySessionBeanStore.java:73) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.<init>(LazySessionBeanStore.java:59) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:39) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:22) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.servlet.WeldListener.requestInitialized(WeldListener.java:212) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:344) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:490) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:420) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:926) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91]
[JBossINF] Caused by: org.jboss.as.clustering.lock.TimeoutException: JBAS010213: Cannot acquire lock default-host/clusterbench-offload/ycRzHga1gII1klBJ1vBW5By1 from cluster
[JBossINF] 	at org.jboss.as.clustering.lock.SharedLocalYieldingClusterLockManager.lock(SharedLocalYieldingClusterLockManager.java:455)
[JBossINF] 	at org.jboss.as.clustering.web.infinispan.DistributedCacheManager.acquireSessionOwnership(DistributedCacheManager.java:400)
[JBossINF] 	at org.jboss.as.web.session.ClusteredSession.acquireSessionOwnership(ClusteredSession.java:520) [jboss-as-web-7.5.8.Final-redhat-2-bz-1370648.jar:7.5.8.Final-redhat-SNAPSHOT]
[JBossINF] 	at org.jboss.as.web.session.ClusteredSession.access(ClusteredSession.java:492) [jboss-as-web-7.5.8.Final-redhat-2-bz-1370648.jar:7.5.8.Final-redhat-SNAPSHOT]
[JBossINF] 	... 16 more

Log:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/console-perf20/

See the attachment "Performance graph" for graphs with results.

Link to full report with extended CPU and memory statistics:
http://download.eng.brq.redhat.com/scratch/mvinkler/reports/2016-09-20_15-55-35/stress.html

Horizontal axis represents number of concurrent clients generating load.
Vertical axis represents throughput = TPS (number of requests processed per second). 
Red line represents total TPS. Blue line represents number of requests requests processed in less than 3 seconds.

In short:
After reaching load of ~5600 clients, throughput dropped significantly and clients started logging errors.

Comment 1 Michal Vinkler 2016-09-21 10:30:32 UTC
Correction: the invalidation cache was not used in the above mentioned run, so the default cache "dist" in the "web" cache container was used. Nevertheless, these issues are still valid.


Note You need to log in before you can comment on or make changes to this bug.