Bug 1377744

Summary: Fatal performance drop and client getting 503 after various errors occur (high load, distributed cache)
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: Michal Vinkler <mvinkler>
Component: ClusteringAssignee: jboss-set
Status: CLOSED WONTFIX QA Contact: Michal Vinkler <mvinkler>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.4.8CC: paul.ferraro
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-01 12:28:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
Performance graph none

Description Michal Vinkler 2016-09-20 13:58:53 UTC
Created attachment 1202903 [details]
Performance graph

We can see a performance drop and clients getting 503 as a response in our stress tests, which test performance of the cluster under increasing load (number of concurrent clients). The issue was hit when using an invalidation cache + a shared cache store.

Scenario description:
HTTP traffic accessing clustered web application that has replicated sessions (uses a mod_cluster load balancer). Delay between sending a new request after receiving a response is 100 ms (for each client). Session size is 17 KB.

Configuration:
 - 4-node EAP cluster with an invalidation cache + a shared cache store (remote JDG cluster)
 - 4-node JDG cluster with distributed cache
 - 4 nodes generating load (number of clients increases from 400 to 6000) 
 - cache mode: SYNC (for both invalidation and distributed caches, also "write-behind" element is set for "remote-store" element accordingly)

Links to configuration files:
SYNC scenario:
EAP http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/artifact/report/config/jboss-perf18/standalone-ha.xml
JDG http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/artifact/report/config/jboss-perf22/clustered.xml

EAP cluster responded normally until the moment when a load of 5600 concurrent clients was reached. After that, these messages were logged by all the EAP servers:


[JBossINF] [0m[0m09:41:43,498 INFO  [org.apache.tomcat.util] (ajp-perf19/10.16.90.56:8009-Acceptor-0) JBWEB003008: Maximum number of threads (4096) created for connector with address perf19/10.16.90.56 and port 8009
[JBossINF] [0m[33m09:41:44,346 WARN  [org.jboss.as.clustering.web.infinispan] (OOB-20286,shared=tcp) JBAS010325: Possible concurrency problem: Replicated version id 1772 is less than or equal to in-memory version for session 8LjrXm+TGZ-a4f-cPOUIvASX
[JBossINF] [0m[33m09:42:09,949 WARN  [org.jgroups.protocols.UNICAST2] (OOB-21385,shared=tcp) perf19/web: (requester=perf21/web) message perf21/web::14456540 not found in retransmission table of perf21/web:
[JBossINF] [14456345 | 14456345 | 14457490] (811 elements, 334 missing)

Log:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/console-perf19/

From client point of view, jvmRoute started changing around that time:
2016/09/19 09:41:31:166 EDT [INFO ][Runner - 906] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Failover detected, JvmRoute changed. perf20 -> perf18

And later, clients started getting 503. 

Log:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/console-perf29/

Also, approx. one minute later, perf20 started logging this error:

[JBossINF] [0m[31m09:43:43,622 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host]] (ajp-perf20/10.16.90.58:8009-2946) JBWEB000233: Exception sending request initialized lifecycle event to listener instance of class org.jboss.weld.servlet.WeldListener: java.lang.RuntimeException: JBAS018060: Exception acquiring ownership of ycRzHga1gII1klBJ1vBW5By1
[JBossINF] 	at org.jboss.as.web.session.ClusteredSession.access(ClusteredSession.java:494) [jboss-as-web-7.5.8.Final-redhat-2-bz-1370648.jar:7.5.8.Final-redhat-SNAPSHOT]
[JBossINF] 	at org.apache.catalina.connector.Request.doGetSession(Request.java:2638) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.connector.Request.getSession(Request.java:2382) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:791) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.getSession(LazySessionBeanStore.java:73) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.<init>(LazySessionBeanStore.java:59) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:39) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:22) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.servlet.WeldListener.requestInitialized(WeldListener.java:212) [weld-core-1.1.33.Final-redhat-1.jar:1.1.33.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:344) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:490) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:420) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:926) [jbossweb-7.5.16.Final-redhat-1.jar:7.5.16.Final-redhat-1]
[JBossINF] 	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91]
[JBossINF] Caused by: org.jboss.as.clustering.lock.TimeoutException: JBAS010213: Cannot acquire lock default-host/clusterbench-offload/ycRzHga1gII1klBJ1vBW5By1 from cluster
[JBossINF] 	at org.jboss.as.clustering.lock.SharedLocalYieldingClusterLockManager.lock(SharedLocalYieldingClusterLockManager.java:455)
[JBossINF] 	at org.jboss.as.clustering.web.infinispan.DistributedCacheManager.acquireSessionOwnership(DistributedCacheManager.java:400)
[JBossINF] 	at org.jboss.as.web.session.ClusteredSession.acquireSessionOwnership(ClusteredSession.java:520) [jboss-as-web-7.5.8.Final-redhat-2-bz-1370648.jar:7.5.8.Final-redhat-SNAPSHOT]
[JBossINF] 	at org.jboss.as.web.session.ClusteredSession.access(ClusteredSession.java:492) [jboss-as-web-7.5.8.Final-redhat-2-bz-1370648.jar:7.5.8.Final-redhat-SNAPSHOT]
[JBossINF] 	... 16 more

Log:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-stress-remote-jdg-session-invalidation-sync-4nodes-perf17/9/console-perf20/

See the attachment "Performance graph" for graphs with results.

Link to full report with extended CPU and memory statistics:
http://download.eng.brq.redhat.com/scratch/mvinkler/reports/2016-09-20_15-55-35/stress.html

Horizontal axis represents number of concurrent clients generating load.
Vertical axis represents throughput = TPS (number of requests processed per second). 
Red line represents total TPS. Blue line represents number of requests requests processed in less than 3 seconds.

In short:
After reaching load of ~5600 clients, throughput dropped significantly and clients started logging errors.

Comment 1 Michal Vinkler 2016-09-21 10:30:32 UTC
Correction: the invalidation cache was not used in the above mentioned run, so the default cache "dist" in the "web" cache container was used. Nevertheless, these issues are still valid.