Bug 1029010

Summary: Session failover not working, reporting TimeoutException in ClusteredSession.access
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: Hynek Mlnarik <hmlnarik>
Component: ClusteringAssignee: Paul Ferraro <paul.ferraro>
Status: CLOSED DUPLICATE QA Contact: Jitka Kozana <jkudrnac>
Severity: high Docs Contact: Russell Dickenson <rdickens>
Priority: unspecified    
Version: 6.2.0CC: akostadi, pslavice, rhusar
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-12-12 16:53:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Hynek Mlnarik 2013-11-11 13:25:01 UTC
Seen the following exception while testing 6.2.0.ER7 in Amazon EC2. Likely unrelated to EC2 though, as it seems very similar to BZ 993041.

[JBossINF] 09:52:42,723 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host]] (ajp-ec2-54-234-10-125.compute-1.amazonaws.com/10.240.62.148:8009-27) JBWEB000233: Exception sending request initialized lifecycle event to listener instance of class org.jboss.weld.servlet.WeldListener: java.lang.RuntimeException: JBAS018060: Exception acquiring ownership of Ii1200JBAh6THJcuF44u0mbC
[JBossINF]      at org.jboss.as.web.session.ClusteredSession.access(ClusteredSession.java:494) [jboss-as-web.jar:7.3.0.Final-redhat-10]
[JBossINF]      at org.apache.catalina.connector.Request.doGetSession(Request.java:2616) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.connector.Request.getSession(Request.java:2361) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:790) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.getSession(LazySessionBeanStore.java:73) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.<init>(LazySessionBeanStore.java:59) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:28) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:16) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.jboss.weld.servlet.WeldListener.requestInitialized(WeldListener.java:145) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:134) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:488) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:420) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at java.lang.Thread.run(Thread.java:679) [rt.jar:1.6.0_24]
[JBossINF] Caused by: org.jboss.as.clustering.lock.TimeoutException: JBAS010213: Cannot acquire lock default-host/clusterbench/Ii1200JBAh6THJcuF44u0mbC from cluster
[JBossINF]      at org.jboss.as.clustering.lock.SharedLocalYieldingClusterLockManager.lock(SharedLocalYieldingClusterLockManager.java:439)
[JBossINF]      at org.jboss.as.clustering.web.infinispan.DistributedCacheManager.acquireSessionOwnership(DistributedCacheManager.java:400)
[JBossINF]      at org.jboss.as.web.session.ClusteredSession.acquireSessionOwnership(ClusteredSession.java:520) [jboss-as-web.jar:7.3.0.Final-redhat-10]
[JBossINF]      at org.jboss.as.web.session.ClusteredSession.access(ClusteredSession.java:492) [jboss-as-web.jar:7.3.0.Final-redhat-10]
[JBossINF]      ... 16 more

The exception is thrown upon taking one of the server in the cluster down. Sessions that belonged to the failed node seem to go in vain with the server: new sessions are created upon attempt to access server using already existing sessions, while it is expected that failover would proceed the request. This implies the client will lose all their session data.

The web cache-container default cache is "repl".

Comment 3 Aleksandar Kostadinov 2013-11-12 20:02:00 UTC
There are also other relevant server exceptions. In my opinion they are related to the exception in the bug description so I'm not filing another issue but adding to this one.

> 09:52:43,121 WARN  [org.jboss.as.clustering.web.infinispan] (Incoming-8,shared=tcp) JBAS010325: Possible concurrency problem: Replicated version id 31 is less than or equal to in-memory version for session jt-Hvt0b9x0YDY088h83XR8h
...
> 09:54:26,168 ERROR [org.jboss.as.clustering] (Incoming-1,shared=tcp) JBAS010223: ViewAccepted failed: java.lang.IllegalStateException: JBAS010240: Address 8abf6d5e-45dc-c7dc-2f86-72794f5f87ff not registered in transport layer
...
> 09:54:26,687 ERROR [org.jboss.as.clustering] (Incoming-10,shared=tcp) JBAS010223: ViewAccepted failed: java.lang.IllegalStateException: JBAS010240: Address a87af105-2853-f02d-cee2-041c74bd1399 not registered in transport layer
...
> 09:54:26,825 WARN  [org.jgroups.protocols.TCP] (ajp-ec2-54-234-10-125.compute-1.amazonaws.com/10.240.62.148:8009-39) null: logical address cache didn't contain all physical address, sending up a discovery request
...

All the above lines are from config/jboss-ec2-54-234-10-125.compute-1.amazonaws.com/server.log
Server logs of all servers participating in the test scenario can be found in the already attached archive. FYI during the test there were 4 servers in a cluster and 200 concurrent clients. One by one servers are killed (kill -9) and then started again.

I see failures after 3 of the four server kills. Which is IMO very strange. I'm running the same test dozens of times without being able to reproduce. So having this issue happen 3 out of 4 times in a single test must mean something. I'll be running the test till tomorrow but I'm losing hope we can reproduce.

P.S. for easier log reading, here are the times at which each server was killed:
> domU-12-31-39-04-39-66_sf.log:2013/11/06 09:55:36:358 EST [DEBUG][TerminatorThread] HOST domU-12-31-39-07-C5-96.compute-1.internal:rootProcess:failoverTestExecution - Issuing command: sh -c kill -9 `ps xfu | grep 'o8POK1sF4CPtw7QU' | grep -v 'grep' | sed -re '1,$s/[ \t]+/ /g' | cut -d ' ' -f 2`;
> domU-12-31-39-06-90-08_sf.log:2013/11/06 09:51:49:086 EST [DEBUG][TerminatorThread] HOST domU-12-31-39-07-C5-96.compute-1.internal:rootProcess:failoverTestExecution - Issuing command: sh -c kill -9 `ps xfu | grep 'HLDqw0ark9XcMNCM' | grep -v 'grep' | sed -re '1,$s/[ \t]+/ /g' | cut -d ' ' -f 2`;
> ip-10-138-52-150_sf.log:2013/11/06 10:03:05:153 EST [DEBUG][TerminatorThread] HOST domU-12-31-39-07-C5-96.compute-1.internal:rootProcess:failoverTestExecution - Issuing command: sh -c kill -9 `ps xfu | grep 'Hdf06hHdi5gDdU7g' | grep -v 'grep' | sed -re '1,$s/[ \t]+/ /g' | cut -d ' ' -f 2`;
> ip-10-151-106-31_sf.log:2013/11/06 09:59:19:216 EST [DEBUG][TerminatorThread] HOST domU-12-31-39-07-C5-96.compute-1.internal:rootProcess:failoverTestExecution - Issuing command: sh -c kill -9 `ps xfu | grep '1aMS9DHcEPZFda3K' | grep -v 'grep' | sed -re '1,$s/[ \t]+/ /g' | cut -d ' ' -f 2`;

Comment 4 Aleksandar Kostadinov 2013-11-15 12:34:43 UTC
Running for extended periods of time I couldn't reproduce the above issue. The only thing is that in 3 runs relatively close to each other in time, I got total of 4 failed samples:
> SFCORE_LOG - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 140, received 139, Runner: 186>

I think this is unrelated though and expected to happen rarely.

Comment 6 Paul Ferraro 2013-12-12 16:53:53 UTC

*** This bug has been marked as a duplicate of bug 993041 ***