1029010 – Session failover not working, reporting TimeoutException in ClusteredSession.access

Bug 1029010 - Session failover not working, reporting TimeoutException in ClusteredSession.access

Summary: Session failover not working, reporting TimeoutException in ClusteredSession....

Keywords:
Status:	CLOSED DUPLICATE of bug 993041
Alias:	None
Product:	JBoss Enterprise Application Platform 6
Classification:	JBoss
Component:	Clustering
Sub Component:
Version:	6.2.0
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Paul Ferraro
QA Contact:	Jitka Kozana
Docs Contact:	Russell Dickenson
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-11-11 13:25 UTC by Hynek Mlnarik
Modified:	2015-09-01 03:34 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-12-12 16:53:53 UTC
Type:	Bug
Embargoed:

Attachments	(Terms of Use)

Description Hynek Mlnarik 2013-11-11 13:25:01 UTC

Seen the following exception while testing 6.2.0.ER7 in Amazon EC2. Likely unrelated to EC2 though, as it seems very similar to BZ 993041.

[JBossINF] 09:52:42,723 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host]] (ajp-ec2-54-234-10-125.compute-1.amazonaws.com/10.240.62.148:8009-27) JBWEB000233: Exception sending request initialized lifecycle event to listener instance of class org.jboss.weld.servlet.WeldListener: java.lang.RuntimeException: JBAS018060: Exception acquiring ownership of Ii1200JBAh6THJcuF44u0mbC
[JBossINF]      at org.jboss.as.web.session.ClusteredSession.access(ClusteredSession.java:494) [jboss-as-web.jar:7.3.0.Final-redhat-10]
[JBossINF]      at org.apache.catalina.connector.Request.doGetSession(Request.java:2616) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.connector.Request.getSession(Request.java:2361) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:790) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.getSession(LazySessionBeanStore.java:73) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.<init>(LazySessionBeanStore.java:59) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:28) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:16) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.jboss.weld.servlet.WeldListener.requestInitialized(WeldListener.java:145) [weld-core.jar:1.1.16.Final-redhat-1]
[JBossINF]      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:134) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:488) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:420) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920) [jbossweb.jar:7.2.2.Final-redhat-1]
[JBossINF]      at java.lang.Thread.run(Thread.java:679) [rt.jar:1.6.0_24]
[JBossINF] Caused by: org.jboss.as.clustering.lock.TimeoutException: JBAS010213: Cannot acquire lock default-host/clusterbench/Ii1200JBAh6THJcuF44u0mbC from cluster
[JBossINF]      at org.jboss.as.clustering.lock.SharedLocalYieldingClusterLockManager.lock(SharedLocalYieldingClusterLockManager.java:439)
[JBossINF]      at org.jboss.as.clustering.web.infinispan.DistributedCacheManager.acquireSessionOwnership(DistributedCacheManager.java:400)
[JBossINF]      at org.jboss.as.web.session.ClusteredSession.acquireSessionOwnership(ClusteredSession.java:520) [jboss-as-web.jar:7.3.0.Final-redhat-10]
[JBossINF]      at org.jboss.as.web.session.ClusteredSession.access(ClusteredSession.java:492) [jboss-as-web.jar:7.3.0.Final-redhat-10]
[JBossINF]      ... 16 more

The exception is thrown upon taking one of the server in the cluster down. Sessions that belonged to the failed node seem to go in vain with the server: new sessions are created upon attempt to access server using already existing sessions, while it is expected that failover would proceed the request. This implies the client will lose all their session data.

The web cache-container default cache is "repl".

Comment 3 Aleksandar Kostadinov 2013-11-12 20:02:00 UTC

There are also other relevant server exceptions. In my opinion they are related to the exception in the bug description so I'm not filing another issue but adding to this one.

> 09:52:43,121 WARN  [org.jboss.as.clustering.web.infinispan] (Incoming-8,shared=tcp) JBAS010325: Possible concurrency problem: Replicated version id 31 is less than or equal to in-memory version for session jt-Hvt0b9x0YDY088h83XR8h
...
> 09:54:26,168 ERROR [org.jboss.as.clustering] (Incoming-1,shared=tcp) JBAS010223: ViewAccepted failed: java.lang.IllegalStateException: JBAS010240: Address 8abf6d5e-45dc-c7dc-2f86-72794f5f87ff not registered in transport layer
...
> 09:54:26,687 ERROR [org.jboss.as.clustering] (Incoming-10,shared=tcp) JBAS010223: ViewAccepted failed: java.lang.IllegalStateException: JBAS010240: Address a87af105-2853-f02d-cee2-041c74bd1399 not registered in transport layer
...
> 09:54:26,825 WARN  [org.jgroups.protocols.TCP] (ajp-ec2-54-234-10-125.compute-1.amazonaws.com/10.240.62.148:8009-39) null: logical address cache didn't contain all physical address, sending up a discovery request
...

All the above lines are from config/jboss-ec2-54-234-10-125.compute-1.amazonaws.com/server.log
Server logs of all servers participating in the test scenario can be found in the already attached archive. FYI during the test there were 4 servers in a cluster and 200 concurrent clients. One by one servers are killed (kill -9) and then started again.

I see failures after 3 of the four server kills. Which is IMO very strange. I'm running the same test dozens of times without being able to reproduce. So having this issue happen 3 out of 4 times in a single test must mean something. I'll be running the test till tomorrow but I'm losing hope we can reproduce.

P.S. for easier log reading, here are the times at which each server was killed:
> domU-12-31-39-04-39-66_sf.log:2013/11/06 09:55:36:358 EST [DEBUG][TerminatorThread] HOST domU-12-31-39-07-C5-96.compute-1.internal:rootProcess:failoverTestExecution - Issuing command: sh -c kill -9 `ps xfu | grep 'o8POK1sF4CPtw7QU' | grep -v 'grep' | sed -re '1,$s/[ \t]+/ /g' | cut -d ' ' -f 2`;
> domU-12-31-39-06-90-08_sf.log:2013/11/06 09:51:49:086 EST [DEBUG][TerminatorThread] HOST domU-12-31-39-07-C5-96.compute-1.internal:rootProcess:failoverTestExecution - Issuing command: sh -c kill -9 `ps xfu | grep 'HLDqw0ark9XcMNCM' | grep -v 'grep' | sed -re '1,$s/[ \t]+/ /g' | cut -d ' ' -f 2`;
> ip-10-138-52-150_sf.log:2013/11/06 10:03:05:153 EST [DEBUG][TerminatorThread] HOST domU-12-31-39-07-C5-96.compute-1.internal:rootProcess:failoverTestExecution - Issuing command: sh -c kill -9 `ps xfu | grep 'Hdf06hHdi5gDdU7g' | grep -v 'grep' | sed -re '1,$s/[ \t]+/ /g' | cut -d ' ' -f 2`;
> ip-10-151-106-31_sf.log:2013/11/06 09:59:19:216 EST [DEBUG][TerminatorThread] HOST domU-12-31-39-07-C5-96.compute-1.internal:rootProcess:failoverTestExecution - Issuing command: sh -c kill -9 `ps xfu | grep '1aMS9DHcEPZFda3K' | grep -v 'grep' | sed -re '1,$s/[ \t]+/ /g' | cut -d ' ' -f 2`;

Comment 4 Aleksandar Kostadinov 2013-11-15 12:34:43 UTC

Running for extended periods of time I couldn't reproduce the above issue. The only thing is that in 3 runs relatively close to each other in time, I got total of 4 failed samples:
> SFCORE_LOG - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 140, received 139, Runner: 186>

I think this is unrelated though and expected to happen rarely.

Comment 6 Paul Ferraro 2013-12-12 16:53:53 UTC


*** This bug has been marked as a duplicate of bug 993041 ***

Note You need to log in before you can comment on or make changes to this bug.