Bug 990102

Summary:	Concurrent access timeout -- could not obtain lock within 5000 MILLISECONDS
Product:	[JBoss] JBoss Enterprise Application Platform 6	Reporter:	Ladislav Thon <lthon>
Component:	EJB	Assignee:	David M. Lloyd <david.lloyd>
Status:	CLOSED EOL	QA Contact:	Michal Vinkler <mvinkler>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	6.1.1	CC:	anmiller, cdewolf, dandread, david.lloyd, ehugonne, jawilson, jkudrnac, jmartisk, lthon, myarboro, paul.ferraro, rhusar, thofman
Target Milestone:	---
Target Release:	EAP 6.4.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Known Issue
Doc Text:	This release of JBoss EAP 6 carries a bug that could produce a `concurrent access timeout` when an EJB client invoking a method on a stateful bean in a "forwarding" cluster; this bean forwards the call to stateful beans in a "target" cluster, and then back again. Invocations are serial; the client will not invoke a method on a bean until it got a response to previous invocation. When one of the servers in the cluster is shut down, the error occurs.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-08-19 12:44:29 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ladislav Thon 2013-07-30 11:38:36 UTC

This problem happens in a seemingly innocent scenario: EJB client invoking method on stateful bean in a "forwarding" cluster (2 nodes), this bean just forwards the call to stateful beans in a "target" cluster (2 nodes), and then back again. Invocations are serial; the client won't invoke a method on a bean until it got a response to previous invocation. During the test, one of the target servers is shut down, which is when the problem starts to appear.

Yet, with EAP 6.1.1.ER3, I'm seeing this exception on the client:

2013/07/30 06:35:45:990 EDT [ERROR][Runner - 98] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error getting response. <javax.ejb.EJBTransactionRolledbackException: JBAS014360: EJB 3.1 FR 4.3.14.1 concurrent access timeout on org.jboss.invocation.InterceptorContext@1d31f5d0 - could not obtain lock within 5000 MILLISECONDS>
javax.ejb.EJBTransactionRolledbackException: JBAS014360: EJB 3.1 FR 4.3.14.1 concurrent access timeout on org.jboss.invocation.InterceptorContext@1d31f5d0 - could not obtain lock within 5000 MILLISECONDS
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleInCallerTx(CMTTxInterceptor.java:138)
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:228)
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:317)
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:214)
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
	at org.jboss.as.ejb3.remote.EJBRemoteTransactionPropagatingInterceptor.processInvocation(EJBRemoteTransactionPropagatingInterceptor.java:79)
	...

and on the servers (both in the forwarding and target clusters):

06:35:45,866 ERROR [org.jboss.as.ejb3.invocation] (EJB default - 1) JBAS014134: EJB Invocation failed on component RemoteStatefulSBImpl for method public abstract int org.jboss.test.clusterbench.common.ejb.CommonStatefulSB.getSerialAndIncrement(): javax.ejb.EJBTransactionRolledbackException: JBAS014360: EJB 3.1 FR 4.3.14.1 concurrent access timeout on org.jboss.invocation.InterceptorContext@1d31f5d0 - could not obtain lock within 5000 MILLISECONDS
 	at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleInCallerTx(CMTTxInterceptor.java:138) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4]
 	at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:228) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4]
 	at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:317) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4]
 	at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:214) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4]
 	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.2.Final-redhat-1.jar:1.1.2.Final-redhat-1]
 	at org.jboss.as.ejb3.remote.EJBRemoteTransactionPropagatingInterceptor.processInvocation(EJBRemoteTransactionPropagatingInterceptor.java:79) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4]
 	...

Here's a test run with 200 clients with log levels of org.jboss.ejb.client and org.jboss.as.ejb3 set to DEBUG: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async-timeout-investigation/10/

There's an interesting exception in the log of perf20 (which is in the forwarding cluster):

06:35:45,137 DEBUG [org.jboss.as.ejb3.remote.DescriptorBasedEJBClientContextService] (ejb-client-context-tasks-3-thread-7) Reconnect attempt#1 failed for outbound connection service jboss.remoting.endpoint.subsystem.outbound-connection.remote-ejb-connection: java.lang.RuntimeException: Operation failed with status WAITING
 	at org.jboss.ejb.client.remoting.IoFutureHelper.get(IoFutureHelper.java:93)
 	at org.jboss.as.ejb3.remote.DescriptorBasedEJBClientContextService$OutboundConnectionReconnectHandler.reconnect(DescriptorBasedEJBClientContextService.java:219)
 	at org.jboss.ejb.client.EJBClientContext$ReconnectAttempt.run(EJBClientContext.java:1287)
 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) [rt.jar:1.6.0_45]
 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) [rt.jar:1.6.0_45]
 	at java.util.concurrent.FutureTask.run(FutureTask.java:138) [rt.jar:1.6.0_45]
 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_45]
 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_45]
 	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_45]
 	...

This is possibly related. There's another exception in the log of perf18 (the node that is being shut down):

06:35:36,747 WARN  [com.arjuna.ats.jta] (EJB default - 9) ARJUNA016029: SynchronizationImple.afterCompletion - failed for org.jboss.as.ejb3.component.stateful.StatefulSessionSynchronizationInterceptor$StatefulSessionSynchronization@5518a6e8 with exception: java.util.concurrent.RejectedExecutionException
 	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1774) [rt.jar:1.6.0_45]
 	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:768) [rt.jar:1.6.0_45]
 	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:215) [rt.jar:1.6.0_45]
 	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:397) [rt.jar:1.6.0_45]
 	at java.util.concurrent.Executors$DelegatedScheduledExecutorService.schedule(Executors.java:650) [rt.jar:1.6.0_45]
 	at org.jboss.as.ejb3.cache.impl.backing.PassivatingBackingCacheImpl.schedule(PassivatingBackingCacheImpl.java:293)
 	...

This doesn't look related to me.

Comment 1 Radoslav Husar 2013-07-30 11:42:32 UTC

Is this a regression from 6.1.0 GA?

Comment 5 Paul Ferraro 2013-08-01 19:56:59 UTC

After some investigation, I think there is a bug in StatefulSessionSynchronizationInterceptor, specifically in the registered Synchronization responsible for releasing the SFSB instance lock.  If the Synchronization callbacks throw an exception, the Lock.unlock() happens outside the context of a lock owner.  This would cause a subsequent Lock.tryLock(...) for a new owner to fail.

This seems to be an EJB bug and doesn't appear to be directly related to clustering.

Comment 6 Jitka Kozana 2013-08-02 05:02:36 UTC

Paul, thank you for the investigation. Reassigning to Jaikiran to take a look.

Comment 7 Paul Ferraro 2013-08-02 12:33:49 UTC

@Jaikiran I've opened and upstream jira for this as well:
https://issues.jboss.org/browse/WFLY-1810

Comment 10 Ladislav Thon 2013-08-07 07:47:26 UTC

The upstream JIRA (https://issues.jboss.org/browse/WFLY-1810) contains a link to a pull request against WildFly (https://github.com/wildfly/wildfly/pull/4876). I took the patch, applied it to EAP and ran the test again. No good, the issue is still there :-(

The run: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async-timeout-investigation/13/

Comment 11 Ladislav Thon 2013-08-07 08:12:59 UTC

Doh! I'm sorry, I incidentally used pure 6.1.1.ER4 without the fix in the build from comment 10.

I did another run, this time with the fix for real, but the problem persists. See https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async-timeout-investigation/14/

Comment 13 Emmanuel Hugonnet (ehsavoie) 2013-08-13 08:18:45 UTC

I'm wondering if the issue is not created by the server trying to recover the transactions during shutdown.

Comment 15 Dimitris Andreadis 2013-10-24 18:27:05 UTC

Assigning jpai EJB issues to david.lloyd. Please re-assign to Cheng or others as needed.

Comment 16 Jan Martiska 2014-07-02 08:00:22 UTC

@Láďo, I'm trying to get rid of old open bugzillas for EJB. Can you check that this still occurs in EAP 6.3? If yes, we can propose it for 6.4.

Comment 17 Ladislav Thon 2014-07-07 06:50:47 UTC

Yes, we're still seeing this, see e.g. https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async/39/ for EAP 6.3.0.ER9. Proposing for 6.4.

Comment 19 Scott Mumford 2014-07-09 04:06:47 UTC

Ladislav, David,

Could one of you please provide a draft release note in the Doc Text field above? I'm having trouble parsing the circumstances/consequences of the issue.

Much appreciated.

Comment 21 Ladislav Thon 2014-09-11 15:06:14 UTC

There are 2 clusters, each of them having 2 nodes.