This problem happens in a seemingly innocent scenario: EJB client invoking method on stateful bean in a "forwarding" cluster (2 nodes), this bean just forwards the call to stateful beans in a "target" cluster (2 nodes), and then back again. Invocations are serial; the client won't invoke a method on a bean until it got a response to previous invocation. During the test, one of the target servers is shut down, which is when the problem starts to appear. Yet, with EAP 6.1.1.ER3, I'm seeing this exception on the client: 2013/07/30 06:35:45:990 EDT [ERROR][Runner - 98] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error getting response. <javax.ejb.EJBTransactionRolledbackException: JBAS014360: EJB 3.1 FR 4.3.14.1 concurrent access timeout on org.jboss.invocation.InterceptorContext@1d31f5d0 - could not obtain lock within 5000 MILLISECONDS> javax.ejb.EJBTransactionRolledbackException: JBAS014360: EJB 3.1 FR 4.3.14.1 concurrent access timeout on org.jboss.invocation.InterceptorContext@1d31f5d0 - could not obtain lock within 5000 MILLISECONDS at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleInCallerTx(CMTTxInterceptor.java:138) at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:228) at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:317) at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:214) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) at org.jboss.as.ejb3.remote.EJBRemoteTransactionPropagatingInterceptor.processInvocation(EJBRemoteTransactionPropagatingInterceptor.java:79) ... and on the servers (both in the forwarding and target clusters): 06:35:45,866 ERROR [org.jboss.as.ejb3.invocation] (EJB default - 1) JBAS014134: EJB Invocation failed on component RemoteStatefulSBImpl for method public abstract int org.jboss.test.clusterbench.common.ejb.CommonStatefulSB.getSerialAndIncrement(): javax.ejb.EJBTransactionRolledbackException: JBAS014360: EJB 3.1 FR 4.3.14.1 concurrent access timeout on org.jboss.invocation.InterceptorContext@1d31f5d0 - could not obtain lock within 5000 MILLISECONDS at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleInCallerTx(CMTTxInterceptor.java:138) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4] at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:228) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4] at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:317) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4] at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:214) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4] at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.2.Final-redhat-1.jar:1.1.2.Final-redhat-1] at org.jboss.as.ejb3.remote.EJBRemoteTransactionPropagatingInterceptor.processInvocation(EJBRemoteTransactionPropagatingInterceptor.java:79) [jboss-as-ejb3-7.2.1.Final-redhat-4.jar:7.2.1.Final-redhat-4] ... Here's a test run with 200 clients with log levels of org.jboss.ejb.client and org.jboss.as.ejb3 set to DEBUG: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async-timeout-investigation/10/ There's an interesting exception in the log of perf20 (which is in the forwarding cluster): 06:35:45,137 DEBUG [org.jboss.as.ejb3.remote.DescriptorBasedEJBClientContextService] (ejb-client-context-tasks-3-thread-7) Reconnect attempt#1 failed for outbound connection service jboss.remoting.endpoint.subsystem.outbound-connection.remote-ejb-connection: java.lang.RuntimeException: Operation failed with status WAITING at org.jboss.ejb.client.remoting.IoFutureHelper.get(IoFutureHelper.java:93) at org.jboss.as.ejb3.remote.DescriptorBasedEJBClientContextService$OutboundConnectionReconnectHandler.reconnect(DescriptorBasedEJBClientContextService.java:219) at org.jboss.ejb.client.EJBClientContext$ReconnectAttempt.run(EJBClientContext.java:1287) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) [rt.jar:1.6.0_45] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) [rt.jar:1.6.0_45] at java.util.concurrent.FutureTask.run(FutureTask.java:138) [rt.jar:1.6.0_45] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_45] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_45] at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_45] ... This is possibly related. There's another exception in the log of perf18 (the node that is being shut down): 06:35:36,747 WARN [com.arjuna.ats.jta] (EJB default - 9) ARJUNA016029: SynchronizationImple.afterCompletion - failed for org.jboss.as.ejb3.component.stateful.StatefulSessionSynchronizationInterceptor$StatefulSessionSynchronization@5518a6e8 with exception: java.util.concurrent.RejectedExecutionException at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1774) [rt.jar:1.6.0_45] at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:768) [rt.jar:1.6.0_45] at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:215) [rt.jar:1.6.0_45] at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:397) [rt.jar:1.6.0_45] at java.util.concurrent.Executors$DelegatedScheduledExecutorService.schedule(Executors.java:650) [rt.jar:1.6.0_45] at org.jboss.as.ejb3.cache.impl.backing.PassivatingBackingCacheImpl.schedule(PassivatingBackingCacheImpl.java:293) ... This doesn't look related to me.
Is this a regression from 6.1.0 GA?
After some investigation, I think there is a bug in StatefulSessionSynchronizationInterceptor, specifically in the registered Synchronization responsible for releasing the SFSB instance lock. If the Synchronization callbacks throw an exception, the Lock.unlock() happens outside the context of a lock owner. This would cause a subsequent Lock.tryLock(...) for a new owner to fail. This seems to be an EJB bug and doesn't appear to be directly related to clustering.
Paul, thank you for the investigation. Reassigning to Jaikiran to take a look.
@Jaikiran I've opened and upstream jira for this as well: https://issues.jboss.org/browse/WFLY-1810
The upstream JIRA (https://issues.jboss.org/browse/WFLY-1810) contains a link to a pull request against WildFly (https://github.com/wildfly/wildfly/pull/4876). I took the patch, applied it to EAP and ran the test again. No good, the issue is still there :-( The run: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async-timeout-investigation/13/
Doh! I'm sorry, I incidentally used pure 6.1.1.ER4 without the fix in the build from comment 10. I did another run, this time with the fix for real, but the problem persists. See https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async-timeout-investigation/14/
I'm wondering if the issue is not created by the server trying to recover the transactions during shutdown.
Assigning jpai EJB issues to david.lloyd. Please re-assign to Cheng or others as needed.
@Láďo, I'm trying to get rid of old open bugzillas for EJB. Can you check that this still occurs in EAP 6.3? If yes, we can propose it for 6.4.
Yes, we're still seeing this, see e.g. https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async/39/ for EAP 6.3.0.ER9. Proposing for 6.4.
Ladislav, David, Could one of you please provide a draft release note in the Doc Text field above? I'm having trouble parsing the circumstances/consequences of the issue. Much appreciated.
There are 2 clusters, each of them having 2 nodes.