Description of problem: This is the same test that found bug 1266112. Specifically, the tests starts a cluster of 2 servers, deploys a simple application and calls a remote EJB from a standalone client (actually 10 clients in 10 threads). During the test, both servers are repeatedly shut down and then brought back to life, but only one at a time. Around the time one of the servers is going down, a deadlock in the EJB client library occurs. This was found by investigating why the test appears to hang -- a simple `jstack -l $PID` reveals Found one Java-level deadlock: ============================= "Remoting "config-based-ejb-client-endpoint" read-1": waiting to lock monitor 0x00007fc2786a53f8 (object 0x000000078609c840, a java.util.Collections$SynchronizedSet), which is held by "Client09" "Client09": waiting to lock monitor 0x00007fc2788e4710 (object 0x000000078608c238, a java.util.IdentityHashMap), which is held by "Remoting "config-based-ejb-client-endpoint" read-1" Java stack information for the threads listed above: =================================================== "Remoting "config-based-ejb-client-endpoint" read-1": at java.util.Collections$SynchronizedCollection.remove(Collections.java:1580) - waiting to lock <0x000000078609c840> (a java.util.Collections$SynchronizedSet) at org.jboss.ejb.client.ClusterContext.receiverContextClosed(ClusterContext.java:414) at org.jboss.ejb.client.EJBClientContext.unregisterEJBReceiver(EJBClientContext.java:443) - locked <0x000000078608c238> (a java.util.IdentityHashMap) at org.jboss.ejb.client.EJBReceiverContext.close(EJBReceiverContext.java:59) at org.jboss.ejb.client.remoting.ChannelAssociation.notifyBrokenChannel(ChannelAssociation.java:407) at org.jboss.ejb.client.remoting.ChannelAssociation.access$100(ChannelAssociation.java:59) at org.jboss.ejb.client.remoting.ChannelAssociation$1.handleClose(ChannelAssociation.java:118) at org.jboss.ejb.client.remoting.ChannelAssociation$1.handleClose(ChannelAssociation.java:110) at org.jboss.remoting3.spi.SpiUtils.safeHandleClose(SpiUtils.java:54) at org.jboss.remoting3.spi.AbstractHandleableCloseable$CloseHandlerTask.run(AbstractHandleableCloseable.java:514) at org.jboss.remoting3.spi.AbstractHandleableCloseable.runCloseTask(AbstractHandleableCloseable.java:419) at org.jboss.remoting3.spi.AbstractHandleableCloseable.closeComplete(AbstractHandleableCloseable.java:290) at org.jboss.remoting3.remote.RemoteConnectionChannel.closeAction(RemoteConnectionChannel.java:531) at org.jboss.remoting3.spi.AbstractHandleableCloseable.closeAsync(AbstractHandleableCloseable.java:372) at org.jboss.remoting3.remote.RemoteConnectionHandler.closeAllChannels(RemoteConnectionHandler.java:437) at org.jboss.remoting3.remote.RemoteConnectionHandler.receiveCloseRequest(RemoteConnectionHandler.java:217) at org.jboss.remoting3.remote.RemoteReadListener.handleEvent(RemoteReadListener.java:113) at org.jboss.remoting3.remote.RemoteReadListener.handleEvent(RemoteReadListener.java:46) at org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:72) at org.xnio.channels.TranslatingSuspendableChannel.handleReadable(TranslatingSuspendableChannel.java:189) at org.xnio.channels.TranslatingSuspendableChannel$1.handleEvent(TranslatingSuspendableChannel.java:103) at org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:72) at org.xnio.nio.NioHandle.run(NioHandle.java:90) at org.xnio.nio.WorkerThread.run(WorkerThread.java:198) "Client09": at org.jboss.ejb.client.EJBClientContext.getNodeEJBReceiver(EJBClientContext.java:860) - waiting to lock <0x000000078608c238> (a java.util.IdentityHashMap) at org.jboss.ejb.client.EJBClientContext.getNodeEJBReceiver(EJBClientContext.java:846) at org.jboss.ejb.client.EJBClientContext.getNodeEJBReceiverContext(EJBClientContext.java:888) at org.jboss.ejb.client.ClusterContext.isNodeConnectedAndDeployed(ClusterContext.java:251) at org.jboss.ejb.client.ClusterContext.getConnectedAndDeployedNodes(ClusterContext.java:266) - locked <0x000000078609c840> (a java.util.Collections$SynchronizedSet) at org.jboss.ejb.client.ClusterContext.getEJBReceiverContext(ClusterContext.java:125) at org.jboss.ejb.client.ClusterContext.getEJBReceiverContext(ClusterContext.java:96) at org.jboss.ejb.client.EJBClientContext.getClusterEJBReceiverContext(EJBClientContext.java:958) at org.jboss.ejb.client.EJBClientContext.getClusterEJBReceiverContext(EJBClientContext.java:938) at org.jboss.ejb.client.ReceiverInterceptor.handleInvocation(ReceiverInterceptor.java:108) at org.jboss.ejb.client.EJBClientInvocationContext.sendRequest(EJBClientInvocationContext.java:186) at org.jboss.ejb.client.EJBInvocationHandler.sendRequestWithPossibleRetries(EJBInvocationHandler.java:255) at org.jboss.ejb.client.EJBInvocationHandler.doInvoke(EJBInvocationHandler.java:200) at org.jboss.ejb.client.EJBInvocationHandler.doInvoke(EJBInvocationHandler.java:183) at org.jboss.ejb.client.EJBInvocationHandler.invoke(EJBInvocationHandler.java:146) at com.sun.proxy.$Proxy23.test(Unknown Source) at org.jboss.as.test.clustering.lodh.june2013.tools.ClientThread.doCall(ClientThread.java:75) at org.jboss.as.test.clustering.lodh.june2013.tools.ClientThread.run(ClientThread.java:56) This appears to be caused by the fix for bug 1266112, which is commit https://github.com/jbossas/jboss-ejb-client/commit/66cda3bcd2 The fix itself is arguably correct, but it reveals another concurrency issue. Version-Release number of selected component (if applicable): EAP 6.4.6.CP.CR2 JBoss EJB Client 1.0.32.Final-redhat-1 Turns out that this version of EJB client also appears in EAP 6.4.5, so EAP 6.4.5 is most likely also affected. This wasn't discovered when verifying the fix for bug 1266112, because that bug appeared in scenarios where the EAP process was `kill -9`-ed, so I only used these "processKill" scenarios for verifying. On the other hand, this bug appears in the scenarios where EAP is shut down "gracefully". How reproducible: Often, but not 100% of the time. Steps to Reproduce: See a private comment below. Actual results: Deadlock occurs. Expected results: No deadlock occurs.
The deadlock is simple to reproduce. It happens when
For some reason my last commet is just half of it. The bug can be reproduced easily with a stateless ejb and 2 instances of the app server. If you shutdown one of the servers during an ejb-client invocation you get this error (it is a race condition). Reducing the scope of the connectedAndDeployed in the function getConnectedAndDeployedNodes solves the problem. PR upstream 7.0.0: https://github.com/jbossas/jboss-ejb-client/pull/138 PR 6.4.z: https://github.com/jbossas/jboss-ejb-client/pull/137
merged 6.4.z: 1.0.x https://github.com/jbossas/jboss-ejb-client/commit/3a19ab2113cce7026847c70058195b2c5d8dce28 @Brad this got merged in 6.4.z branch in ejb-client. this will get in the next ejb-client upgrade
Verified with EAP 6.4.8.CP.CR2
Enrique González Martínez <elguardian> updated the status of jira JBEAP-3114 to Resolved
Jiri Pallich <jpallich> updated the status of jira JBEAP-3114 to Closed
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.