Bug 1299844

Summary: [QA](6.4.z) Deadlock in EJB client after fix for bug 1266112
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: Ladislav Thon <lthon>
Component: Clustering, EJBAssignee: Enrique Gonzalez Martinez <egonzale>
Status: CLOSED CURRENTRELEASE QA Contact: Ladislav Thon <lthon>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.4.6CC: aogburn, asaji, bbaranow, bmaxwell, cdewolf, egonzale, istudens, jbilek, paul.ferraro
Target Milestone: CR1   
Target Release: EAP 6.4.8   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-17 12:32:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1279553, 1299983    

Description Ladislav Thon 2016-01-19 11:49:16 UTC
Description of problem:

This is the same test that found bug 1266112. Specifically, the tests starts a cluster of 2 servers, deploys a simple application and calls a remote EJB from a standalone client (actually 10 clients in 10 threads). During the test, both servers are repeatedly shut down and then brought back to life, but only one at a time.

Around the time one of the servers is going down, a deadlock in the EJB client library occurs. This was found by investigating why the test appears to hang -- a simple `jstack -l $PID` reveals

Found one Java-level deadlock:
=============================
"Remoting "config-based-ejb-client-endpoint" read-1":
  waiting to lock monitor 0x00007fc2786a53f8 (object 0x000000078609c840, a java.util.Collections$SynchronizedSet),
  which is held by "Client09"
"Client09":
  waiting to lock monitor 0x00007fc2788e4710 (object 0x000000078608c238, a java.util.IdentityHashMap),
  which is held by "Remoting "config-based-ejb-client-endpoint" read-1"

Java stack information for the threads listed above:
===================================================
"Remoting "config-based-ejb-client-endpoint" read-1":
	at java.util.Collections$SynchronizedCollection.remove(Collections.java:1580)
	- waiting to lock <0x000000078609c840> (a java.util.Collections$SynchronizedSet)
	at org.jboss.ejb.client.ClusterContext.receiverContextClosed(ClusterContext.java:414)
	at org.jboss.ejb.client.EJBClientContext.unregisterEJBReceiver(EJBClientContext.java:443)
	- locked <0x000000078608c238> (a java.util.IdentityHashMap)
	at org.jboss.ejb.client.EJBReceiverContext.close(EJBReceiverContext.java:59)
	at org.jboss.ejb.client.remoting.ChannelAssociation.notifyBrokenChannel(ChannelAssociation.java:407)
	at org.jboss.ejb.client.remoting.ChannelAssociation.access$100(ChannelAssociation.java:59)
	at org.jboss.ejb.client.remoting.ChannelAssociation$1.handleClose(ChannelAssociation.java:118)
	at org.jboss.ejb.client.remoting.ChannelAssociation$1.handleClose(ChannelAssociation.java:110)
	at org.jboss.remoting3.spi.SpiUtils.safeHandleClose(SpiUtils.java:54)
	at org.jboss.remoting3.spi.AbstractHandleableCloseable$CloseHandlerTask.run(AbstractHandleableCloseable.java:514)
	at org.jboss.remoting3.spi.AbstractHandleableCloseable.runCloseTask(AbstractHandleableCloseable.java:419)
	at org.jboss.remoting3.spi.AbstractHandleableCloseable.closeComplete(AbstractHandleableCloseable.java:290)
	at org.jboss.remoting3.remote.RemoteConnectionChannel.closeAction(RemoteConnectionChannel.java:531)
	at org.jboss.remoting3.spi.AbstractHandleableCloseable.closeAsync(AbstractHandleableCloseable.java:372)
	at org.jboss.remoting3.remote.RemoteConnectionHandler.closeAllChannels(RemoteConnectionHandler.java:437)
	at org.jboss.remoting3.remote.RemoteConnectionHandler.receiveCloseRequest(RemoteConnectionHandler.java:217)
	at org.jboss.remoting3.remote.RemoteReadListener.handleEvent(RemoteReadListener.java:113)
	at org.jboss.remoting3.remote.RemoteReadListener.handleEvent(RemoteReadListener.java:46)
	at org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:72)
	at org.xnio.channels.TranslatingSuspendableChannel.handleReadable(TranslatingSuspendableChannel.java:189)
	at org.xnio.channels.TranslatingSuspendableChannel$1.handleEvent(TranslatingSuspendableChannel.java:103)
	at org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:72)
	at org.xnio.nio.NioHandle.run(NioHandle.java:90)
	at org.xnio.nio.WorkerThread.run(WorkerThread.java:198)
"Client09":
	at org.jboss.ejb.client.EJBClientContext.getNodeEJBReceiver(EJBClientContext.java:860)
	- waiting to lock <0x000000078608c238> (a java.util.IdentityHashMap)
	at org.jboss.ejb.client.EJBClientContext.getNodeEJBReceiver(EJBClientContext.java:846)
	at org.jboss.ejb.client.EJBClientContext.getNodeEJBReceiverContext(EJBClientContext.java:888)
	at org.jboss.ejb.client.ClusterContext.isNodeConnectedAndDeployed(ClusterContext.java:251)
	at org.jboss.ejb.client.ClusterContext.getConnectedAndDeployedNodes(ClusterContext.java:266)
	- locked <0x000000078609c840> (a java.util.Collections$SynchronizedSet)
	at org.jboss.ejb.client.ClusterContext.getEJBReceiverContext(ClusterContext.java:125)
	at org.jboss.ejb.client.ClusterContext.getEJBReceiverContext(ClusterContext.java:96)
	at org.jboss.ejb.client.EJBClientContext.getClusterEJBReceiverContext(EJBClientContext.java:958)
	at org.jboss.ejb.client.EJBClientContext.getClusterEJBReceiverContext(EJBClientContext.java:938)
	at org.jboss.ejb.client.ReceiverInterceptor.handleInvocation(ReceiverInterceptor.java:108)
	at org.jboss.ejb.client.EJBClientInvocationContext.sendRequest(EJBClientInvocationContext.java:186)
	at org.jboss.ejb.client.EJBInvocationHandler.sendRequestWithPossibleRetries(EJBInvocationHandler.java:255)
	at org.jboss.ejb.client.EJBInvocationHandler.doInvoke(EJBInvocationHandler.java:200)
	at org.jboss.ejb.client.EJBInvocationHandler.doInvoke(EJBInvocationHandler.java:183)
	at org.jboss.ejb.client.EJBInvocationHandler.invoke(EJBInvocationHandler.java:146)
	at com.sun.proxy.$Proxy23.test(Unknown Source)
	at org.jboss.as.test.clustering.lodh.june2013.tools.ClientThread.doCall(ClientThread.java:75)
	at org.jboss.as.test.clustering.lodh.june2013.tools.ClientThread.run(ClientThread.java:56)

This appears to be caused by the fix for bug 1266112, which is commit https://github.com/jbossas/jboss-ejb-client/commit/66cda3bcd2 The fix itself is arguably correct, but it reveals another concurrency issue.

Version-Release number of selected component (if applicable):

EAP 6.4.6.CP.CR2

JBoss EJB Client 1.0.32.Final-redhat-1

Turns out that this version of EJB client also appears in EAP 6.4.5, so EAP 6.4.5 is most likely also affected. This wasn't discovered when verifying the fix for bug 1266112, because that bug appeared in scenarios where the EAP process was `kill -9`-ed, so I only used these "processKill" scenarios for verifying. On the other hand, this bug appears in the scenarios where EAP is shut down "gracefully".

How reproducible:

Often, but not 100% of the time.

Steps to Reproduce:

See a private comment below.

Actual results:

Deadlock occurs.

Expected results:

No deadlock occurs.

Comment 3 Enrique Gonzalez Martinez 2016-01-22 08:02:05 UTC
The deadlock is simple to reproduce. It happens when

Comment 4 Enrique Gonzalez Martinez 2016-02-01 07:19:05 UTC
For some reason my last commet is just half of it. The bug can be reproduced easily with a stateless ejb and 2 instances of the app server.

If you shutdown one of the servers during an ejb-client invocation you get this error (it is a race condition). Reducing the scope of the connectedAndDeployed in the function getConnectedAndDeployedNodes solves the problem.

PR upstream 7.0.0: https://github.com/jbossas/jboss-ejb-client/pull/138
PR 6.4.z: https://github.com/jbossas/jboss-ejb-client/pull/137

Comment 5 Enrique Gonzalez Martinez 2016-02-02 08:08:58 UTC
merged

6.4.z: 1.0.x https://github.com/jbossas/jboss-ejb-client/commit/3a19ab2113cce7026847c70058195b2c5d8dce28 

@Brad this got merged in 6.4.z branch in ejb-client. this will get in the next ejb-client upgrade

Comment 8 Jiří Bílek 2016-05-12 13:49:51 UTC
Verified with EAP 6.4.8.CP.CR2

Comment 9 JBoss JIRA Server 2016-05-17 10:41:40 UTC
Enrique González Martínez <elguardian> updated the status of jira JBEAP-3114 to Resolved

Comment 10 JBoss JIRA Server 2016-08-23 11:38:33 UTC
Jiri Pallich <jpallich> updated the status of jira JBEAP-3114 to Closed

Comment 11 Petr Penicka 2017-01-17 12:32:38 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.