Description of problem:

All threads in the coordinator's "remote-thread-NN" and "OOB-NN" pools are depleted by org.infinispan.topology.ClusterTopologyManagerImpl.waitForView(). This method waits to receive a new view, but the view update message is queued behind the blocked threads and is never processed, causing a deadlock. Here are two example stacks, one from each pool.

~~~
"remote-thread-172" daemon prio=10 tid=0x00007f0cc48c0000 nid=0x28ca4 in Object.wait() [0x00007f0c5f25b000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at org.infinispan.topology.ClusterTopologyManagerImpl.waitForView(ClusterTopologyManagerImpl.java:357)
	- locked <0x00000000ff3bd900> (a java.lang.Object)
	at org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:123)
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:162)
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:144)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:276)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

"OOB-210,shared=udp" prio=10 tid=0x00007f0c90154800 nid=0x28ca3 in Object.wait() [0x00007f0c5f35a000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at org.infinispan.topology.ClusterTopologyManagerImpl.waitForView(ClusterTopologyManagerImpl.java:357)
	- locked <0x00000000ff3bd900> (a java.lang.Object)
	at org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:123)
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:162)
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:144)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:276)
	at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:2025)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
	at org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl.execute(BlockingTaskAwareExecutorServiceImpl.java:134)
	at org.infinispan.executors.LazyInitializingBlockingTaskAwareExecutorService.execute(LazyInitializingBlockingTaskAwareExecutorService.java:128)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromLocalCluster(CommandAwareRpcDispatcher.java:268)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:209)
	at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:460)
	at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:377)
	at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:250)
	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:677)
	at org.jgroups.blocks.mux.MuxUpHandler.up(MuxUpHandler.java:130)
	at org.jgroups.JChannel.up(JChannel.java:733)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1029)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:383)
	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:1042)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
	at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1064)
	at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:779)
	at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:652)
	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
	at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:200)
	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:297)
	at org.jgroups.protocols.MERGE3.up(MERGE3.java:288)
	at org.jgroups.protocols.Discovery.up(Discovery.java:291)
	at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2837)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1572)
	at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1823)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
~~~

Version-Release number of selected component (if applicable):

JDG 6.4.0 (infinispan-core-6.2.0.Final-redhat-4.jar). JDG 6.4.1 is probably affected as well, because the implementation of ClusterTopologyManagerImpl.waitForView() is unchanged.

How reproducible:

Hit once in the customer's environment when upgrading to JDG 6.4.0. It is possible to reproduce locally with the following settings, but the reproducibility is not high (about once per 20 trials or less with a 5-node cluster and 400 caches).

Attachment: p2.coord-deadlock.waitForView.zip
./config/clustered.conf
./config/clustered.xml
./server_log/coord.server.log
./server_log/node1.server.log
./server_log/node2.server.log
./server_log/node3.server.log
./server_log/node4.server.log
./stack/coord.log
./stack/node1.log
./stack/node2.log
./stack/node3.log
./stack/node4.log

Steps to Reproduce:

Start one node first so that it becomes the coordinator, then start all the remaining nodes at once. The configuration is as follows:
- System property -Dorg.jboss.server.bootstrap.maxThreads=500
- 400 caches defined
- Tweak the JGroups thread pools as follows:
~~~
<transport type="UDP" socket-binding="jgroups-udp">
    <property name="internal_thread_pool.enabled">true</property>
    <property name="internal_thread_pool.min_threads">20</property>
    <property name="internal_thread_pool.max_threads">20</property>
    <property name="internal_thread_pool.queue_enabled">true</property>
    <property name="internal_thread_pool.queue_max_size">1000000</property>
    <property name="internal_thread_pool.rejection_policy">Abort</property>
    <property name="thread_pool.enabled">true</property>
    <property name="thread_pool.min_threads">20</property>
    <property name="thread_pool.max_threads">20</property>
    <property name="thread_pool.queue_enabled">true</property>
    <property name="thread_pool.queue_max_size">1000000</property>
    <property name="thread_pool.rejection_policy">Abort</property>
    <property name="oob_thread_pool.enabled">true</property>
    <property name="oob_thread_pool.min_threads">210</property>
    <property name="oob_thread_pool.max_threads">210</property>
    <property name="oob_thread_pool.queue_enabled">true</property>
    <property name="oob_thread_pool.queue_max_size">1000000</property>
    <property name="oob_thread_pool.rejection_policy">discard</property>
</transport>
~~~

- Tweak the infinispan-transport thread pool as follows:

~~~
<subsystem xmlns="urn:jboss:domain:threads:1.1">
    <thread-factory name="infinispan-factory" group-name="infinispan" priority="5" thread-name-pattern="%G %f-%t"/>
    <blocking-bounded-queue-thread-pool name="infinispan-transport" allow-core-timeout="true">
        <core-threads count="1025"/>
        <max-threads count="1025"/>
        <queue-length count="1000000"/>
        <keepalive-time time="0" unit="milliseconds"/>
        <thread-factory name="infinispan-factory"/>
    </blocking-bounded-queue-thread-pool>
</subsystem>
~~~

Actual results:

The coordinator's remote-thread and OOB thread pools are all filled by threads blocked in ClusterTopologyManagerImpl.waitForView().

Expected results:

No deadlock.
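For reference, the rejection path visible in the OOB stack above (ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution) can be reproduced in isolation. This is an illustrative sketch, not Infinispan code: it only demonstrates how CallerRunsPolicy makes the submitting thread (named "OOB-1" here for illustration) execute a rejected task itself, which is why an OOB thread ends up blocked inside a remote command instead of delivering further messages.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Standalone sketch (not Infinispan code) of the CallerRunsPolicy frames
// in the OOB stack: when the remote-command pool is saturated, the
// rejected task runs on the submitting "OOB" thread itself.
public class CallerRunsSketch {
    public static String whoRanTheCommand() throws InterruptedException {
        ThreadPoolExecutor remotePool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),                  // no queue slack
                new ThreadPoolExecutor.CallerRunsPolicy());

        // Saturate the single worker with a blocked "remote command".
        CountDownLatch release = new CountDownLatch(1);
        remotePool.execute(() -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        });

        // A simulated OOB thread submits the next command; the pool is
        // full, so CallerRunsPolicy runs it on the OOB thread itself.
        String[] runnerName = new String[1];
        Thread oob = new Thread(
                () -> remotePool.execute(
                        () -> runnerName[0] = Thread.currentThread().getName()),
                "OOB-1");
        oob.start();
        oob.join();

        release.countDown();
        remotePool.shutdown();
        return runnerName[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(whoRanTheCommand()); // prints "OOB-1"
    }
}
```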
Additional info:

The system property "org.jboss.server.bootstrap.maxThreads" defines the number of MSC threads; its default is 2 * numberOfCpus. Each cache sends a JOIN request from one of these threads at boot time, so a cluster running on many-core CPUs and hosting a large number of caches can hit this issue.
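The pool-depletion mechanism described above can be sketched generically (illustrative code, not Infinispan's implementation): every worker of a fixed pool blocks waiting for a "new view", while the task that would deliver that view is queued on the very same pool and never runs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Generic sketch (not Infinispan code) of the waitForView() deadlock:
// every worker blocks waiting for the next view, and the view-update
// task is stuck in the queue of the very same pool.
public class ViewDeadlockSketch {
    public static boolean viewUpdateEverRan() throws InterruptedException {
        int poolSize = 4;
        ExecutorService remotePool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch viewInstalled = new CountDownLatch(1);
        AtomicBoolean updateRan = new AtomicBoolean(false);

        // Each "cache JOIN" handler occupies a worker, blocking in a
        // waitForView()-style wait until the next view is installed.
        for (int i = 0; i < poolSize; i++) {
            remotePool.execute(() -> {
                try { viewInstalled.await(); } catch (InterruptedException ignored) { }
            });
        }

        // The view-update message arrives on the SAME pool: it is queued
        // behind the blocked JOINs and never gets a thread.
        remotePool.execute(() -> {
            updateRan.set(true);
            viewInstalled.countDown();
        });

        Thread.sleep(300);        // ample time; nothing will happen
        remotePool.shutdownNow(); // break the deadlock to end the sketch
        return updateRan.get();   // false: the pool deadlocked on itself
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(viewUpdateEverRan()); // prints "false"
    }
}
```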
Created attachment 1010069 [details] p2.coord-deadlock.waitForView.zip
Copied from JIRA:

The fix for bug 1217380 (ISPN-5106) already fixed the waitForView() problem partially, but it wasn't enough. When the coordinator installed two views in quick succession, the thread updating the cache members list for the first view would block waiting for the CacheTopologyControlCommand(POLICY_GET_STATUS) response from the other members. Then, because the other members got the newer view before sending their join requests, all the remote-executor and OOB threads would block in waitForView(), and there was no way to receive the POLICY_GET_STATUS responses (since processing a response also needs an OOB thread). The solution was to update the cache members asynchronously.

Testing with a limited number of OOB/remote-executor threads also exposed some other deadlocks, and the pull request tries to plug as many of them as possible. However, because the caches will not start in the same order on every node, there is always the possibility of two nodes sending state transfer requests to each other (for different caches) and not being able to process the responses (because the OOB threads are all blocked, waiting for those very responses). So a deadlock is still possible if remote-executor.max-threads + OOB.max_threads < number of caches.

The remote executor configuration is ignored in JDG server (bug 1219417); the pull request includes a fix for that as well. The attached configuration didn't specify anything for the remote executor, so I would suggest defining a blocking-queueless-thread-pool:

~~~
<transport ... remote-command-executor="infinispan-remote"/>
...
<blocking-queueless-thread-pool name="infinispan-remote">
    <max-threads count="400"/>
    <keepalive-time time="0" unit="milliseconds"/>
    <thread-factory name="infinispan-factory"/>
</blocking-queueless-thread-pool>
~~~
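The "update the cache members asynchronously" direction can be illustrated with a small sketch. The names below (handleJoin, handleViewChange, viewFuture) are hypothetical, not Infinispan's actual API: the point is that a join handler registers a continuation on a future tied to the awaited view id and returns immediately, instead of calling Object.wait() on the message-delivery thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch (hypothetical names, not Infinispan's API):
// join handlers chain work onto a future completed when the awaited
// view arrives, so no message-delivery thread ever blocks in wait().
public class AsyncViewHandlerSketch {
    private final ConcurrentMap<Integer, CompletableFuture<Void>> viewFutures =
            new ConcurrentHashMap<>();
    private volatile int currentViewId = 0;

    private CompletableFuture<Void> viewFuture(int viewId) {
        if (viewId <= currentViewId) {
            return CompletableFuture.completedFuture(null);
        }
        return viewFutures.computeIfAbsent(viewId, id -> new CompletableFuture<>());
    }

    // Called on a remote/OOB thread; returns immediately instead of
    // blocking until the joiner's view is installed.
    public CompletableFuture<String> handleJoin(String cacheName, int joinerViewId) {
        return viewFuture(joinerViewId)
                .thenApply(ignored -> "joined " + cacheName + " in view " + currentViewId);
    }

    // Called when a new view is installed; releases all pending joins
    // without tying up the thread that delivered the view message.
    public void handleViewChange(int newViewId) {
        currentViewId = newViewId;
        viewFutures.forEach((viewId, future) -> {
            if (viewId <= newViewId) {
                future.complete(null);
            }
        });
    }

    public static void main(String[] args) {
        AsyncViewHandlerSketch mgr = new AsyncViewHandlerSketch();
        CompletableFuture<String> join = mgr.handleJoin("cacheA", 2);
        System.out.println(join.isDone());  // prints "false": no thread blocks
        mgr.handleViewChange(2);
        System.out.println(join.join());    // prints "joined cacheA in view 2"
    }
}
```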
Ran several tests using the following scenario and JDG 6.5.0 CR1:

1) Start 4 servers hosting 1000 caches
2) Repeat the following 20 times:
   a) Start writing to the caches using 3 clients
   b) Kill all 4 servers
   c) Restart all 4 servers and wait until there are no topology changes or cache rehashes

According to Dan, this scenario should be sufficient to verify this issue.