Described by Nicolas Filotto:

"A startup issue in cluster mode when we don't fetch the in-memory state. To work around it, I simply enabled fetchInMemoryState, but that is not how we want to configure the cache: it will add significant latency whenever a new node joins an already running cluster. This is the most annoying issue, as we have to configure the cache in an unexpected manner.

I'm facing a very annoying issue with ISPN 5.2.7.Final + synchronous replication + state transfer with fetchInMemoryState set to false + udp. With this particular configuration, which is unfortunately the target configuration of JCR 1.16, I get deadlocks at cache startup that cause errors of the following type on the master:

```
09.04.2014 12:39:27,901 *ERROR* [transport-thread-1] ClusterTopologyManagerImpl: ISPN000230: Failed to start rebalance for cache foo (ClusterTopologyManagerImpl.java, line 132)
org.infinispan.CacheException: org.jgroups.TimeoutException: TimeoutException
	at org.infinispan.util.Util.rewrapAsCacheException(Util.java:542)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:186)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:515)
	at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterAsync(ClusterTopologyManagerImpl.java:607)
	at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastRebalanceStart(ClusterTopologyManagerImpl.java:405)
	at org.infinispan.topology.ClusterTopologyManagerImpl.startRebalance(ClusterTopologyManagerImpl.java:395)
	at org.infinispan.topology.ClusterTopologyManagerImpl.access$000(ClusterTopologyManagerImpl.java:66)
	at org.infinispan.topology.ClusterTopologyManagerImpl$1.call(ClusterTopologyManagerImpl.java:129)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: org.jgroups.TimeoutException: TimeoutException
	at org.jgroups.util.Promise._getResultWithTimeout(Promise.java:145)
	at org.jgroups.util.Promise.getResultWithTimeout(Promise.java:40)
	at org.jgroups.util.AckCollector.waitForAllAcks(AckCollector.java:93)
	at org.jgroups.protocols.RSVP$Entry.block(RSVP.java:287)
	at org.jgroups.protocols.RSVP.down(RSVP.java:118)
	at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:238)
	at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
	at org.jgroups.JChannel.down(JChannel.java:722)
	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:616)
	at org.jgroups.blocks.RequestCorrelator.sendUnicastRequest(RequestCorrelator.java:204)
	at org.jgroups.blocks.UnicastRequest.sendRequest(UnicastRequest.java:43)
	at org.jgroups.blocks.Request.execute(Request.java:83)
	at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:370)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:301)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:179)
	... 11 more
```

To reproduce, I launch two JVMs on my local machine with -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=127.0.0.1, both running the following code:

```java
public class StartUpTest extends TestCase {
    public void testStartUp() throws Exception {
        GlobalConfigurationBuilder configBuilder = new GlobalConfigurationBuilder();
        configBuilder.transport().defaultTransport().addProperty("configurationFile", "udp.xml");
        EmbeddedCacheManager manager = new DefaultCacheManager(configBuilder.build());

        ConfigurationBuilder confBuilder = new ConfigurationBuilder();
        confBuilder.clustering().cacheMode(CacheMode.REPL_SYNC).stateTransfer().fetchInMemoryState(false);
        Configuration conf = confBuilder.build();
        manager.defineConfiguration("foo", conf);

        Cache<Object, Object> cache = manager.getCache("foo");
        cache.start();
        System.out.println("Fully Started");
        synchronized (this) {
            wait();
        }
    }
}
```

The first instance starts normally; the second one starts after a pause (which is actually a timeout), and on the first instance we get the stack trace above. I have tested with 5.3.0.Alpha1 and it works normally. I have also tested with ISPN 5.2.8.Final and it fails."
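For reference, the workaround the reporter mentions (enabling state transfer) amounts to flipping the fetchInMemoryState flag in the cache configuration. A sketch of that change against the same Infinispan 5.2 builder API used in the test above; this is a fragment, not a standalone program:

```java
// Workaround sketch (NOT the desired configuration): enabling
// fetchInMemoryState avoids the startup deadlock, at the cost of
// transferring the full in-memory state to every joining node.
ConfigurationBuilder confBuilder = new ConfigurationBuilder();
confBuilder.clustering()
           .cacheMode(CacheMode.REPL_SYNC)
           .stateTransfer()
           .fetchInMemoryState(true); // instead of false
manager.defineConfiguration("foo", confBuilder.build());
```

This is exactly why the reporter calls the workaround unacceptable: it changes the cache's state-transfer behavior rather than fixing the startup deadlock.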
Needed for JBoss Portal 6.2
There is another workaround: setting RSVP.ack_on_delivery=false in the JGroups configuration (or even removing RSVP from the stack completely).
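This workaround is applied in the JGroups stack XML (udp.xml in the reproducer above). A minimal sketch of the relevant protocol entry; any other RSVP attributes in the actual stack would be left as they are:

```xml
<!-- In the protocol stack (e.g. udp.xml): stop RSVP from blocking
     until delivery acks arrive, which is what times out at startup -->
<RSVP ack_on_delivery="false"/>
```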
Nicolas Filotto acknowledged that the setting Dan suggested works for him, so this issue should be rejected.