Bug 1087244

Summary: Infinispan issue: A startup issue in cluster mode when the in-memory state is not fetched
Product: [JBoss] JBoss Enterprise Application Platform 6
Reporter: Boleslaw Dawidowicz <bdawidow>
Component: Clustering
Assignee: Paul Ferraro <paul.ferraro>
Status: CLOSED WONTFIX
QA Contact: Jitka Kozana <jkudrnac>
Severity: unspecified
Docs Contact: Russell Dickenson <rdickens>
Priority: unspecified
Version: 6.3.0
CC: dberinde, kkhan, lthon, mmarkus
Target Milestone: ER9
Target Release: EAP 6.3.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-07-07 06:45:02 UTC
Bug Depends On: 1087264

Description Boleslaw Dawidowicz 2014-04-14 07:28:56 UTC
Described by Nicolas Filotto:

"A startup issue in cluster mode when we don't fetch the in-memory state. To work around it, I simply enabled fetchInMemoryState, but that is not how we want to configure the cache: it adds significant latency whenever a new node joins an already running cluster. This is the most annoying issue, as we have to configure the cache in an unexpected manner.


 I'm facing a very annoying issue with ISPN 5.2.7.Final + synchronous replication + state transfer with fetchInMemoryState set to false + UDP. With this particular configuration, which is unfortunately the target configuration of JCR 1.16, I get deadlocks at cache startup that cause errors of the following type on the master:

09.04.2014 12:39:27,901 *ERROR* [transport-thread-1] ClusterTopologyManagerImpl: ISPN000230: Failed to start rebalance for cache foo (ClusterTopologyManagerImpl.java, line 132)

org.infinispan.CacheException: org.jgroups.TimeoutException: TimeoutException
	at org.infinispan.util.Util.rewrapAsCacheException(Util.java:542)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:186)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:515)
	at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterAsync(ClusterTopologyManagerImpl.java:607)
	at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastRebalanceStart(ClusterTopologyManagerImpl.java:405)
	at org.infinispan.topology.ClusterTopologyManagerImpl.startRebalance(ClusterTopologyManagerImpl.java:395)
	at org.infinispan.topology.ClusterTopologyManagerImpl.access$000(ClusterTopologyManagerImpl.java:66)
	at org.infinispan.topology.ClusterTopologyManagerImpl$1.call(ClusterTopologyManagerImpl.java:129)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:680)
Caused by: org.jgroups.TimeoutException: TimeoutException
	at org.jgroups.util.Promise._getResultWithTimeout(Promise.java:145)
	at org.jgroups.util.Promise.getResultWithTimeout(Promise.java:40)
	at org.jgroups.util.AckCollector.waitForAllAcks(AckCollector.java:93)
	at org.jgroups.protocols.RSVP$Entry.block(RSVP.java:287)
	at org.jgroups.protocols.RSVP.down(RSVP.java:118)
	at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:238)
	at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
	at org.jgroups.JChannel.down(JChannel.java:722)
	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:616)
	at org.jgroups.blocks.RequestCorrelator.sendUnicastRequest(RequestCorrelator.java:204)
	at org.jgroups.blocks.UnicastRequest.sendRequest(UnicastRequest.java:43)
	at org.jgroups.blocks.Request.execute(Request.java:83)
	at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:370)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:301)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:179)
	... 11 more

To reproduce, I launch 2 JVMs on my local machine with -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=127.0.0.1; both run the following code:

import junit.framework.TestCase;

import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class StartUpTest extends TestCase
{
   public void testStartUp() throws Exception
   {
      GlobalConfigurationBuilder configBuilder = new GlobalConfigurationBuilder();
      configBuilder.transport().defaultTransport().addProperty("configurationFile", "udp.xml");
      EmbeddedCacheManager manager = new DefaultCacheManager(configBuilder.build());

      // Synchronous replication without in-memory state transfer:
      // the configuration that triggers the startup deadlock.
      ConfigurationBuilder confBuilder = new ConfigurationBuilder();
      confBuilder.clustering().cacheMode(CacheMode.REPL_SYNC).stateTransfer().fetchInMemoryState(false);
      Configuration conf = confBuilder.build();
      manager.defineConfiguration("foo", conf);

      Cache<Object, Object> cache = manager.getCache("foo");
      cache.start();
      System.out.println("Fully Started");

      // Keep the JVM alive so this cluster member stays up.
      synchronized (this)
      {
         wait();
      }
   }
}
The first instance starts normally; the second starts only after a pause (which is actually a timeout), and the first instance logs the stack trace above.
I have tested with 5.3.0.Alpha1 and it works normally. I have also tested with ISPN 5.2.8.Final and it fails."
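The workaround the reporter describes at the top of the bug (enabling fetchInMemoryState) amounts to flipping a single flag in the reproducer's cache configuration. A minimal sketch using the same builder API as the test case; this is the rejected workaround, not a fix:

```java
// Workaround from the report: fetch the in-memory state on startup.
// This avoids the startup timeout, but adds state-transfer latency
// whenever a node joins an already running cluster, which is why the
// reporter considers it unacceptable for the JCR 1.16 target configuration.
ConfigurationBuilder confBuilder = new ConfigurationBuilder();
confBuilder.clustering()
           .cacheMode(CacheMode.REPL_SYNC)
           .stateTransfer()
           .fetchInMemoryState(true); // instead of false
```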

Comment 1 Boleslaw Dawidowicz 2014-04-14 07:31:48 UTC
Needed for JBoss Portal 6.2

Comment 2 Dan Berindei 2014-04-17 15:52:50 UTC
There is another workaround: setting RSVP.ack_on_delivery=false in the JGroups configuration (or even removing the RSVP protocol from the stack completely).
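This workaround is applied in the JGroups protocol stack file (udp.xml in the reproducer). A sketch, assuming a stock stack that already contains an RSVP element; the resend_interval and timeout values shown are illustrative, not taken from the report:

```xml
<!-- In the JGroups stack configuration (e.g. udp.xml): -->

<!-- Option 1: keep RSVP, but do not wait for delivery acks. -->
<RSVP resend_interval="2000" timeout="10000" ack_on_delivery="false"/>

<!-- Option 2: delete the RSVP element from the stack entirely. -->
```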

Comment 3 Mircea Markus 2014-04-23 15:49:57 UTC
Nicolas Filotto acknowledged that the setting Dan suggested works for him, so this should be rejected.