Bug 1259418 - StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes
Status: VERIFIED
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: DR3
Target Release: 6.6.0
Assigned To: Dan Berindei
QA Contact: Martin Gencur
Depends On: 1255665
Reported: 2015-09-02 10:37 EDT by Dan Berindei
Modified: 2016-01-07 04:12 EST
Doc Type: Bug Fix
Clone Of: 1255665
Type: Bug
External Trackers
Tracker ID Priority Status Summary Last Updated
JBoss Issue Tracker ISPN-5459 Critical Resolved StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes 2016-01-14 05:35 EST
Description Dan Berindei 2015-09-02 10:37:49 EDT
+++ This bug was initially created as a clone of Bug #1255665 +++

Please see https://issues.jboss.org/browse/ISPN-5459

The same issue is present in JDG 6.5.1.ER1: ClusterListenerDistTxAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener fails with the same exception.
Comment 2 Dan Berindei 2015-09-02 10:45:29 EDT
PR: https://github.com/infinispan/jdg/pull/738
Comment 3 Matej Čimbora 2016-01-05 06:47:35 EST
Looking at the code, I'm wondering whether this respects the configuration.clustering.stateTransfer.timeout setting. If the coordinator leaves while we're checking whether rebalancing is enabled, this can end up waiting indefinitely in LocalTopologyManagerImpl.isRebalancingEnabled:343 - transport.waitForView(nextViewId).
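
To make the concern concrete, here is a minimal sketch of a wait for a view that is bounded by a timeout. BoundedViewWaiter and its methods are hypothetical names for illustration only; this is not the Infinispan transport.waitForView implementation, and it is not the fix from the PR. The assumption is that the caller would pass something like the state transfer timeout as the upper bound.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper, not Infinispan code: tracks the latest installed view id
// and lets callers wait for a given view with an upper bound on the wait.
final class BoundedViewWaiter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition viewChanged = lock.newCondition();
    private int currentViewId = -1;

    // Called whenever a new cluster view is installed.
    void viewInstalled(int viewId) {
        lock.lock();
        try {
            if (viewId > currentViewId) {
                currentViewId = viewId;
                viewChanged.signalAll();
            }
        } finally {
            lock.unlock();
        }
    }

    // Returns true if the requested view (or a later one) was installed in time,
    // false if the timeout elapsed or the waiting thread was interrupted.
    boolean waitForView(int viewId, long timeout, TimeUnit unit) {
        long remainingNanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (currentViewId < viewId) {
                if (remainingNanos <= 0) {
                    return false; // give up instead of blocking indefinitely
                }
                // awaitNanos returns an estimate of the time still left to wait
                remainingNanos = viewChanged.awaitNanos(remainingNanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

With a helper like this, a caller that gets false back could fail with a timeout exception rather than hang if the next view never arrives.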

Apart from this, there's a new random failure in ClusterListenerDistTxAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener.

org.infinispan.util.concurrent.TimeoutException: Timed out before caches had complete views.  Expected 3 members in each view.  Views are as follows: [[ClusterListenerDistTxAddListenerTest-NodeM-60615|3] (4) [ClusterListenerDistTxAddListenerTest-NodeM-60615, ClusterListenerDistTxAddListenerTest-NodeN-60848, ClusterListenerDistTxAddListenerTest-NodeO-57530, ClusterListenerDistTxAddListenerTest-NodeP-7100]]
	at org.infinispan.test.TestingUtil.viewsTimedOut(TestingUtil.java:278)
	at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:340)
	at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:964)
	at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener(AbstractClusterListenerDistAddListenerTest.java:249)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
	at org.testng.TestRunner.privateRun(TestRunner.java:767)
	at org.testng.TestRunner.run(TestRunner.java:617)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
	at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
	at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
	at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Comment 4 Dan Berindei 2016-01-06 10:51:15 EST
It doesn't respect the state transfer timeout, indeed. However, this shouldn't be a problem in practice, because the state transfer timeout should be much bigger than the time it takes to elect a new coordinator.

I'm not sure the test failure is related, since the failure seems to happen when a node leaves, and my changes only affect joining. I haven't been able to reproduce it on my machine, so it would be great if you could reproduce it with trace enabled.
Comment 5 Matej Čimbora 2016-01-07 04:12:38 EST
Fair enough. I'll create a new issue for the random failure.