Bug 1259418 - StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes
Summary: StateTransferManager.waitForInitialTransferToComplete can fail if the coordin...
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: DR3
Target Release: 6.6.0
Assignee: Tristan Tarrant
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On: 1255665
Blocks:
Reported: 2015-09-02 14:37 UTC by Dan Berindei
Modified: 2025-02-10 03:48 UTC

Fixed In Version:
Clone Of: 1255665
Environment:
Last Closed: 2025-02-10 03:48:07 UTC
Type: Bug
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | ISPN-5459 | 0 | Critical | Resolved | StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes | 2016-01-14 10:35:41 UTC

Description Dan Berindei 2015-09-02 14:37:49 UTC
+++ This bug was initially created as a clone of Bug #1255665 +++

Please see https://issues.jboss.org/browse/ISPN-5459

The same issue is present in JDG 6.5.1.ER1: ClusterListenerDistTxAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener fails with the same exception.

Comment 2 Dan Berindei 2015-09-02 14:45:29 UTC
PR: https://github.com/infinispan/jdg/pull/738

Comment 3 Matej Čimbora 2016-01-05 11:47:35 UTC
Looking at the code, I'm wondering whether this respects the configuration.clustering.stateTransfer.timeout setting. If the coordinator leaves while we're checking whether rebalancing is enabled, this can end up waiting indefinitely in LocalTopologyManagerImpl.isRebalancingEnabled:343 - transport.waitForView(nextViewId).
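
To illustrate the concern, here is a minimal sketch of a view wait that is bounded by a timeout instead of blocking indefinitely. It is not Infinispan code: the class BoundedViewWait and its methods viewInstalled and waitForView(expectedViewId, timeout, unit) are hypothetical names chosen for this example, and only standard java.util.concurrent classes are used.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a transport's view tracking; not the Infinispan implementation. */
final class BoundedViewWait {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition viewChanged = lock.newCondition();
    private int currentViewId;

    BoundedViewWait(int initialViewId) {
        this.currentViewId = initialViewId;
    }

    /** Records a newly installed view and wakes up any waiting threads. */
    void viewInstalled(int viewId) {
        lock.lock();
        try {
            if (viewId > currentViewId) {
                currentViewId = viewId;
                viewChanged.signalAll();
            }
        } finally {
            lock.unlock();
        }
    }

    /** Waits until the current view id reaches expectedViewId, or throws once the timeout expires. */
    void waitForView(int expectedViewId, long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        long remainingNanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (currentViewId < expectedViewId) {
                if (remainingNanos <= 0) {
                    throw new TimeoutException("Timed out waiting for view " + expectedViewId
                            + ", current view is " + currentViewId);
                }
                // awaitNanos returns an estimate of the time still left to wait
                remainingNanos = viewChanged.awaitNanos(remainingNanos);
            }
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch, a caller that wanted to honor the state transfer timeout would pass that value as the timeout argument, so a missing view update would surface as a TimeoutException rather than a hang.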

Apart from this, there's a new random failure in ClusterListenerDistTxAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener.

org.infinispan.util.concurrent.TimeoutException: Timed out before caches had complete views.  Expected 3 members in each view.  Views are as follows: [[ClusterListenerDistTxAddListenerTest-NodeM-60615|3] (4) [ClusterListenerDistTxAddListenerTest-NodeM-60615, ClusterListenerDistTxAddListenerTest-NodeN-60848, ClusterListenerDistTxAddListenerTest-NodeO-57530, ClusterListenerDistTxAddListenerTest-NodeP-7100]]
	at org.infinispan.test.TestingUtil.viewsTimedOut(TestingUtil.java:278)
	at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:340)
	at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:964)
	at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener(AbstractClusterListenerDistAddListenerTest.java:249)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
	at org.testng.TestRunner.privateRun(TestRunner.java:767)
	at org.testng.TestRunner.run(TestRunner.java:617)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
	at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
	at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
	at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
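
For readers unfamiliar with this kind of check, the following is a rough sketch of how a "wait until the view has N members" helper can time out in the way the message above describes (3 members expected, 4 still present in the view). It is an assumption-laden illustration, not the actual TestingUtil.blockUntilViewsReceived code: the class ViewSizeCheck, the method awaitViewSize, and the 10 ms poll interval are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical polling helper; not the Infinispan test utility. */
final class ViewSizeCheck {
    /**
     * Polls the supplied view until it has exactly expectedMembers entries.
     * Returns true on success and false if the timeout expires first.
     */
    static boolean awaitViewSize(Supplier<List<String>> currentView,
                                 int expectedMembers,
                                 long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (currentView.get().size() != expectedMembers) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // the view never reached the expected size
            }
            Thread.sleep(10); // assumed poll interval
        }
        return true;
    }
}
```

With a view that still lists four nodes after one of them has died, a call expecting three members returns false once the timeout elapses, which corresponds to the failure reported here.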

Comment 4 Dan Berindei 2016-01-06 15:51:15 UTC
It doesn't respect the state transfer timeout, indeed. However, this shouldn't be a problem in practice, because the state transfer timeout should be much bigger than the time it takes to elect a new coordinator.

I'm not sure the test failure is related, since the failure seems to happen when a node leaves, and my changes only affect joining. I haven't been able to reproduce it on my machine; it would be great if you could reproduce it with trace logging enabled.

Comment 5 Matej Čimbora 2016-01-07 09:12:38 UTC
Fair enough. I'll create a new issue for the random failure.

Comment 9 Red Hat Bugzilla 2025-02-10 03:48:07 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.