1259418 – StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	Infinispan
Sub Component:
Version:	6.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	DR3
Target Release:	6.6.0
Assignee:	Tristan Tarrant
QA Contact:	Martin Gencur
Docs Contact:
URL:
Whiteboard:
Depends On:	1255665
Blocks:

Reported:	2015-09-02 14:37 UTC by Dan Berindei
Modified:	2025-02-10 03:48 UTC
CC List:	4 users
Fixed In Version:
Clone Of:	1255665
Environment:
Last Closed:	2025-02-10 03:48:07 UTC
Type:	Bug
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-5459	0	Critical	Resolved	StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes	2016-01-14 10:35:41 UTC

Description Dan Berindei 2015-09-02 14:37:49 UTC

+++ This bug was initially created as a clone of Bug #1255665 +++

Please see https://issues.jboss.org/browse/ISPN-5459

The same issue is present in JDG 6.5.1.ER1: ClusterListenerDistTxAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener fails with the same exception.

Comment 2 Dan Berindei 2015-09-02 14:45:29 UTC

PR: https://github.com/infinispan/jdg/pull/738

Comment 3 Matej Čimbora 2016-01-05 11:47:35 UTC

Looking at the code, I'm wondering whether this respects the configuration.clustering.stateTransfer.timeout setting. If the coordinator leaves while we're checking whether rebalancing is enabled, the call can end up waiting indefinitely in LocalTopologyManagerImpl.isRebalancingEnabled:343, at transport.waitForView(nextViewId).
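For illustration only, a minimal sketch in plain Java of what a bounded wait for a view could look like. BoundedViewWait, installView and the three-argument waitForView are hypothetical names, not Infinispan's actual transport API, and the sketch assumes view ids only ever increase.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a transport that tracks the current cluster view id. */
public class BoundedViewWait {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition viewChanged = lock.newCondition();
    private int viewId;

    /** Called whenever a new cluster view is installed (assumption: ids are monotonic). */
    public void installView(int newViewId) {
        lock.lock();
        try {
            viewId = newViewId;
            viewChanged.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Blocks until the view id reaches expectedViewId, or throws once the timeout elapses. */
    public void waitForView(int expectedViewId, long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        long remainingNanos = unit.toNanos(timeout);
        lock.lock();
        try {
            // Loop to guard against spurious wakeups and unrelated view changes.
            while (viewId < expectedViewId) {
                if (remainingNanos <= 0) {
                    throw new TimeoutException("Timed out waiting for view "
                            + expectedViewId + ", current view is " + viewId);
                }
                // awaitNanos returns an estimate of the time still left to wait.
                remainingNanos = viewChanged.awaitNanos(remainingNanos);
            }
        } finally {
            lock.unlock();
        }
    }
}
```

In a sketch like this, the caller would pass the state transfer timeout as the bound, so that a coordinator that never comes back surfaces as a TimeoutException instead of a hung join.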

Apart from this, there's a new random failure in ClusterListenerDistTxAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener.

org.infinispan.util.concurrent.TimeoutException: Timed out before caches had complete views.  Expected 3 members in each view.  Views are as follows: [[ClusterListenerDistTxAddListenerTest-NodeM-60615|3] (4) [ClusterListenerDistTxAddListenerTest-NodeM-60615, ClusterListenerDistTxAddListenerTest-NodeN-60848, ClusterListenerDistTxAddListenerTest-NodeO-57530, ClusterListenerDistTxAddListenerTest-NodeP-7100]]
	at org.infinispan.test.TestingUtil.viewsTimedOut(TestingUtil.java:278)
	at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:340)
	at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:964)
	at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener(AbstractClusterListenerDistAddListenerTest.java:249)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
	at org.testng.TestRunner.privateRun(TestRunner.java:767)
	at org.testng.TestRunner.run(TestRunner.java:617)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
	at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
	at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
	at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Comment 4 Dan Berindei 2016-01-06 15:51:15 UTC

Indeed, it doesn't respect the state transfer timeout. However, this shouldn't be a problem in practice, because the state transfer timeout should be much longer than the time it takes to elect a new coordinator.

I'm not sure the test failure is related, since the failure seems to happen when a node leaves, and my changes only affect joining. I haven't been able to reproduce it on my machine; it would be great if you could reproduce it with trace logging enabled.

Comment 5 Matej Čimbora 2016-01-07 09:12:38 UTC

Fair enough. I'll create a new issue for the random failure.

Comment 9 Red Hat Bugzilla 2025-02-10 03:48:07 UTC

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.
