Bug 1259418
| Summary: | StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 6 | Reporter: | Dan Berindei <dberinde> |
| Component: | Infinispan | Assignee: | Tristan Tarrant <ttarrant> |
| Status: | CLOSED UPSTREAM | QA Contact: | Martin Gencur <mgencur> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.6.0 | CC: | jdg-bugs, mgencur, ttarrant, vjuranek |
| Target Milestone: | DR3 | | |
| Target Release: | 6.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1255665 | Environment: | |
| Last Closed: | 2025-02-10 03:48:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1255665 | | |
| Bug Blocks: | | | |
Description
Dan Berindei
2015-09-02 14:37:49 UTC
Looking at the code, I'm wondering whether this respects the configuration.clustering.stateTransfer.timeout setting. If the coordinator leaves while we're checking whether rebalancing is enabled, this can end up waiting indefinitely in LocalTopologyManagerImpl.isRebalancingEnabled:343 - transport.waitForView(nextViewId).

Apart from this, there's a new random failure in ClusterListenerDistTxAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener:

```
org.infinispan.util.concurrent.TimeoutException: Timed out before caches had complete views. Expected 3 members in each view. Views are as follows: [[ClusterListenerDistTxAddListenerTest-NodeM-60615|3] (4) [ClusterListenerDistTxAddListenerTest-NodeM-60615, ClusterListenerDistTxAddListenerTest-NodeN-60848, ClusterListenerDistTxAddListenerTest-NodeO-57530, ClusterListenerDistTxAddListenerTest-NodeP-7100]]
	at org.infinispan.test.TestingUtil.viewsTimedOut(TestingUtil.java:278)
	at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:340)
	at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:964)
	at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener(AbstractClusterListenerDistAddListenerTest.java:249)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
	at org.testng.TestRunner.privateRun(TestRunner.java:767)
	at org.testng.TestRunner.run(TestRunner.java:617)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
	at org.testng.SuiteRunner.access$000(SuiteRunner.java:37)
	at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:368)
	at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

It doesn't respect the state transfer timeout, indeed. However, this shouldn't be a problem in practice, because the state transfer timeout should be much bigger than the time it takes to elect a new coordinator.

I'm not sure the test failure is related, since the failure seems to happen when a node leaves, and my changes only affect joining. I haven't been able to reproduce it on my machine; it would be great if you could reproduce it with trace enabled.

Fair enough. I'll create a new issue for the random failure.

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.
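The core of the report is that the wait on the next cluster view is not bounded by the configured state transfer timeout, so a coordinator crash can leave a joining node blocked indefinitely. The following is a minimal, hypothetical sketch of the bounded-wait idea, not the actual Infinispan implementation; the names `awaitViewOrTimeout`, `viewInstalled`, and `stateTransferTimeoutMillis` are illustrative stand-ins for the internals of `LocalTopologyManagerImpl`/`StateTransferManager`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedViewWait {

    /**
     * Wait until the next cluster view is installed, but give up after the
     * configured state transfer timeout instead of blocking forever. A crashed
     * coordinator delays the next view until a new coordinator is elected, so
     * an unbounded wait at this point is what hangs the joining node.
     */
    static void awaitViewOrTimeout(CompletableFuture<Void> viewInstalled,
                                   long stateTransferTimeoutMillis)
            throws InterruptedException, TimeoutException {
        try {
            // Bounded wait: honours the state transfer timeout rather than
            // waiting for the view indefinitely.
            viewInstalled.get(stateTransferTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            throw new IllegalStateException("View installation failed", e.getCause());
        }
    }

    // Tiny usage example with a view that never arrives: the call times out
    // after one second instead of hanging.
    public static void main(String[] args) throws InterruptedException {
        CompletableFuture<Void> neverInstalled = new CompletableFuture<>();
        try {
            awaitViewOrTimeout(neverInstalled, 1_000);
        } catch (TimeoutException expected) {
            System.out.println("Timed out waiting for the view, as expected");
        }
    }
}
```

As noted in the discussion, the configured state transfer timeout is normally much larger than the time needed to elect a new coordinator, so the bound mainly matters when the coordinator crashes at exactly the wrong moment during a join.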