Bug 765759 - StateTransferInProgressException during cluster startup
Summary: StateTransferInProgressException during cluster startup
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.0.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 6.0.0
Assignee: Tristan Tarrant
QA Contact: Nobody
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-12-09 10:57 UTC by Michal Linhard
Modified: 2025-02-10 03:14 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2025-02-10 03:14:30 UTC
Type: ---
Embargoed:


Attachments
Exceptions extracted from the server logs (30.09 KB, text/plain)
2011-12-09 10:58 UTC, Michal Linhard


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | AS7-2984 | 0 | Minor | Resolved | Add wait attribute to rehashing element of distributed caches | 2016-07-25 10:16:19 UTC
Red Hat Issue Tracker | ISPN-1610 | 0 | Minor | Resolved | Timeouts waiting for StateTransferLock | 2016-07-25 10:16:19 UTC
Red Hat Issue Tracker | ISPN-1799 | 0 | Major | Resolved | We should avoid using exceptions for flow control when acquiring state transfer lock | 2016-07-25 10:16:19 UTC

Description Michal Linhard 2011-12-09 10:57:31 UTC
https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/hotrod/serverlogs.zip
https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/memcached/serverlogs.zip
https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/rest/serverlogs.zip

In four-node performance tests we're seeing these exceptions during cluster startup:

org.infinispan.distribution.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view X

The timeout is set to 1 min, and we have no way to raise it, because the AS7 Infinispan subsystem doesn't allow setting this config property.

If I understand the nature of the exception correctly, even if there were a way to set a higher timeout, it's quite suspicious that a state transfer takes that long (> 1 min), since the caches are empty.

Comment 1 Michal Linhard 2011-12-09 10:58:41 UTC
Created attachment 544455
Exceptions extracted from the server logs

Comment 2 Tristan Tarrant 2011-12-12 16:40:41 UTC
I have linked https://issues.jboss.org/browse/AS7-2984 to track progress on adding support for configuring state transfer rehash waiting for distributed caches.

Comment 3 Michal Linhard 2012-01-13 16:30:11 UTC
In the "workaround build"
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-edg-from-source/96/

I can't see this issue anymore.

Comment 6 Michal Linhard 2012-03-07 09:29:22 UTC
Another appearance with JDG 6.0.0.ER2:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard/213/console-edg-perf02/

Comment 7 Michal Linhard 2012-03-16 09:59:17 UTC
Appearance with JDG 6.0.0.ER4 in size16 elasticity test beginning in hyperion:
http://www.qa.jboss.com/~mlinhard/hyperion/run24-bz765759

Comment 8 Michal Linhard 2012-03-19 19:51:32 UTC
Appearance with JDG 6.0.0.ER4 in resilience test:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/34/

Comment 9 Michal Linhard 2012-03-20 08:59:18 UTC
Yesterday I had a chat with dberindei on this. The problem here is that this is not an exception that occurs due to exceeded timeout while waiting on state transfer lock. In our tests we have the timeout set to 10 minutes and e.g. in the resilience test no state transfer took more than 10 sec.

This happens when the lock is denied immediately as a result of state transfer being started with different delays on different nodes (as explained in ISPN-1610) to prevent deadlock:

https://github.com/infinispan/infinispan/blob/master/core/src/main/java/org/infinispan/statetransfer/StateTransferLockImpl.java#L296

The situation that causes this exception is quite normal and expected, but the exception and the error message used to report it are not suitable.
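
To illustrate the distinction drawn above, here is a minimal Java sketch of a lock that reports an immediate denial separately from a real timeout, returning a result value instead of throwing, in the spirit of ISPN-1799. It is not the Infinispan implementation: the class StateTransferLockSketch, the AcquireResult enum, the method names, and in particular the denial rule (a command tagged with an older view than the one being transferred) are all assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not Infinispan code: every name here is invented for illustration.
final class StateTransferLockSketch {

    // Three distinct outcomes, so a caller never has to guess from an exception message.
    public enum AcquireResult { ACQUIRED, DENIED_IMMEDIATELY, TIMED_OUT }

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition transferDone = lock.newCondition();
    private boolean transferInProgress;
    private int blockingViewId = -1;

    // Mark a state transfer for the given view as running.
    public void startStateTransfer(int viewId) {
        lock.lock();
        try {
            transferInProgress = true;
            blockingViewId = viewId;
        } finally {
            lock.unlock();
        }
    }

    // Mark the state transfer as finished and wake up any waiting commands.
    public void endStateTransfer() {
        lock.lock();
        try {
            transferInProgress = false;
            transferDone.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public AcquireResult acquireForCommand(int commandViewId, long timeout, TimeUnit unit) {
        lock.lock();
        try {
            if (!transferInProgress) {
                return AcquireResult.ACQUIRED;
            }
            // Assumed denial rule: a command from an older view is refused at once,
            // without waiting, because waiting for it could deadlock.
            if (commandViewId < blockingViewId) {
                return AcquireResult.DENIED_IMMEDIATELY;
            }
            long nanosLeft = unit.toNanos(timeout);
            while (transferInProgress) {
                if (nanosLeft <= 0) {
                    return AcquireResult.TIMED_OUT; // the only case where the timeout really expired
                }
                nanosLeft = transferDone.awaitNanos(nanosLeft);
            }
            return AcquireResult.ACQUIRED;
        } catch (InterruptedException e) {
            // Simplification for this sketch: restore the flag and report as a timeout.
            Thread.currentThread().interrupt();
            return AcquireResult.TIMED_OUT;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        StateTransferLockSketch l = new StateTransferLockSketch();
        System.out.println(l.acquireForCommand(1, 10, TimeUnit.MILLISECONDS)); // ACQUIRED
        l.startStateTransfer(2);
        System.out.println(l.acquireForCommand(1, 10, TimeUnit.MINUTES));      // DENIED_IMMEDIATELY
        System.out.println(l.acquireForCommand(2, 50, TimeUnit.MILLISECONDS)); // TIMED_OUT
        l.endStateTransfer();
        System.out.println(l.acquireForCommand(2, 10, TimeUnit.MILLISECONDS)); // ACQUIRED
    }
}
```

In this sketch the second call returns at once even with a 10 minute timeout, which matches the observation that the error appeared although no state transfer took more than 10 sec. With separate outcomes, a caller could log the immediate denial under a message that does not say "Timed out".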

Comment 10 Michal Linhard 2012-03-20 09:46:57 UTC
Adding link to ISPN-1799 which formulates the problem in terms of previous comment.

Comment 11 JBoss JIRA Server 2012-03-21 04:13:57 UTC
Prabhat Jha <prabhat.jha> made a comment on jira ISPN-1799

Can this be fixed so that it gets fixed in next ER build for JDG?

Comment 12 Michal Linhard 2012-03-26 15:45:04 UTC
As this is marked to be fixed in Infinispan 5.2.0.FINAL, we'll be treating this as a known bug for JDG 6.0.0.ER6 (Beta).

Comment 13 JBoss JIRA Server 2012-03-28 07:31:38 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1799

Prabhat, I did try to get it into 5.1.3.FINAL, but it needs more polish and tests, so I'm not trying to push it in right now.

Comment 14 JBoss JIRA Server 2012-03-28 14:21:23 UTC
Tristan Tarrant <ttarrant> made a comment on jira ISPN-1799

Dan could you please try and get it in 5.1.4.CR1 ?

Comment 15 JBoss JIRA Server 2012-03-28 14:22:08 UTC
Tristan Tarrant <ttarrant> made a comment on jira ISPN-1799

Dan could you please try and get it in 5.1.4.CR1 or at least a patch for JDG

Comment 16 Michal Linhard 2012-04-04 11:04:32 UTC
Known issue for ER5, ER6, appeared in the tests.

Comment 17 mark yarborough 2012-04-04 13:01:45 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
CCFR - mlinhard

Comment 18 Michal Linhard 2012-04-04 15:12:45 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,10 @@
-CCFR - mlinhard
+Cause: This happens when the lock is denied immediately as a result of state transfer being started with different delays on different nodes, to prevent deadlock.
+
+Consequence: It results to this ERROR message in the server:
+"??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?"
+even though timeout hasn't expired.
+
+This however doesn't prevent normal cache operation and the errors cease after the cluster stabilises.
+
+Fix: This issue is still open
+Resolution: N/A

Comment 19 JBoss JIRA Server 2012-04-21 21:18:12 UTC
dex chen <dex80526> made a comment on jira ISPN-1799

We saw this error in our 3-node cluster, in which the network connectivity to one of the nodes is unreliable and has high latency. Does this exception have anything to do with the network connectivity and high latency?

Comment 20 JBoss JIRA Server 2012-04-23 07:06:35 UTC
Mircea Markus <mmarkus> updated the status of jira ISPN-1799 to Resolved

Comment 21 JBoss JIRA Server 2012-04-23 07:06:35 UTC
Mircea Markus <mmarkus> made a comment on jira ISPN-1799

integrated on 5.1 and master.

Comment 22 JBoss JIRA Server 2012-04-23 15:11:51 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1799

Dex, if you have high latency or high losses then your cluster could split into two or more partitions, and during the split/merge it's possible to see these StateTransferInProgressExceptions and you can ignore them.

However, you should try to avoid these cluster changes by increasing your FD timeout, as saturating the network with the state transfer might lead to yet another split.

Comment 23 Michal Linhard 2012-05-03 08:34:29 UTC
The StateTransferInProgressException doesn't occur anymore during cluster startup.

Comment 24 Misha H. Ali 2012-06-04 02:30:16 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,10 +1 @@
-Cause: This happens when the lock is denied immediately as a result of state transfer being started with different delays on different nodes, to prevent deadlock.
+Previously, when state transfer was started with different relays on different nodes, the lock was denied and a StateTransferInProgressException occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the StateTransferInProgressException no longer displays when a cluster starts up.
-
-Consequence: It results to this ERROR message in the server:
-"??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?"
-even though timeout hasn't expired.
-
-This however doesn't prevent normal cache operation and the errors cease after the cluster stabilises.
-
-Fix: This issue is still open
-Resolution: N/A

Comment 25 Misha H. Ali 2012-06-04 02:52:23 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, when state transfer was started with different relays on different nodes, the lock was denied and a StateTransferInProgressException occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the StateTransferInProgressException no longer displays when a cluster starts up.
+Previously, when state transfer was started with different relays on different nodes, the lock was denied and a <literal>StateTransferInProgressException</literal> occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the <literal>StateTransferInProgressException</literal> no longer displays when a cluster starts up.

Comment 30 Red Hat Bugzilla 2025-02-10 03:14:30 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.