Bug 765759

Summary: StateTransferInProgressException during cluster startup
Product: [JBoss] JBoss Data Grid 6 Reporter: Michal Linhard <mlinhard>
Component: Infinispan Assignee: Tristan Tarrant <ttarrant>
Status: CLOSED UPSTREAM QA Contact: Nobody <nobody>
Severity: high Docs Contact:
Priority: high    
Version: 6.0.0 CC: jdg-bugs, nobody
Target Milestone: ---   
Target Release: 6.0.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, when state transfer was started with different delays on different nodes, the lock was denied and a <literal>StateTransferInProgressException</literal> occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the <literal>StateTransferInProgressException</literal> no longer displays when a cluster starts up.
Story Points: ---
Clone Of: Environment:
Last Closed: 2025-02-10 03:14:30 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Exceptions extracted from the server logs none

Description Michal Linhard 2011-12-09 10:57:31 UTC
https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/hotrod/serverlogs.zip
https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/memcached/serverlogs.zip
https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/rest/serverlogs.zip

In our four-node performance tests we're seeing these exceptions during cluster startup:

org.infinispan.distribution.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view X

The timeout is set to 1 min, and there's no way for us to set it to a higher value, because the AS7 Infinispan subsystem doesn't allow setting this config property.

If I understand the nature of the exception correctly, even if there were a way to set a higher timeout, it's quite suspicious that a state transfer takes that long (> 1 min), since the caches are empty.

Comment 1 Michal Linhard 2011-12-09 10:58:41 UTC
Created attachment 544455 [details]
Exceptions extracted from the server logs

Comment 2 Tristan Tarrant 2011-12-12 16:40:41 UTC
I have linked https://issues.jboss.org/browse/AS7-2984 to track progress on adding support for configuring state transfer rehash waiting for distributed caches.

Comment 3 Michal Linhard 2012-01-13 16:30:11 UTC
In the "workaround build"
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-edg-from-source/96/

I can't see this issue anymore.

Comment 6 Michal Linhard 2012-03-07 09:29:22 UTC
Another appearance with JDG 6.0.0.ER2:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard/213/console-edg-perf02/

Comment 7 Michal Linhard 2012-03-16 09:59:17 UTC
Appearance with JDG 6.0.0.ER4 in size16 elasticity test beginning in hyperion:
http://www.qa.jboss.com/~mlinhard/hyperion/run24-bz765759

Comment 8 Michal Linhard 2012-03-19 19:51:32 UTC
Appearance with JDG 6.0.0.ER4 in resilience test:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/34/

Comment 9 Michal Linhard 2012-03-20 08:59:18 UTC
Yesterday I had a chat with dberindei on this. The problem here is that this exception does not occur because a timeout was exceeded while waiting for the state transfer lock. In our tests we have the timeout set to 10 minutes and, e.g., in the resilience test no state transfer took more than 10 sec.

It happens when the lock is denied immediately, to prevent a deadlock, because state transfer was started with different delays on different nodes (as explained in ISPN-1610):

https://github.com/infinispan/infinispan/blob/master/core/src/main/java/org/infinispan/statetransfer/StateTransferLockImpl.java#L296

The situation that causes this exception to appear is quite normal and expected, but the exception and the error message by which it is reported are not suitable.
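
To make the distinction above concrete, here is a minimal sketch of a lock that reports "denied immediately" separately from "timed out". It is not the Infinispan StateTransferLockImpl: the class StateTransferLockSketch, the Outcome enum, and the view-id comparison used to decide the immediate denial are all assumptions made up for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch only; names and the denial rule are assumptions, not Infinispan code. */
class StateTransferLockSketch {

    /** Three distinct outcomes, so a denial is never reported as a timeout. */
    enum Outcome { ACQUIRED, DENIED_IMMEDIATELY, TIMED_OUT }

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition transferDone = lock.newCondition();
    private boolean transferInProgress;
    private int transferViewId = -1;

    /** Marks state transfer as running for the given view. */
    void startStateTransfer(int viewId) {
        lock.lock();
        try {
            transferInProgress = true;
            transferViewId = viewId;
        } finally {
            lock.unlock();
        }
    }

    /** Marks state transfer as finished and wakes up waiting commands. */
    void endStateTransfer() {
        lock.lock();
        try {
            transferInProgress = false;
            transferDone.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Tries to let a command proceed; commandViewId is the view its sender was in. */
    Outcome acquireForCommand(int commandViewId, long timeout, TimeUnit unit) {
        long nanosLeft = unit.toNanos(timeout);
        lock.lock();
        try {
            // Assumed rule: the sender has not yet started state transfer for our view,
            // so waiting for it could deadlock. Deny at once instead of waiting.
            if (transferInProgress && commandViewId < transferViewId) {
                return Outcome.DENIED_IMMEDIATELY;
            }
            while (transferInProgress) {
                if (nanosLeft <= 0) {
                    return Outcome.TIMED_OUT; // the only case where the timeout really expired
                }
                nanosLeft = transferDone.awaitNanos(nanosLeft);
            }
            return Outcome.ACQUIRED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for state transfer", e);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        StateTransferLockSketch l = new StateTransferLockSketch();
        l.startStateTransfer(2);
        // A command from the older view is refused without using the 10 minute timeout.
        System.out.println(l.acquireForCommand(1, 10, TimeUnit.MINUTES));
        // A command from the current view waits and then really times out.
        System.out.println(l.acquireForCommand(2, 20, TimeUnit.MILLISECONDS));
    }
}
```

With separate outcomes like these, the caller could log the immediate denial at a lower level and with its own message, and keep the "Timed out waiting for the state transfer lock" text for the case where the timeout actually expired.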

Comment 10 Michal Linhard 2012-03-20 09:46:57 UTC
Adding link to ISPN-1799 which formulates the problem in terms of previous comment.

Comment 11 JBoss JIRA Server 2012-03-21 04:13:57 UTC
Prabhat Jha <prabhat.jha> made a comment on jira ISPN-1799

Can this be fixed so that it gets fixed in next ER build for JDG?

Comment 12 Michal Linhard 2012-03-26 15:45:04 UTC
As this is marked to be fixed in Infinispan 5.2.0.FINAL, we'll be treating this as a known bug for JDG 6.0.0.ER6 (Beta).

Comment 13 JBoss JIRA Server 2012-03-28 07:31:38 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1799

Prabhat, I did try to get it into 5.1.3.FINAL, but it needs more polish and tests, so I'm not trying to push it in right now.

Comment 14 JBoss JIRA Server 2012-03-28 14:21:23 UTC
Tristan Tarrant <ttarrant> made a comment on jira ISPN-1799

Dan could you please try and get it in 5.1.4.CR1 ?

Comment 15 JBoss JIRA Server 2012-03-28 14:22:08 UTC
Tristan Tarrant <ttarrant> made a comment on jira ISPN-1799

Dan could you please try and get it in 5.1.4.CR1 or at least a patch for JDG

Comment 16 Michal Linhard 2012-04-04 11:04:32 UTC
Known issue for ER5, ER6, appeared in the tests.

Comment 17 mark yarborough 2012-04-04 13:01:45 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
CCFR - mlinhard

Comment 18 Michal Linhard 2012-04-04 15:12:45 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,10 @@
-CCFR - mlinhard
+Cause: This happens when the lock is denied immediately as a result of state transfer being started with different delays on different nodes, to prevent deadlock.
+
+Consequence: It results to this ERROR message in the server:
+"??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?"
+even though timeout hasn't expired.
+
+This however doesn't prevent normal cache operation and the errors cease after the cluster stabilises.
+
+Fix: This issue is still open
+Resolution: N/A

Comment 19 JBoss JIRA Server 2012-04-21 21:18:12 UTC
dex chen <dex80526> made a comment on jira ISPN-1799

We saw this error in our 3-node cluster, in which the network connectivity to one of the nodes is unreliable and has high latency. Does this exception have anything to do with the network connectivity and high latency?

Comment 20 JBoss JIRA Server 2012-04-23 07:06:35 UTC
Mircea Markus <mmarkus> updated the status of jira ISPN-1799 to Resolved

Comment 21 JBoss JIRA Server 2012-04-23 07:06:35 UTC
Mircea Markus <mmarkus> made a comment on jira ISPN-1799

integrated on 5.1 and master.

Comment 22 JBoss JIRA Server 2012-04-23 15:11:51 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1799

Dex, if you have high latency or high losses then your cluster could split into two or more partitions, and during the split/merge it's possible to see these StateTransferInProgressExceptions and you can ignore them.

However, you should try to avoid these cluster changes by increasing your FD timeout, as saturating the network with the state transfer might lead to yet another split.

Comment 23 Michal Linhard 2012-05-03 08:34:29 UTC
The StateTransferInProgressException doesn't occur anymore during cluster startup.

Comment 24 Misha H. Ali 2012-06-04 02:30:16 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,10 +1 @@
-Cause: This happens when the lock is denied immediately as a result of state transfer being started with different delays on different nodes, to prevent deadlock.
+Previously, when state transfer was started with different relays on different nodes, the lock was denied and a StateTransferInProgressException occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the StateTransferInProgressException no longer displays when a cluster starts up.
-
-Consequence: It results to this ERROR message in the server:
-"??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?"
-even though timeout hasn't expired.
-
-This however doesn't prevent normal cache operation and the errors cease after the cluster stabilises.
-
-Fix: This issue is still open
-Resolution: N/A

Comment 25 Misha H. Ali 2012-06-04 02:52:23 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, when state transfer was started with different relays on different nodes, the lock was denied and a StateTransferInProgressException occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the StateTransferInProgressException no longer displays when a cluster starts up.
+Previously, when state transfer was started with different relays on different nodes, the lock was denied and a <literal>StateTransferInProgressException</literal> occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the <literal>StateTransferInProgressException</literal> no longer displays when a cluster starts up.

Comment 30 Red Hat Bugzilla 2025-02-10 03:14:30 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.