https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/hotrod/serverlogs.zip
https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/memcached/serverlogs.zip
https://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-soak-test/6/artifact/report/rest/serverlogs.zip

In a four-node performance test we're seeing these exceptions during cluster startup:

org.infinispan.distribution.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view X

The timeout is set to 1 min, and there's no way for us to set it higher, because the AS7 Infinispan subsystem doesn't allow setting this config property. If I understand the nature of the exception correctly, even if there were a way to set a higher timeout, it's quite suspicious that a state transfer takes that long (> 1 min), since the caches are empty.
Created attachment 544455: Exceptions extracted from the server logs
I have linked https://issues.jboss.org/browse/AS7-2984 to track progress on adding support for configuring state transfer rehash waiting for distributed caches.
In the "workaround build" http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-edg-from-source/96/ I can't see this issue anymore.
Appeared in http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-client-stress-test-hotrod/220/artifact/report/size4/serverlogs.zip with 6.0.0.DR1
Appeared again in JDG 6.0.0.ER2: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard-perflab/60/console-perf02/
Another appearance with JDG 6.0.0.ER2: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard/213/console-edg-perf02/
Appearance with JDG 6.0.0.ER4 in size16 elasticity test beginning in hyperion: http://www.qa.jboss.com/~mlinhard/hyperion/run24-bz765759
Appearance with JDG 6.0.0.ER4 in resilience test: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-failover-dist-basic/34/
Yesterday I had a chat with dberindei about this. The problem is that this exception is not caused by an exceeded timeout while waiting on the state transfer lock. In our tests the timeout is set to 10 minutes, and in the resilience test, for example, no state transfer took more than 10 sec. The exception is thrown when the lock is denied immediately, which is done to prevent a deadlock when state transfer starts with different delays on different nodes (as explained in ISPN-1610): https://github.com/infinispan/infinispan/blob/master/core/src/main/java/org/infinispan/statetransfer/StateTransferLockImpl.java#L296 The situation that triggers the exception is quite normal and expected, but the exception type and the error message used to report it are not suitable.
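To illustrate the distinction made in the previous comment, here is a minimal Java sketch of a lock that can either deny a command at once or genuinely time out. It is not the Infinispan implementation: the class `StateTransferLockSketch`, the `Outcome` enum, the method names and the "command carries an older view id" condition are all assumptions made up for this example, and the real condition in `StateTransferLockImpl` may differ. The point is only that an immediate denial and an expired timeout are different outcomes and can be reported with different messages.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Infinispan code.
class StateTransferLockSketch {

    enum Outcome { ACQUIRED, DENIED, TIMED_OUT }

    private boolean transferInProgress;
    private int blockingViewId = -1;

    // Called when a state transfer for the given view starts on this node.
    synchronized void blockNewTransactions(int viewId) {
        transferInProgress = true;
        blockingViewId = viewId;
    }

    // Called when the state transfer has finished.
    synchronized void unblock() {
        transferInProgress = false;
        notifyAll();
    }

    // Assumption: a command tagged with an older view is denied at once
    // instead of waiting, so that nodes cannot block on each other.
    synchronized Outcome acquireForCommand(int commandViewId, long timeoutMillis) {
        if (transferInProgress && commandViewId < blockingViewId) {
            return Outcome.DENIED; // no waiting happened at all
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (transferInProgress) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return Outcome.TIMED_OUT; // the timeout really expired
            }
            try {
                TimeUnit.NANOSECONDS.timedWait(this, remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Outcome.TIMED_OUT;
            }
        }
        return Outcome.ACQUIRED;
    }

    // Reports each outcome with its own message instead of always saying "Timed out".
    static String describe(Outcome outcome, int viewId) {
        switch (outcome) {
            case DENIED:
                return "Command rejected without waiting, state transfer in progress for view " + viewId;
            case TIMED_OUT:
                return "Timed out waiting for the state transfer lock, state transfer in progress for view " + viewId;
            default:
                return "Lock acquired";
        }
    }

    public static void main(String[] args) {
        StateTransferLockSketch lock = new StateTransferLockSketch();
        lock.blockNewTransactions(2);
        long start = System.nanoTime();
        // 10 minute timeout, as in our tests; the denial still comes back immediately.
        Outcome outcome = lock.acquireForCommand(1, TimeUnit.MINUTES.toMillis(10));
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(describe(outcome, 2) + " (after " + elapsedMs + " ms)");
    }
}
```

With a 10 minute timeout configured, the denied call in `main` returns almost immediately, which matches what we observe in the logs: the "Timed out" text appears although nothing waited for the timeout.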
Adding link to ISPN-1799 which formulates the problem in terms of previous comment.
Prabhat Jha <prabhat.jha> made a comment on jira ISPN-1799 Can this be fixed in time for the next ER build for JDG?
As this is marked to be fixed in Infinispan 5.2.0.FINAL, we'll be treating this as a known bug for JDG 6.0.0.ER6 (Beta).
Dan Berindei <dberinde> made a comment on jira ISPN-1799 Prabhat, I did try to get it into 5.1.3.FINAL, but it needs more polish and tests, so I'm not trying to push it in right now.
Tristan Tarrant <ttarrant> made a comment on jira ISPN-1799 Dan could you please try and get it in 5.1.4.CR1 ?
Tristan Tarrant <ttarrant> made a comment on jira ISPN-1799 Dan, could you please try and get it into 5.1.4.CR1, or at least provide a patch for JDG?
Known issue for ER5, ER6, appeared in the tests.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: CCFR - mlinhard
Technical note updated. Diffed Contents:
@@ -1 +1,10 @@
-CCFR - mlinhard
+Cause: This happens when the lock is denied immediately as a result of state transfer being started with different delays on different nodes, to prevent deadlock.
+
+Consequence: It results to this ERROR message in the server:
+"??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?"
+even though timeout hasn't expired.
+
+This however doesn't prevent normal cache operation and the errors cease after the cluster stabilises.
+
+Fix: This issue is still open
+Resolution: N/A
dex chen <dex80526> made a comment on jira ISPN-1799 We saw this error in our 3-node cluster, in which the network connectivity to one of the nodes is unreliable and has high latency. Does this exception have anything to do with network connectivity and high latency?
Mircea Markus <mmarkus> updated the status of jira ISPN-1799 to Resolved
Mircea Markus <mmarkus> made a comment on jira ISPN-1799 Integrated into 5.1 and master.
Dan Berindei <dberinde> made a comment on jira ISPN-1799 Dex, if you have high latency or high losses then your cluster could split into two or more partitions, and during the split/merge it's possible to see these StateTransferInProgressExceptions and you can ignore them. However, you should try to avoid these cluster changes by increasing your FD timeout, as saturating the network with the state transfer might lead to yet another split.
The StateTransferInProgressException doesn't occur anymore during cluster startup.
Technical note updated. Diffed Contents:
@@ -1,10 +1 @@
-Cause: This happens when the lock is denied immediately as a result of state transfer being started with different delays on different nodes, to prevent deadlock.
+Previously, when state transfer was started with different relays on different nodes, the lock was denied and a StateTransferInProgressException occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the StateTransferInProgressException no longer displays when a cluster starts up.
-
-Consequence: It results to this ERROR message in the server:
-"??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?"
-even though timeout hasn't expired.
-
-This however doesn't prevent normal cache operation and the errors cease after the cluster stabilises.
-
-Fix: This issue is still open
-Resolution: N/A
Technical note updated. Diffed Contents:
@@ -1 +1 @@
-Previously, when state transfer was started with different relays on different nodes, the lock was denied and a StateTransferInProgressException occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the StateTransferInProgressException no longer displays when a cluster starts up.
+Previously, when state transfer was started with different relays on different nodes, the lock was denied and a <literal>StateTransferInProgressException</literal> occurred to prevent a deadlock. Despite the timeout not expiring, a "??:??:??,??? ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (undefined) ISPN000136: Execution error: org.infinispan.statetransfer.StateTransferInProgressException: Timed out waiting for the state transfer lock, state transfer in progress for view ?" error appeared on the server. This is fixed and the <literal>StateTransferInProgressException</literal> no longer displays when a cluster starts up.
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.