http://www.qa.jboss.com/~mlinhard/hyperion/run44-elas-dist/ lots of clients have get these exceptions: org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for message id[99888] returned server error (status=0x86): org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end While doing PUT. State transfer timeout is set to 10min RPC timeout is set to 1min. Here it occurs due to RPC timeout because on server side the exceptions occur here: 05:31:39,449 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (MemcachedServerWorker-2-137) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end at org.infinispan.interceptors.StateTransferLockInterceptor.handleWithRetries(StateTransferLockInterceptor.java:218) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.interceptors.StateTransferLockInterceptor.handleWriteCommand(StateTransferLockInterceptor.java:181) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.interceptors.StateTransferLockInterceptor.visitPutKeyValueCommand(StateTransferLockInterceptor.java:152) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:124) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:130) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:89) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:61) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:944) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.CacheImpl.put(CacheImpl.java:657) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.DecoratedCache.put(DecoratedCache.java:179) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.AbstractDelegatingCache.put(AbstractDelegatingCache.java:119) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.server.core.AbstractProtocolDecoder.put(AbstractProtocolDecoder.scala:196) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.server.core.AbstractProtocolDecoder.decodeValue(AbstractProtocolDecoder.scala:150) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.server.core.AbstractProtocolDecoder.decode(AbstractProtocolDecoder.scala:72) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.server.core.AbstractProtocolDecoder.decode(AbstractProtocolDecoder.scala:45) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.server.core.transport.CustomReplayingDecoder.callDecode(CustomReplayingDecoder.java:250) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.server.core.transport.CustomReplayingDecoder.messageReceived(CustomReplayingDecoder.java:223) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.infinispan.server.core.AbstractProtocolDecoder.messageReceived(AbstractProtocolDecoder.scala:360) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1] at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:351) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:282) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:202) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.2.6.Final-redhat-1.jar:] at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44) [netty-3.2.6.Final-redhat-1.jar:] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [rt.jar:1.6.0_30] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [rt.jar:1.6.0_30] at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_30]
Who is assigned to provide CCFR information for this? I'd like to know who I should follow up with if this information remains blank.
RPC timeout means the maximum amount of time allowed for remote user operations like GET and PUT. State transfer timeout means the amount of time allowed for internal remote operations that are used by the state transfer process (like APPLY_STATE). DistributedSyncTimeout is used for cache view installations, and cache view installation includes state transfer at the moment - so distributedSyncTimeout should be greater than stateTransfer.timeout. But the timeout for user operations doesn't really change during state transfer. So if a particular user operation fails with a TimeoutException, it's because rpcTimeout milliseconds passed - it doesn't matter if that happened during a state transfer or not. In this particular instance we had an unstable cluster and at several points in time nodes 2 and 3 thought they were part of the cluster but the other nodes didn't think the same way. Since it took more than 1 minute to establish a common/stable JGroups cluster view and perform state transfer, it is expected behaviour that all pending user operation would fail.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: RPC timeout indicates the maximum timeout allowed for remote user operations like GET and PUT. State transfer timeout contains the timeout value allowed for internal remote operations used by the state transfer process (for example, APPLY_STATE). DistributedSyncTimeout is used for cache view installations, which include state tranfers. As a result, the value set for distributedSyncTimeout should be greater than the value of stateTransfer.timeout. However, the state transfer timeout does not change during the state transfer. As a result, if a user operation fails with a TimeoutException, it is because rpcTimeout milliseconds passes, regardless of whether it occurs during a state transfer or not. In this instance, there is an unstable cluster in which nodes 2 and 3 consider themselves part of the cluster, while the remaining nodes in the cluster do not view them as part of the cluster. It takes over one minute to establish a common/stable JGroups cluster view and to perform a state transfer, therefore all pending operations fail.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1,5 +1,7 @@ RPC timeout indicates the maximum timeout allowed for remote user operations like GET and PUT. State transfer timeout contains the timeout value allowed for internal remote operations used by the state transfer process (for example, APPLY_STATE). DistributedSyncTimeout is used for cache view installations, which include state tranfers. As a result, the value set for distributedSyncTimeout should be greater than the value of stateTransfer.timeout. - +</para> +<para> However, the state transfer timeout does not change during the state transfer. As a result, if a user operation fails with a TimeoutException, it is because rpcTimeout milliseconds passes, regardless of whether it occurs during a state transfer or not. - +</para> +<para> In this instance, there is an unstable cluster in which nodes 2 and 3 consider themselves part of the cluster, while the remaining nodes in the cluster do not view them as part of the cluster. It takes over one minute to establish a common/stable JGroups cluster view and to perform a state transfer, therefore all pending operations fail.
Dan Berindei <dberinde> updated the status of jira ISPN-1961 to Resolved
Dan Berindei <dberinde> made a comment on jira ISPN-1961 I'm rejecting this one since it's working as designed: user operations always use the sync.replTimeout setting, and not stateTransfer.timeout, even during state transfer.