Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 809042

Summary: View change causes state transfer timeout time to be reduced to RPC timeout time.
Product: [JBoss] JBoss Data Grid 6 Reporter: Michal Linhard <mlinhard>
Component: InfinispanAssignee: Tristan Tarrant <ttarrant>
Status: CLOSED NOTABUG QA Contact: Michal Linhard <mlinhard>
Severity: high Docs Contact:
Priority: high    
Version: 6.0.0CC: dberinde, jdg-bugs, mhusnain, nobody
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
RPC timeout indicates the maximum timeout allowed for remote user operations like GET and PUT. State transfer timeout contains the timeout value allowed for internal remote operations used by the state transfer process (for example, APPLY_STATE). DistributedSyncTimeout is used for cache view installations, which include state tranfers. As a result, the value set for distributedSyncTimeout should be greater than the value of stateTransfer.timeout. </para> <para> However, the state transfer timeout does not change during the state transfer. As a result, if a user operation fails with a TimeoutException, it is because rpcTimeout milliseconds passes, regardless of whether it occurs during a state transfer or not. </para> <para> In this instance, there is an unstable cluster in which nodes 2 and 3 consider themselves part of the cluster, while the remaining nodes in the cluster do not view them as part of the cluster. It takes over one minute to establish a common/stable JGroups cluster view and to perform a state transfer, therefore all pending operations fail.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-04-20 11:53:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Michal Linhard 2012-04-02 10:33:25 UTC
http://www.qa.jboss.com/~mlinhard/hyperion/run44-elas-dist/

lots of clients have get these exceptions:

org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for message id[99888] returned server error (status=0x86): org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end

While doing PUT.
State transfer timeout is set to 10min
RPC timeout is set to 1min. Here it occurs due to RPC timeout

because on server side the exceptions occur here:

05:31:39,449 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (MemcachedServerWorker-2-137) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end
	at org.infinispan.interceptors.StateTransferLockInterceptor.handleWithRetries(StateTransferLockInterceptor.java:218) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.StateTransferLockInterceptor.handleWriteCommand(StateTransferLockInterceptor.java:181) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.StateTransferLockInterceptor.visitPutKeyValueCommand(StateTransferLockInterceptor.java:152) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:124) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:130) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:89) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:61) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:944) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.CacheImpl.put(CacheImpl.java:657) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.DecoratedCache.put(DecoratedCache.java:179) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.AbstractDelegatingCache.put(AbstractDelegatingCache.java:119) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.put(AbstractProtocolDecoder.scala:196) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.decodeValue(AbstractProtocolDecoder.scala:150) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.decode(AbstractProtocolDecoder.scala:72) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.decode(AbstractProtocolDecoder.scala:45) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.transport.CustomReplayingDecoder.callDecode(CustomReplayingDecoder.java:250) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.transport.CustomReplayingDecoder.messageReceived(CustomReplayingDecoder.java:223) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.messageReceived(AbstractProtocolDecoder.scala:360) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:351) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:282) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:202) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44) [netty-3.2.6.Final-redhat-1.jar:]
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [rt.jar:1.6.0_30]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [rt.jar:1.6.0_30]
	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_30]

Comment 1 Misha H. Ali 2012-04-04 22:28:37 UTC
Who is assigned to provide CCFR information for this? I'd like to know who I should follow up with if this information remains blank.

Comment 2 Dan Berindei 2012-04-05 01:57:25 UTC
RPC timeout means the maximum amount of time allowed for remote user operations like GET and PUT. State transfer timeout means the amount of time allowed for internal remote operations that are used by the state transfer process (like APPLY_STATE). DistributedSyncTimeout is used for cache view installations, and cache view installation includes state transfer at the moment - so distributedSyncTimeout should be greater than stateTransfer.timeout.

But the timeout for user operations doesn't really change during state transfer. So if a particular user operation fails with a TimeoutException, it's because rpcTimeout milliseconds passed - it doesn't matter if that happened during a state transfer or not.

In this particular instance we had an unstable cluster and at several points in time nodes 2 and 3 thought they were part of the cluster but the other nodes didn't think the same way. Since it took more than 1 minute to establish a common/stable JGroups cluster view and perform state transfer, it is expected behaviour that all pending user operation would fail.

Comment 3 Misha H. Ali 2012-04-05 09:30:06 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
RPC timeout indicates the maximum timeout allowed for remote user operations like GET and PUT. State transfer timeout contains the timeout value allowed for internal remote operations used by the state transfer process (for example, APPLY_STATE). DistributedSyncTimeout is used for cache view installations, which include state tranfers. As a result, the value set for distributedSyncTimeout should be greater than the value of stateTransfer.timeout.

However, the state transfer timeout does not change during the state transfer. As a result, if a user operation fails with a TimeoutException, it is because rpcTimeout milliseconds passes, regardless of whether it occurs during a state transfer or not.

In this instance, there is an unstable cluster in which nodes 2 and 3 consider themselves part of the cluster, while the remaining nodes in the cluster do not view them as part of the cluster. It takes over one minute to establish a common/stable JGroups cluster view and to perform a state transfer, therefore all pending operations fail.

Comment 4 Misha H. Ali 2012-04-05 10:44:49 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,5 +1,7 @@
 RPC timeout indicates the maximum timeout allowed for remote user operations like GET and PUT. State transfer timeout contains the timeout value allowed for internal remote operations used by the state transfer process (for example, APPLY_STATE). DistributedSyncTimeout is used for cache view installations, which include state tranfers. As a result, the value set for distributedSyncTimeout should be greater than the value of stateTransfer.timeout.
-
+</para>
+<para>
 However, the state transfer timeout does not change during the state transfer. As a result, if a user operation fails with a TimeoutException, it is because rpcTimeout milliseconds passes, regardless of whether it occurs during a state transfer or not.
-
+</para>
+<para>
 In this instance, there is an unstable cluster in which nodes 2 and 3 consider themselves part of the cluster, while the remaining nodes in the cluster do not view them as part of the cluster. It takes over one minute to establish a common/stable JGroups cluster view and to perform a state transfer, therefore all pending operations fail.

Comment 5 JBoss JIRA Server 2012-04-13 09:10:24 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-1961 to Resolved

Comment 6 JBoss JIRA Server 2012-04-13 09:10:24 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1961

I'm rejecting this one since it's working as designed: user operations always use the sync.replTimeout setting, and not stateTransfer.timeout, even during state transfer.