Bug 809042 - View change causes state transfer timeout time to be reduced to RPC timeout time.
Summary: View change causes state transfer timeout time to be reduced to RPC timeout t...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.0.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: ---
Assignee: Tristan Tarrant
QA Contact: Michal Linhard
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-04-02 10:33 UTC by Michal Linhard
Modified: 2014-03-17 04:02 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
RPC timeout indicates the maximum timeout allowed for remote user operations like GET and PUT. State transfer timeout contains the timeout value allowed for internal remote operations used by the state transfer process (for example, APPLY_STATE). DistributedSyncTimeout is used for cache view installations, which include state tranfers. As a result, the value set for distributedSyncTimeout should be greater than the value of stateTransfer.timeout. </para> <para> However, the state transfer timeout does not change during the state transfer. As a result, if a user operation fails with a TimeoutException, it is because rpcTimeout milliseconds passes, regardless of whether it occurs during a state transfer or not. </para> <para> In this instance, there is an unstable cluster in which nodes 2 and 3 consider themselves part of the cluster, while the remaining nodes in the cluster do not view them as part of the cluster. It takes over one minute to establish a common/stable JGroups cluster view and to perform a state transfer, therefore all pending operations fail.
Clone Of:
Environment:
Last Closed: 2012-04-20 11:53:55 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-1961 0 Critical Resolved View change causes state transfer timeout to be reduced to rpc timeout time 2013-07-15 05:19:11 UTC

Description Michal Linhard 2012-04-02 10:33:25 UTC
http://www.qa.jboss.com/~mlinhard/hyperion/run44-elas-dist/

lots of clients have get these exceptions:

org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for message id[99888] returned server error (status=0x86): org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end

While doing PUT.
State transfer timeout is set to 10min
RPC timeout is set to 1min. Here it occurs due to RPC timeout

because on server side the exceptions occur here:

05:31:39,449 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (MemcachedServerWorker-2-137) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end
	at org.infinispan.interceptors.StateTransferLockInterceptor.handleWithRetries(StateTransferLockInterceptor.java:218) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.StateTransferLockInterceptor.handleWriteCommand(StateTransferLockInterceptor.java:181) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.StateTransferLockInterceptor.visitPutKeyValueCommand(StateTransferLockInterceptor.java:152) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:124) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:130) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:89) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:61) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:76) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:944) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.CacheImpl.put(CacheImpl.java:657) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.DecoratedCache.put(DecoratedCache.java:179) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.AbstractDelegatingCache.put(AbstractDelegatingCache.java:119) [infinispan-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.put(AbstractProtocolDecoder.scala:196) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.decodeValue(AbstractProtocolDecoder.scala:150) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.decode(AbstractProtocolDecoder.scala:72) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.decode(AbstractProtocolDecoder.scala:45) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.transport.CustomReplayingDecoder.callDecode(CustomReplayingDecoder.java:250) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.transport.CustomReplayingDecoder.messageReceived(CustomReplayingDecoder.java:223) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.infinispan.server.core.AbstractProtocolDecoder.messageReceived(AbstractProtocolDecoder.scala:360) [infinispan-server-core-5.1.3.ER6-redhat-1.jar:5.1.3.ER6-redhat-1]
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:351) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:282) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:202) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.2.6.Final-redhat-1.jar:]
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44) [netty-3.2.6.Final-redhat-1.jar:]
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [rt.jar:1.6.0_30]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [rt.jar:1.6.0_30]
	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_30]

Comment 1 Misha H. Ali 2012-04-04 22:28:37 UTC
Who is assigned to provide CCFR information for this? I'd like to know who I should follow up with if this information remains blank.

Comment 2 Dan Berindei 2012-04-05 01:57:25 UTC
RPC timeout means the maximum amount of time allowed for remote user operations like GET and PUT. State transfer timeout means the amount of time allowed for internal remote operations that are used by the state transfer process (like APPLY_STATE). DistributedSyncTimeout is used for cache view installations, and cache view installation includes state transfer at the moment - so distributedSyncTimeout should be greater than stateTransfer.timeout.

But the timeout for user operations doesn't really change during state transfer. So if a particular user operation fails with a TimeoutException, it's because rpcTimeout milliseconds passed - it doesn't matter if that happened during a state transfer or not.

In this particular instance we had an unstable cluster and at several points in time nodes 2 and 3 thought they were part of the cluster but the other nodes didn't think the same way. Since it took more than 1 minute to establish a common/stable JGroups cluster view and perform state transfer, it is expected behaviour that all pending user operation would fail.

Comment 3 Misha H. Ali 2012-04-05 09:30:06 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
RPC timeout indicates the maximum timeout allowed for remote user operations like GET and PUT. State transfer timeout contains the timeout value allowed for internal remote operations used by the state transfer process (for example, APPLY_STATE). DistributedSyncTimeout is used for cache view installations, which include state tranfers. As a result, the value set for distributedSyncTimeout should be greater than the value of stateTransfer.timeout.

However, the state transfer timeout does not change during the state transfer. As a result, if a user operation fails with a TimeoutException, it is because rpcTimeout milliseconds passes, regardless of whether it occurs during a state transfer or not.

In this instance, there is an unstable cluster in which nodes 2 and 3 consider themselves part of the cluster, while the remaining nodes in the cluster do not view them as part of the cluster. It takes over one minute to establish a common/stable JGroups cluster view and to perform a state transfer, therefore all pending operations fail.

Comment 4 Misha H. Ali 2012-04-05 10:44:49 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,5 +1,7 @@
 RPC timeout indicates the maximum timeout allowed for remote user operations like GET and PUT. State transfer timeout contains the timeout value allowed for internal remote operations used by the state transfer process (for example, APPLY_STATE). DistributedSyncTimeout is used for cache view installations, which include state tranfers. As a result, the value set for distributedSyncTimeout should be greater than the value of stateTransfer.timeout.
-
+</para>
+<para>
 However, the state transfer timeout does not change during the state transfer. As a result, if a user operation fails with a TimeoutException, it is because rpcTimeout milliseconds passes, regardless of whether it occurs during a state transfer or not.
-
+</para>
+<para>
 In this instance, there is an unstable cluster in which nodes 2 and 3 consider themselves part of the cluster, while the remaining nodes in the cluster do not view them as part of the cluster. It takes over one minute to establish a common/stable JGroups cluster view and to perform a state transfer, therefore all pending operations fail.

Comment 5 JBoss JIRA Server 2012-04-13 09:10:24 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-1961 to Resolved

Comment 6 JBoss JIRA Server 2012-04-13 09:10:24 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1961

I'm rejecting this one since it's working as designed: user operations always use the sync.replTimeout setting, and not stateTransfer.timeout, even during state transfer.


Note You need to log in before you can comment on or make changes to this bug.