Summary: | [ISPN-3375] Map/Reduce jobs with small value sizes fail because of timeouts putting intermediate keys into the cache | ||
---|---|---|---|
Product: | [JBoss] JBoss Data Grid 6 | Reporter: | Alan Field <afield> |
Component: | Embedded | Assignee: | Tristan Tarrant <ttarrant> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Martin Gencur <mgencur> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 6.2.0 | CC: | dmehra, gsheldon, jdg-bugs, mhusnain, myarboro, sjacobs |
Target Milestone: | --- | ||
Target Release: | 6.3.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Known Issue | |
Doc Text: |
<para>
When a Map/Reduce task is executed against a cache that contains many small values, the task may experience <literal>TimeoutException</literal> errors when writing the intermediate results to the cache. When these exceptions occur, the task fails.
</para>
<para>
Several workarounds are available for this issue, including:
<orderedlist>
<listitem>
<para>
Use larger values in the cache where the Map/Reduce task is executed.
</para>
</listitem>
<listitem>
<para>
Use a collator and/or a combiner on the Map/Reduce task where possible.
</para>
</listitem>
<listitem>
<para>
Configure the size of the intermediate chunks written to the cache using the state transfer <literal>chunkSize</literal> parameter.
</para>
</listitem>
</orderedlist>
</para>
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2015-01-26 14:05:48 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Bug Depends On: | |||
Bug Blocks: | 1104639 |
Description
Alan Field
2013-09-27 14:50:28 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3375

[~afield] can you still reproduce this? I'm not sure if TRACE logs are feasible, but they would be nice to have. If this appears or appeared in Jenkins, please add a link to the Jenkins job as well.

Dan Berindei <dberinde> made a comment on jira ISPN-3375

[~afield], let's try with org.infinispan.distexec, org.infinispan.interceptors.locking and org.infinispan.remoting.InboundInvocationHandlerImpl. Actually, since TRACE will slow things down considerably, maybe it would be best to do a test run without any TRACE logs, just to verify that the issue still appears.

Dan Berindei <dberinde> made a comment on jira ISPN-3375

The error with TRACE logs is a little different, but it still happens while inserting intermediary values in the temporary cache:

{noformat}
17:06:28,331 ERROR [org.radargun.stages.MapReduceStage] (pool-1-thread-1) executeMapReduceTask() returned an exception
org.infinispan.commons.CacheException: java.util.concurrent.ExecutionException: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from edg-perf07-26990, see cause for remote stack trace
Caused by: java.util.concurrent.ExecutionException: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from edg-perf07-26990, see cause for remote stack trace
Caused by: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from edg-perf07-26990, see cause for remote stack trace
Caused by: org.infinispan.commons.CacheException: org.infinispan.commons.CacheException: Could not move intermediate keys/values for M/R task 739f64b9-af84-4625-8023-5974fc835d92
Caused by: org.infinispan.commons.CacheException: Could not move intermediate keys/values for M/R task 739f64b9-af84-4625-8023-5974fc835d92
Caused by: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from edg-perf08-36579, see cause for remote stack trace
Caused by: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [tñùñ] for requestor [Thread[remote-thread-53,5,main]]! Lock held by [null]
{noformat}

The map/reduce task seems to generate a lot of traffic. Because it doesn't use a combiner (with {{combinedWith(Reducer)}}), after the map phase some of the intermediary keys end up having 400k values in them. All the intermediary values generated on a node for a key are inserted into the temporary cache with a single PutKeyValueCommand. At 5 bytes per value (4-byte int + 1-byte tag), the serialized command ends up weighing 2MB, and with all the other traffic in the cluster it takes more than 5s to transfer it.

Since the intermediary cache is non-transactional, for each PutKeyValueCommand the primary owner acquires the lock, forwards the command to the backup owner, and releases the lock only when it receives an answer from the backup. When there are more than 3 concurrent commands all trying to write to the same key, it's somewhat expected that some of them will take more than 10 seconds and time out. It's not as easy to explain the timeout in the builds without TRACE, since the replication timeout is 60s, but the excessive replication times could be caused by UNICAST2 and the large number of UFC credits.

It might be useful to add a {{chunkSize}} attribute to {{MapReduceTask}}, allowing the user to split the values inserted into the temporary cache. But I'm pretty sure the error could be eliminated with some changes in the test and its configuration:

* Add a combiner for the M/R task (should probably be the same as the reducer; see the sketch below).
* Replace UNICAST2 with UNICAST3.
* Reduce {{UFC.max_credits}} to 1m and reduce {{FRAG2.frag_size}} to 30k.

The huge traffic also caused nodes to be frequently suspected by FD_ALL (although only one, perf03, was confirmed by VERIFY_SUSPECT). FD_ALL is below NAKACK2 in the protocol stack, so heartbeats are not retransmitted. {{FD_ALL.timeout}} is 15s and {{FD_ALL.interval}} is 8s (the default), so losing a single heartbeat is enough for FD_ALL to suspect the node, and VERIFY_SUSPECT will confirm it after one more missed heartbeat. The interval should be much smaller, e.g. 4s.
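Below is a minimal sketch of the combiner suggestion above, using the MapReduceTask API available in Infinispan 5.2+/JDG 6.x. The word-count style WordMapper/SumReducer classes, the "testCache" name, and the infinispan.xml file are illustrative assumptions, not the classes or configuration used by the RadarGun MapReduceStage.

{code:java}
import java.util.Iterator;
import java.util.Map;

import org.infinispan.Cache;
import org.infinispan.distexec.mapreduce.Collector;
import org.infinispan.distexec.mapreduce.MapReduceTask;
import org.infinispan.distexec.mapreduce.Mapper;
import org.infinispan.distexec.mapreduce.Reducer;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class WordCountWithCombiner {

   // Word-count style mapper: emits one (word, 1) pair per token.
   static class WordMapper implements Mapper<String, String, String, Integer> {
      @Override
      public void map(String key, String value, Collector<String, Integer> collector) {
         for (String word : value.split("\\s+")) {
            collector.emit(word, 1);
         }
      }
   }

   // Summing reducer, reused as the combiner so each mapping node
   // pre-aggregates its values before the put into the intermediate cache.
   static class SumReducer implements Reducer<String, Integer> {
      @Override
      public Integer reduce(String reducedKey, Iterator<Integer> values) {
         int sum = 0;
         while (values.hasNext()) {
            sum += values.next();
         }
         return sum;
      }
   }

   public static void main(String[] args) throws Exception {
      EmbeddedCacheManager cm = new DefaultCacheManager("infinispan.xml"); // assumed configuration file
      Cache<String, String> cache = cm.getCache("testCache");              // assumed cache name

      Map<String, Integer> counts =
            new MapReduceTask<String, String, String, Integer>(cache)
                  .mappedWith(new WordMapper())
                  .combinedWith(new SumReducer())   // the suggested combiner
                  .reducedWith(new SumReducer())
                  .execute();

      System.out.println(counts);
      cm.stop();
   }
}
{code}

With a combiner in place, each mapping node writes a single pre-reduced value per intermediate key instead of hundreds of thousands of small values, which addresses the per-key value explosion described in the comment above.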
Alan Field <afield> made a comment on jira ISPN-3375

[~dan.berindei] The trace job was executed with a combiner. (Search for collatorFqn= in the logs to see it being passed to the MapReduceStage.) I have also run the job with fewer credits, and the timeout still occurs. I'll keep trying your suggestions. The error does go away if the values are larger, but I don't think that makes this less of a problem, since changing the value size is not always possible. I have been wondering if there is some way to avoid the contention in the intermediate cache by using key affinity and then combining the results in the reduce phase.

Alan Field <afield> made a comment on jira ISPN-3375

[~dan.berindei] OK, I have rerun the tests with the suggested changes to the JGroups configuration, and the TimeoutExceptions are still occurring. (With collator: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/user/afield@REDHAT.COM/my-views/view/afield's%20jobs/job/jdg-radargun-mapreduce-test/230/ and without collator: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/user/afield@REDHAT.COM/my-views/view/afield's%20jobs/job/jdg-radargun-mapreduce-test/229/) I can get another trace log, if you like.

Alan Field <afield> made a comment on jira ISPN-3375

Updated JGroups configuration file for M/R task.

Dan Berindei <dberinde> made a comment on jira ISPN-3375

[~afield], I think MapReduceStage still doesn't use a combiner. Note that the combiner and the collator are different things; the pipeline goes like this: entries -> mapper (on entry node) -> combiner (on entry node) -> reducer (on intermediate node) -> collator (on originator). OTOH, I'd really like to see a log with TRACE enabled for org.jgroups and org.infinispan.remoting, as 60 seconds should be plenty even with 2MB entries (especially with UNICAST3).

Alan Field <afield> made a comment on jira ISPN-3375

[~dan.berindei] Yes, I missed the addition of the combiner in Infinispan 5.2. I just sent a PR to add support for it in RadarGun, so I'll be able to see how it affects the performance of this scenario. I am also running this Jenkins job with TRACE enabled for org.jgroups and org.infinispan.remoting: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/user/afield@REDHAT.COM/my-views/view/afield's%20jobs/job/jdg-radargun-mapreduce-test/232/

Dan Berindei <dberinde> made a comment on jira ISPN-3375

I guess the logs were too much; it looks like the job was interrupted after 8 hours and it wasn't even finished inserting the initial data. Let's wait until we see the results with the PR integrated.

Alan Field <afield> made a comment on jira ISPN-3375

[~dan.berindei] Yeah, I could try a run with the previous log4j settings, but I have another bug I need to investigate in the short term.
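Dan's pipeline comment above distinguishes the combiner (set with combinedWith, runs on each mapping node) from the collator (passed to execute, runs only on the originator after the reduce phase). A minimal sketch of the collator side follows, reusing the hypothetical WordMapper/SumReducer types from the previous snippet; this Collator is an illustration, not the one referenced by collatorFqn= in the RadarGun job.

{code:java}
import java.util.Map;

import org.infinispan.distexec.mapreduce.Collator;
import org.infinispan.distexec.mapreduce.MapReduceTask;

// The collator only post-processes the already-reduced results on the
// originator node; unlike a combiner, it does not shrink the intermediate
// values written to the temporary cache during the map phase.
Integer totalWords = new MapReduceTask<String, String, String, Integer>(cache)
      .mappedWith(new WordMapper())
      .combinedWith(new SumReducer())
      .reducedWith(new SumReducer())
      .execute(new Collator<String, Integer, Integer>() {
         @Override
         public Integer collate(Map<String, Integer> reducedResults) {
            int total = 0;
            for (Integer count : reducedResults.values()) {
               total += count;
            }
            return total;
         }
      });
{code}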
Alan Field <afield> updated the status of jira ISPN-3375 to Reopened

Alan Field <afield> made a comment on jira ISPN-3375

I am reopening this issue, because changing the chunkSize has no impact on the TimeoutException.

Ditto what I said in the JIRA.

Fixing this requires some complex changes in the M/R framework that are not suitable within the scope of JDG 6.2.0. These were discussed with product management (Divya) and scheduled for JDG 6.3.

Alan Field <afield> updated the status of jira ISPN-3375 to Resolved

Verified that JDG 6.3.0 ER6 does not generate any timeout or suspect exceptions when small values are used in Map/Reduce jobs.

Excellent! Thanks
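For reference, the third workaround in the Doc Text above points at the state transfer chunkSize parameter. A minimal sketch of setting it programmatically follows; the value 128 and the configuration name are arbitrary, and how this configuration gets applied to the Map/Reduce intermediate cache depends on the JDG/Infinispan version, so treat this as an illustration rather than a verified recipe.

{code:java}
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

// Sketch only: smaller chunks mean smaller individual puts into the
// intermediate cache, at the cost of more round trips.
Configuration cfg = new ConfigurationBuilder()
      .clustering()
         .cacheMode(CacheMode.DIST_SYNC)
         .stateTransfer()
            .chunkSize(128)            // arbitrary example value
      .build();

EmbeddedCacheManager cm = new DefaultCacheManager();
cm.defineConfiguration("mr-intermediate-example", cfg); // hypothetical configuration name
{code}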