Bug 1017376

Summary: 4-hour REST soak test with JDG 6.2.0 ER1 starts getting TimeoutExceptions putting values and shows a large thread spike
Product: JBoss Data Grid 6
Component: Performance
Type: Bug
Reporter: Alan Field <afield>
Assignee: Nobody <nobody>
QA Contact: Martin Gencur <mgencur>
Status: VERIFIED
Severity: high
Priority: unspecified
Version: 6.2.0
Target Release: 6.2.0
CC: afield
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
The JGroups subsystem may deplete its thread pools, with all threads deadlocked waiting for credits. Requests to this node are not handled; the node itself may still issue requests, but it never gets the responses, which results in TimeoutExceptions from Infinispan. In this situation the node still remains a member of the cluster. This problem applies to both embedded and server mode. The workaround is to increase UFC.max_credits and MFC.max_credits to a high value (the suggested value is 10M) or to remove these protocols from the JGroups configuration entirely.
Bug Blocks: 1017190    

Description Alan Field 2013-10-09 18:10:35 UTC
Description of problem:
I executed a 4-hour soak test using the REST endpoint against the JDG 6.2.0 ER1 server. The job results are here:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-CS/job/jdg-perf-cs-soak-test-rest/7/

The soak test was scheduled to run for 240 minutes. The throughput and response time charts both show a huge drop in performance at 210 minutes:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-CS/job/jdg-perf-cs-soak-test-rest/7/artifact/report/response-time.png
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-CS/job/jdg-perf-cs-soak-test-rest/7/artifact/report/throughput.png

4 nodes in the cluster were running the server, 3 nodes were generating load, and 1 node was controlling the test. The load-generating nodes started getting "java.net.SocketTimeoutException: Read timed out" errors, and the server nodes started logging this stack trace:

20:37:03,091 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (http-/172.18.1.1:8080-54) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Node node04/default timed out
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:174) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:508) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:288) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.distribution.BaseDistributionInterceptor.handleNonTxWriteCommand(BaseDistributionInterceptor.java:173) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.distribution.NonTxDistributionInterceptor.visitReplaceCommand(NonTxDistributionInterceptor.java:115) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:54) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.EntryWrappingInterceptor.invokeNextAndApplyChanges(EntryWrappingInterceptor.java:292) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.EntryWrappingInterceptor.setSkipRemoteGetsAndInvokeNextForDataCommand(EntryWrappingInterceptor.java:372) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.EntryWrappingInterceptor.visitReplaceCommand(EntryWrappingInterceptor.java:211) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:54) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor.visitReplaceCommand(NonTransactionalLockingInterceptor.java:98) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:54) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:112) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.AbstractVisitor.visitReplaceCommand(AbstractVisitor.java:42) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:54) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:112) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.AbstractVisitor.visitReplaceCommand(AbstractVisitor.java:42) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:54) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.statetransfer.StateTransferInterceptor.handleNonTxWriteCommand(StateTransferInterceptor.java:206) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.statetransfer.StateTransferInterceptor.visitReplaceCommand(StateTransferInterceptor.java:148) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:54) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.CacheMgmtInterceptor.updateStoreStatistics(CacheMgmtInterceptor.java:148) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.CacheMgmtInterceptor.visitReplaceCommand(CacheMgmtInterceptor.java:139) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:54) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:106) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:70) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.AbstractVisitor.visitReplaceCommand(AbstractVisitor.java:42) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:54) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:321) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1298) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.CacheImpl.replaceInternal(CacheImpl.java:958) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.CacheImpl.replace(CacheImpl.java:949) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.CacheImpl.replace(CacheImpl.java:1361) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.AbstractDelegatingAdvancedCache.replace(AbstractDelegatingAdvancedCache.java:191) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.rest.Server.putOrReplace(Server.scala:426) [infinispan-server-rest-6.0.0.Beta1-redhat-1-classes.jar:]
        at org.infinispan.rest.Server.org$infinispan$rest$Server$$putInCache(Server.scala:393) [infinispan-server-rest-6.0.0.Beta1-redhat-1-classes.jar:]
        at org.infinispan.rest.Server$$anonfun$putEntry$1.apply(Server.scala:373) [infinispan-server-rest-6.0.0.Beta1-redhat-1-classes.jar:]
        at org.infinispan.rest.Server$$anonfun$putEntry$1.apply(Server.scala:357) [infinispan-server-rest-6.0.0.Beta1-redhat-1-classes.jar:]
        at org.infinispan.rest.Server.protectCacheNotFound(Server.scala:500) [infinispan-server-rest-6.0.0.Beta1-redhat-1-classes.jar:]
        at org.infinispan.rest.Server.putEntry(Server.scala:357) [infinispan-server-rest-6.0.0.Beta1-redhat-1-classes.jar:]
        at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) [:1.7.0_40]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_40]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_40]
        at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:167) [resteasy-jaxrs-2.3.6.Final-redhat-1.jar:2.3.6.Final-redhat-1]
        at org.jboss.resteasy.core.ResourceMethod.invokeOnTarget(ResourceMethod.java:269) [resteasy-jaxrs-2.3.6.Final-redhat-1.jar:2.3.6.Final-redhat-1]
        at org.jboss.resteasy.plugins.server.servlet.ServletContainerDispatcher.service(ServletContainerDispatcher.java:208) [resteasy-jaxrs-2.3.6.Final-redhat-1.jar:2.3.6.Final-redhat-1]
        at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:55) [resteasy-jaxrs-2.3.6.Final-redhat-1.jar:2.3.6.Final-redhat-1]
        at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:50) [resteasy-jaxrs-2.3.6.Final-redhat-1.jar:2.3.6.Final-redhat-1]
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:847) [jboss-servlet-api_3.0_spec-1.0.2.Final-redhat-1.jar:1.0.2.Final-redhat-1]
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:295) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:653) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920) [jbossweb-7.2.2.Final-redhat-1.jar:7.2.2.Final-redhat-1]
        at java.lang.Thread.run(Thread.java:724) [rt.jar:1.7.0_40]
Caused by: org.jgroups.TimeoutException: timeout sending message to node04/default
        at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:419) [jgroups-3.4.0.Beta1-redhat-1.jar:3.4.0.Beta1-redhat-1]
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:353) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167) [infinispan-core-6.0.0.Beta1-redhat-1.jar:6.0.0.Beta1-redhat-1]
        ... 72 more
 

I looked through the JVM stats for each node in the cluster and found a huge spike in threads on the server nodes at the same time (from 850 threads to 1600 threads):

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-CS/job/jdg-perf-cs-soak-test-rest/7/artifact/report/rest/jvmstats/cluster_nodes/THREAD_COUNT.png 
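
For context, the load generators drive the grid through Infinispan's REST endpoint with plain HTTP PUTs. Below is a minimal sketch of one such put; the /rest/{cache}/{key} URL layout follows the Infinispan REST API, but the cache name ("default"), key, timeout, and payload size are illustrative assumptions, not taken from the job configuration:

{noformat}
// Hedged sketch of a single REST put like the soak test issues.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestPut {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://172.18.1.1:8080/rest/default/key42");
        HttpURLConnection c = (HttpURLConnection) url.openConnection();
        c.setRequestMethod("PUT");
        c.setDoOutput(true);
        c.setRequestProperty("Content-Type", "application/octet-stream");
        c.setReadTimeout(60000); // where the generators hit "Read timed out"
        try (OutputStream os = c.getOutputStream()) {
            os.write(new byte[8192]); // illustrative payload
        }
        System.out.println("HTTP " + c.getResponseCode());
        c.disconnect();
    }
}
{noformat}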

Comment 2 Alan Field 2013-10-11 14:05:27 UTC
According to Radim, this is another instance of the JGroups deadlock described in JGRP-1675.

Comment 3 Mircea Markus 2013-10-11 14:07:41 UTC
Thanks, that should explain it.

Comment 4 Mircea Markus 2013-10-11 14:08:54 UTC
BTW, is there any way we can get a thread dump in such a situation? That would help a lot with the investigation.

Comment 5 Alan Field 2013-10-11 17:23:31 UTC
This execution of the REST soak test, run with increased FC credits, completed without the TimeoutExceptions and the thread spike:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/JDG/view/PERF-CS/job/jdg-perf-cs-soak-test-rest/29

Mircea, I don't believe we have an automated way to generate a thread dump, but there is a Byteman script that is triggered when the thread pool reaches a certain size. I am going to try to run the job again with the original configuration to see if I can trigger that script.
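
(Not the Byteman script itself, but a minimal sketch of the same idea using only standard JMX APIs; the 1200-thread trigger level is an assumption sitting between the observed 850 and 1600 threads:)

{noformat}
// Hedged sketch: poll the JVM thread count and print a full thread dump
// (similar to jstack output) once it crosses a threshold.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadSpikeDumper {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        while (mx.getThreadCount() <= 1200) { // assumed trigger level
            Thread.sleep(5000);
        }
        // Dump all threads, including held monitors and synchronizers
        for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
            System.out.print(ti);
        }
    }
}
{noformat}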

Comment 6 Martin Gencur 2013-10-14 11:54:38 UTC
It appears that setting MFC and UFC max_credits to 10M, or removing these protocols entirely, works around this issue. Removing this issue from the Beta blockers since it has a workaround.
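
For illustration, the relevant fragment of a JGroups stack with that workaround applied (a sketch, not a full configuration; min_threshold shown at the 0.40 value mentioned later in this bug):

{noformat}
<!-- Raise the flow-control credits (or delete these two protocols entirely) -->
<UFC max_credits="10000000" min_threshold="0.4"/>
<MFC max_credits="10000000" min_threshold="0.4"/>
{noformat}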

Comment 7 JBoss JIRA Server 2013-10-17 20:09:49 UTC
Dan Berindei <dberinde> made a comment on jira JGRP-1675

I have a test that hangs reliably on my machine with 3.4.0.Final: https://github.com/danberindei/infinispan/blob/73ca65f9b629698da68916c720ec4efa0d97e2b4/core/src/test/java/org/infinispan/stress/RemoteGetStressTest.java

The test is using Infinispan, but it should be pretty easy to replicate it with JGroups only if necessary. This is what it does:
* Populate a distributed cache with huge values (100 KB), making sure the coordinator doesn't own any entries.
* Launch 500 remote get operations in parallel from a single node, resulting in 1500 response messages (numOwners is 3).
* Slow down the processing of response messages by adding a {{Thread.sleep(Random.nextInt(10))}} to keep the OOB threads busy.
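
The delay in the last bullet is just a randomized sleep on the response path; a minimal sketch of that pattern (not Dan's actual test code, which lives at the link above):

{noformat}
// Hedged sketch of the OOB-busying delay: each response handler sleeps a
// random 0-9 ms before doing its real work, keeping the OOB thread occupied.
import java.util.Random;

public class OobDelay {
    private static final Random RANDOM = new Random();

    static void processResponse(byte[] response) throws InterruptedException {
        Thread.sleep(RANDOM.nextInt(10)); // 0-9 ms
        // ... real de-serialization / handling would go here
    }
}
{noformat}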

Interestingly enough, with max_credits=1m and min_threshold=0.40, the FC logs always show at least 390k remaining credits. Yet the remote get operations still time out after 60 seconds, and the OOB threads on most nodes are stuck in {{FlowControl.decrementIfEnoughCredits()}}.

Comment 8 JBoss JIRA Server 2013-10-17 20:29:17 UTC
Bela Ban <bela> updated the status of jira JGRP-1675 to Reopened

Comment 11 JBoss JIRA Server 2013-10-30 14:48:57 UTC
Bela Ban <bela> made a comment on jira JGRP-1675

The test is on branch JGRP-1675: https://github.com/belaban/JGroups/blob/JGRP-1675/tests/stress/org/jgroups/tests/RemoteGetStressTest.java

Comment 12 JBoss JIRA Server 2013-10-30 16:00:22 UTC
Bela Ban <bela> made a comment on jira JGRP-1675

Tried to replicate the stress test in JGroups ({{RemoteGetStressTest}} is attached). With 500 threads, the test completes in 60 to 90 seconds. This is with a random delay of 1 to 10 ms de-serializing the BigObject, and with a DISCARD up-rate of 20%.

If {{USE_SLEEPS}} is changed to false, the test finishes within a few seconds (between 2 and 5 secs). This is with an OOB pool of min=1/max=5/no-queue.

The delaying and discarding certainly slow things down, but I never experienced a deadlock.

Comment 13 JBoss JIRA Server 2013-11-05 10:09:50 UTC
Bela Ban <bela> made a comment on jira JGRP-1675

Works (= doesn't block) with jgroups-udp.radim.xml. The difference is the larger thread pools in Radim's config.

Comment 14 JBoss JIRA Server 2013-11-05 10:11:17 UTC
Bela Ban <bela> made a comment on jira JGRP-1675

The test to see if this blocks or not is Dan's UPerf2 (attached).

Comment 15 JBoss JIRA Server 2013-11-05 10:20:26 UTC
Dan Berindei <dberinde> made a comment on jira JGRP-1675

I was able to reproduce the deadlock again with an updated test, here: https://github.com/danberindei/JGroups/blob/JGRP-1675/tests/stress/org/jgroups/tests/RemoteGetStressTest.java

It looks like a regular message can be processed on the INT thread pool:

{noformat}
"INT-2,cluster,D@1395" prio=5 tid=0x32 nid=NA waiting
  java.lang.Thread.State: WAITING
	  at java.lang.Object.wait(Object.java:-1)
	  at org.jgroups.protocols.FlowControl$Credit.decrementIfEnoughCredits(FlowControl.java:567)
	  at org.jgroups.protocols.UFC.handleDownMessage(UFC.java:121)
	  at org.jgroups.protocols.FlowControl.down(FlowControl.java:330)
	  at org.jgroups.protocols.FlowControl.down(FlowControl.java:340)
	  at org.jgroups.protocols.FRAG2.fragment(FRAG2.java:243)
	  at org.jgroups.protocols.FRAG2.down(FRAG2.java:122)
	  at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1024)
	  at org.jgroups.JChannel.down(JChannel.java:760)
	  at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:684)
	  at org.jgroups.blocks.RequestCorrelator.sendResponse(RequestCorrelator.java:516)
	  at org.jgroups.blocks.RequestCorrelator.sendReply(RequestCorrelator.java:506)
	  at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:476)
	  at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:374)
	  at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:247)
	  at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:666)
	  at org.jgroups.JChannel.up(JChannel.java:730)
	  at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1019)
	  at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
	  at org.jgroups.protocols.FlowControl.up(FlowControl.java:434)
	  at org.jgroups.protocols.FlowControl.up(FlowControl.java:434)
	  at org.jgroups.stack.Protocol.up(Protocol.java:409)
	  at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
	  at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:796)
	  at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:752)
	  at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:466)
	  at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:662)
	  at org.jgroups.stack.Protocol.up(Protocol.java:409)
	  at org.jgroups.protocols.TP.passBatchUp(TP.java:1422)
	  at org.jgroups.protocols.TP$BatchHandler.run(TP.java:1574)
	  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	  at java.lang.Thread.run(Thread.java:744)
{noformat}

While debugging, I was able to look at the message batch in UNICAST3.handleBatchReceived, and see that it still has a REPLENISH message waiting to be delivered.

Comment 16 JBoss JIRA Server 2013-11-05 12:33:51 UTC
Dan Berindei <dberinde> made a comment on jira JGRP-1675

There seems to be another scenario where the deadlock happens right at the beginning of the test and the originators don't receive any complete responses:

UFC REPLENISH messages are usually sent after a message has been delivered to the application. If the application also wants to send a response message, and the response message can't be sent because of insufficient credits, the node won't send REPLENISH messages.

The thread that is trying to send the response will periodically wake up and send a CREDIT_REQUEST message. But the CREDIT_REQUEST message doesn't have the OOB flag, so it will be blocked in UNICAST3 if the recipient is also overloaded and has already dropped some OOB messages.

Comment 17 JBoss JIRA Server 2013-11-05 13:21:46 UTC
Radim Vansa <rvansa> made a comment on jira JGRP-1675

[~dan.berindei]: This is the original description of this bug :) CREDIT_REQUEST is not marked as OOB on purpose (there's even a comment about that where the message is created): the sender (which sends the credit request) should be blocked and NOT send any more messages until the previous ones get processed. The C_REQ shouldn't be needed at all, because the receiver should track the credit amount on the sender side and, as it approaches the threshold, send the sender a credit replenishment (and this is OOB).
As for the deadlock: the question is why the messages are not eventually retransmitted so that the request is processed - retransmissions are not blocked by FlowControl.
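
A simplified sketch of the receiver-side accounting Radim describes (the class and field names are illustrative, not the actual FlowControl internals):

{noformat}
// Hedged sketch: the receiver tracks how many credits each sender has left
// and proactively sends an OOB REPLENISH once delivery eats through the
// threshold, so the sender never needs a CREDIT_REQUEST in the common case.
import java.util.HashMap;
import java.util.Map;

class ReceiverCredits {
    final long maxCredits;     // e.g. UFC.max_credits
    final double minThreshold; // e.g. 0.40
    private final Map<String, Long> remaining = new HashMap<String, Long>();

    ReceiverCredits(long maxCredits, double minThreshold) {
        this.maxCredits = maxCredits;
        this.minThreshold = minThreshold;
    }

    // Called after a message of the given size was delivered to the application.
    synchronized void onDelivered(String sender, long bytes) {
        Long cur = remaining.get(sender);
        long left = (cur == null ? maxCredits : cur) - bytes;
        if (left <= maxCredits * minThreshold) {
            remaining.put(sender, maxCredits);        // top the sender back up...
            sendReplenish(sender, maxCredits - left); // ...and tell it, as OOB
        } else {
            remaining.put(sender, left);
        }
    }

    void sendReplenish(String sender, long credits) { /* transport elided */ }
}
{noformat}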

Comment 18 JBoss JIRA Server 2013-11-05 14:29:18 UTC
Bela Ban <bela> made a comment on jira JGRP-1675

UPerf2's GET request uses GET_FIRST on 2 members. This clogs up the receivers' OOB thread pools, and thus credits are dropped. Sending the CREDIT_REQUESTs as OOB fixes this, but performance is still slow because of the exhausted OOB pool.

I suggest sending the GET as a unicast to the owner and, if the owner doesn't respond within N ms (perhaps computed as 30% over the rolling average time for GETs), trying the other owner (essentially staggered GETs, as suggested by Manik).

Alternatively, a GET could be an anycast, but the target list should include only the primary owner; conceptually this is the same as above.
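
A minimal sketch of the staggered GET idea (not Infinispan or JGroups code; the Transport interface and owner list are hypothetical stand-ins):

{noformat}
// Hedged sketch: ask the primary owner first and only fall back to the
// backup owner when no response arrives within the stagger timeout.
import java.util.List;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StaggeredGet {
    interface Transport<V> { Future<V> ask(String owner); }

    static <V> V get(List<String> owners, Transport<V> transport, long staggerMs)
            throws Exception {
        Future<V> primary = transport.ask(owners.get(0));
        try {
            return primary.get(staggerMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException slowPrimary) {
            // Primary is slow; a real implementation would race both futures
            // and take the first answer. This sketch just waits for the backup.
            return transport.ask(owners.get(1)).get();
        }
    }
}
{noformat}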

Comment 20 JBoss JIRA Server 2013-11-05 14:57:59 UTC
Radim Vansa <rvansa> made a comment on jira JGRP-1675

The credits (REPLENISH message) are not dropped; they are not sent at all. As the OOB TP is depleted, there is no delivered message that would trigger the REPLENISH. The CREDIT_REQUEST, which could trigger the REPLENISH, is not delivered either (it is processed in the INT TP) because it is moved to the UNICASTx tables and waits for sequential delivery to FlowControl after the rest of the OOB messages.

Comment 21 JBoss JIRA Server 2013-11-05 16:31:14 UTC
Bela Ban <bela> made a comment on jira JGRP-1675

Both RemoteGetStressTest and UPerf2 pass now.

Comment 22 JBoss JIRA Server 2013-11-05 20:07:20 UTC
Dan Berindei <dberinde> made a comment on jira JGRP-1675

[~rvansa] Sorry, I thought the original issue was fixed when the internal thread pool was enabled so I didn't read the description again.

[~belaban], there is still the issue of regular messages being processed on the INT thread pool and potentially blocking the processing of REPLENISH messages. Looks like the INTERNAL messages are correctly separated into a batch in TP.receive(), but then in UNICAST3, Table.removeMany() can return a list containing all kinds of messages - there's no way to request just the messages marked INTERNAL, and the INT thread will process all of them.
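
A hypothetical sketch of the separation Dan is asking for; {{Table.removeMany()}} has no such overload, so in this sketch the split happens on the returned list ({{Message.isFlagSet()}} and {{Message.Flag.INTERNAL}} are real JGroups API, the surrounding method is not):

{noformat}
// Hedged sketch: split a delivered batch by flag so the INT pool only ever
// processes messages actually marked INTERNAL.
import java.util.ArrayList;
import java.util.List;
import org.jgroups.Message;

public class BatchSplitter {
    static List<Message> takeInternal(List<Message> batch, List<Message> regular) {
        List<Message> internal = new ArrayList<Message>();
        for (Message m : batch) {
            if (m.isFlagSet(Message.Flag.INTERNAL))
                internal.add(m);   // deliver on the INT pool
            else
                regular.add(m);    // deliver on a regular/OOB thread
        }
        return internal;
    }
}
{noformat}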

Comment 24 JBoss JIRA Server 2013-11-06 06:29:02 UTC
Bela Ban <bela> made a comment on jira JGRP-1675

The internal pool doesn't exist in 3.2; the fix is only applicable to 3.5.

Comment 25 JBoss JIRA Server 2013-11-06 06:42:34 UTC
Bela Ban <bela> made a comment on jira JGRP-1675

[~dan.berindei] Yes, it is possible that an INT or OOB thread will process messages with different flags in UNICAST or NAKACK. This is as designed, and the probability of this happening is proportional to the thread pool sizes and the types of messages received. So if we have a large OOB pool and receive a lot of OOB messages and only a few regular messages, then chances are that the regular messages will be batched together with the OOB messages in UNICAST or NAKACK and will get passed up to the application via an OOB thread.

Comment 28 JBoss JIRA Server 2013-11-06 08:26:53 UTC
Radim Vansa <rvansa> made a comment on jira JGRP-1675

I'd prefer JGroups to always work, rather than to work with a probability dependent on the INT:OOB thread pool ratio, message frequency, or dark magic. The INT TP has much lower value if regular messages, which are non-blocking from the application's perspective, can get the thread stuck and lead to thread pool depletion. I agree with Dan that sorting messages according to flags would make the system more reliable.

[~belaban]: -Could you specify what the fix is (as this is fixed in 3.5)? Marking the FlowControl messages as DONT_BUNDLE, telling ISPN developers to fix the GET_FIRST flooding, or something else? Does the comment about the two tests passing mean "passing as long as these send the get with GET_ALL" (the number of targets doesn't matter)?-
OK, I've found in another mail that the fix is marking the credit request as INT | OOB, but it should be mentioned more clearly somewhere in the JIRA.
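
In JGroups terms, that fix presumably amounts to something like the following sketch (the actual FlowControl change may differ; {{Message}} and its flags are real API, the wrapper method is hypothetical):

{noformat}
// Mark the credit request both OOB (bypass ordering) and INTERNAL (run on
// the INT pool) so it can never queue behind dropped OOB messages.
import org.jgroups.Address;
import org.jgroups.Message;

class CreditRequestFix {
    static Message buildCreditRequest(Address receiver) {
        return new Message(receiver)
                .setFlag(Message.Flag.OOB, Message.Flag.INTERNAL);
    }
}
{noformat}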

Comment 29 JBoss JIRA Server 2013-12-10 13:49:34 UTC
Bela Ban <bela> updated the status of jira JGRP-1675 to Resolved