Bug 534979 (RHQ-1722)

Summary: Way too many getAvailability threads being spawned
Product: [Other] RHQ Project Reporter: Charles Crouch <ccrouch>
Component: Plugin ContainerAssignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: high    
Version: unspecifiedCC: cwelton, hbrock, jshaughn, mazz
Target Milestone: ---Keywords: SubBug
Target Release: ---   
Hardware: All   
OS: All   
URL: http://jira.rhq-project.org/browse/RHQ-1722
Whiteboard:
Fixed In Version: 1.4 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-05-05 11:18:58 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 565628, 741450    
Attachments:
Description Flags
blocking_thread_dump.txt
none
server.log.zip none

Description Charles Crouch 2009-03-04 15:38:00 EST
Due to the locking problem described here: https://jira.jboss.org/jira/browse/JBAS-6546, carrying out certain operations on the JBAS5 plugin will block, e.g. when deleting a datasource.
In this case if you do a threaddump of the JVM (this is through the embedded console so it contains the JBAS instance being monitored plus the plugin container) you will see a large number of threads blocked calling

org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability(ManagedDeploymentComponent.java:133)

I see currently >300 such threads on my app server. A new thread is added every few seconds. Regardless of whether our availability check hangs (which it appears to be) we should not be continually creating new threads.

In the logs I'm seeing things like...

2009-03-04 13:56:42,296 DEBUG [org.rhq.core.pc.inventory.AvailabilityExecutor] (InventoryManager.availability-1) Failed to collect availability on resource Resource[id=-21, type=JBoss Service (SAR), key=vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/jsr88-service.xml, name=jsr88-service.xml, version=?] (call timed out)
2009-03-04 13:56:42,296 DEBUG [org.jboss.remoting.InvokerRegistry] (ResourceContainer.invoker.daemon-223) decremented org.jboss.remoting.transport.local.LocalClientInvoker@77075d's count, current count 2
...

2009-03-04 14:28:57,562 DEBUG [org.rhq.core.pc.inventory.AvailabilityExecutor] (InventoryManager.availability-1) Failed to collect availability on resource Resource[id=-46, type=JBoss Service (SAR), key=vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/ejb2-timer-service.xml, name=ejb2-timer-service.xml, version=?] (call timed out)
2009-03-04 14:28:57,562 DEBUG [org.jboss.remoting.InvokerRegistry] (ResourceContainer.invoker.daemon-547) decremented org.jboss.remoting.transport.local.LocalClientInvoker@77075d's count, current count 323


There are 547-223=324 threads in the dump, so it looks like every time the avail check times out a new thread is generated. In the log snippet above ResourceContainer.invoker.daemon-547 thread is not mentioned in the log anywhere before this message.
Comment 1 Charles Crouch 2009-03-04 15:41:43 EST
blocking_thread_dump.txt contains the output of the ServerInfo threaddump method
Comment 2 Charles Crouch 2009-03-04 15:47:24 EST
Could be related to RHQ-1720
Comment 3 Charles Crouch 2009-03-05 10:19:10 EST
server.log.zip is the entire server (and embedded console) log over the period when this problem was seen.
Comment 4 John Mazzitelli 2009-03-05 11:03:13 EST
More logs of interest:

2009-03-04 13:57:49,062 DEBUG [org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler] (InventoryManager.availability-1) Call to org.rhq.plugins.jbossas5.ManagedComponentComponent.getAvailability() with args [] timed out. Interrupting the invocation thread...

2009-03-04 13:57:49,062 DEBUG [org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler] (InventoryManager.availability-1) Facet lock status for [Resource[id=-12, type=Queue, key=JMSDestination:Queue:queue3, name=queue3, version=?]], is-write-locked=[false], is-write-locked-by-current-thread=[false], read-locks=[0], waiting-for-lock-queue-size=[0]

2009-03-04 13:57:49,062 DEBUG [org.rhq.core.pc.inventory.AvailabilityExecutor] (InventoryManager.availability-1) Failed to collect availability on resource Resource[id=-12, type=Queue, key=JMSDestination:Queue:queue3, name=queue3, version=?] (call timed out)

2009-03-04 14:59:25,046 DEBUG [org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler] (InventoryManager.availability-1) Call to org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability() with args [] failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.jboss.profileservice.spi.NoSuchDeploymentException: Managed deployment: vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/queue3-service.xml not found.
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:215)
	at java.util.concurrent.FutureTask.get(FutureTask.java:85)
	at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:398)
	at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:381)
	at $Proxy225.getAvailability(Unknown Source)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:167)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:210)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:210)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.call(AvailabilityExecutor.java:106)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.run(AvailabilityExecutor.java:70)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:280)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:135)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:65)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:142)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:166)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
	at java.lang.Thread.run(Thread.java:595)
Caused by: java.lang.RuntimeException: org.jboss.profileservice.spi.NoSuchDeploymentException: Managed deployment: vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/queue3-service.xml not found.
	at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability(ManagedDeploymentComponent.java:136)
	at sun.reflect.GeneratedMethodAccessor323.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:585)
	at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:466)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
	at java.util.concurrent.FutureTask.run(FutureTask.java:123)
	... 3 more
Caused by: org.jboss.profileservice.spi.NoSuchDeploymentException: Managed deployment: vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/queue3-service.xml not found.
	at org.jboss.profileservice.management.ManagementViewImpl.getDeployment(ManagementViewImpl.java:710)
	at sun.reflect.GeneratedMethodAccessor316.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:585)
	at org.jboss.aop.Dispatcher.invoke(Dispatcher.java:121)
	at org.jboss.aspects.remoting.AOPRemotingInvocationHandler.invoke(AOPRemotingInvocationHandler.java:82)
	at org.jboss.profileservice.remoting.ProfileServiceInvocationHandler.invoke(ProfileServiceInvocationHandler.java:99)
	at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:908)
	at org.jboss.remoting.transport.local.LocalClientInvoker.invoke(LocalClientInvoker.java:106)
	at org.jboss.remoting.Client.invoke(Client.java:1708)
	at org.jboss.remoting.Client.invoke(Client.java:612)
	at org.jboss.aspects.remoting.InvokeRemoteInterceptor.invoke(InvokeRemoteInterceptor.java:60)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
	at org.jboss.aspects.remoting.MergeMetaDataInterceptor.invoke(MergeMetaDataInterceptor.java:74)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
	at org.jboss.aspects.security.SecurityClientInterceptor.invoke(SecurityClientInterceptor.java:65)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
	at AOPProxy$1.getDeployment(AOPProxy$1.java)
	at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getManagedDeployment(ManagedDeploymentComponent.java:338)
	at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability(ManagedDeploymentComponent.java:133)
	at sun.reflect.GeneratedMethodAccessor323.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:585)
	at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:466)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
	at java.util.concurrent.FutureTask.run(FutureTask.java:123)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
	at java.lang.Thread.run(Thread.java:595)
	at org.jboss.aspects.remoting.InvokeRemoteInterceptor.invoke(InvokeRemoteInterceptor.java:72)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
	at org.jboss.aspects.remoting.MergeMetaDataInterceptor.invoke(MergeMetaDataInterceptor.java:74)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
	at org.jboss.aspects.security.SecurityClientInterceptor.invoke(SecurityClientInterceptor.java:65)
	at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
	at AOPProxy$1.getDeployment(AOPProxy$1.java)
	at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getManagedDeployment(ManagedDeploymentComponent.java:338)
	at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability(ManagedDeploymentComponent.java:133)
	... 9 more
Comment 5 John Mazzitelli 2009-03-05 11:24:43 EST
Here's what I think.

We put the invocation call on a thread pool. Its a thread pool created by Executors.newCachedThreadPool so it is essentially unbounded but all idle threads are killed after 60 seconds. If the JBossAS call blocks indefinitely, those threads in the pool will never exit.

Now, we use futures to capture timed out invocations (that's what those debug messages are saying, "Call to  ... timed out. Interrupting the invocation thread... "). We call future.cancel(true) (whichs sends an interrupt to the thread):

      } catch (java.util.concurrent.TimeoutException e) {
         ...
          LOG.debug("Call to " + methodName + " with args " + methodArgs+ " timed out. Interrupting the invocation thread...");
         ...
         future.cancel(true);

If that calling thread doesn't honor or listen for this interrupt, that thread will never die and the thread pool will grow unbounded.

Our thread that is collecting availabilities continues on, but this thread calling the JBossAS API is still running and hung and there is nothing we can do about it, other than what we've already tried to do (that being, we interrupt it).

There are a couple things to consider:

1) have a thread pool per plugin - this would allow each plugin to not have to share pools that may have gone haywire. This however, doesn't really solve anything - because this rogue plugin will still spawn threads rapidly and degrade the VM - regardless of what the other plugins do

2) we could bound the pool - thus prohibiting the rogue plugin from being able to spawn too many threads. But this means its possible that this one rogue plugin could use up all the threads in the pool, thus starving the other resources/plugins that need to do things (and thus breaking everthing in the process since no plugins will be able to do anything).

3) We could combine 1 and 2 - providing each plugin a bounded thread pool. But a) do we really know the appropriate high-water mark for the max number of threads to allow? To high and we have the same problem as now, too low and we've artificially crippled the plugins in case they need more. And if we have N plugins, that means we'll potentially have N * max-threads-allowed threads in the system - which defeats the purpose of trying to limit the number of threads in the system (its conceivable that we'll have 10 or more plugins in a system - if our max allowed threads is 100, that means we'll have up to 1000 threads in the system potentially. You've seen problem with about half of those in your thread dump. So I'm not sure what improvement this would have.

In short, I do not see a really good and effective way to solve the problem of a rogue plugin deadlocking and causing threads to spawn wildly (short of fixing the plugin :).
Comment 6 John Mazzitelli 2009-03-05 11:55:41 EST
BTW: i've already thought about adding thread checks to the VM Health Check thread:

http://jira.rhq-project.org/browse/RHQ-1035?focusedCommentId=12834#action_12834

"3) (optional) implement a check for thread deadlocks - if a thread is deadlocked, reset agent "

http://support.rhq-project.org/display/RHQ/Design-AgentVMHealthCheck

"In the future, we could have this health check thread look for deadlocked threads and it could either a) attempt to kill those threads if possible or b) perform an agent shutdown/restart like it does for the critically low memory condition."

I did not implement this because of the problems we see in this case. Even if we have detected deadlocked threads (or even just an abnormal number of threads), we aren't assured at being able to kill them (even if we attempt to interrupt them or shutdown the plugin container) because there is no guarantee the threads will kill themselves. I haven't investigated if the deprecated "Thread.stop()" method can be used, would be appropriate and would work. But that is an option - analogous to a "kill -9" when a simple "kill" doesn't work.
Comment 7 Joseph Marques 2009-03-20 02:21:16 EDT
after reading mazz's thread, it doesn't look like there is a quick fix to this issue, and thus won't be released in time for 1.2 release, pushing out.
Comment 8 Corey Welton 2009-08-05 14:05:29 EDT
Push to  1.4
Comment 9 Red Hat Bugzilla 2009-11-10 15:45:18 EST
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1722
Imported an attachment (id=368627)
Imported an attachment (id=368628)
Comment 10 wes hayutin 2010-02-16 11:56:20 EST
Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs.

keyword:
new = Tracking + FutureFeature + SubBug
Comment 11 wes hayutin 2010-02-16 12:01:30 EST
making sure we're not missing any bugs in rhq_triage
Comment 12 Corey Welton 2010-08-30 13:28:36 EDT
mazz, which solution is preferential here?
Comment 13 John Mazzitelli 2010-08-30 13:53:13 EDT
(In reply to comment #12)
> mazz, which solution is preferential here?

Adding a thread-check-and-destroy mechanism to the VM Health Check thread. That thread already has the infrastructure in place to periodically check the health of the VM - all we have to do is add the code to actually check the threads in the VM and kill them if appropriate.
Comment 14 Jay Shaughnessy 2012-02-28 15:23:53 EST
Asking mazz for an update on this. Is there something we can do here or
should this potentially be closed?
Comment 15 John Mazzitelli 2012-02-28 15:31:26 EST
I know Lukas recently looked at some threading issues - he didn't run into this I don't think.  This problem may have gone away already.

We could attempt to play games as my previous suggestion mentions, but dont' know how easy/useful/effective it would be:

"Adding a thread-check-and-destroy mechanism to the VM Health Check thread. That
thread already has the infrastructure in place to periodically check the health
of the VM - all we have to do is add the code to actually check the threads in
the VM and kill them if appropriate."