Due to the locking problem described in https://jira.jboss.org/jira/browse/JBAS-6546, certain operations on the JBAS5 plugin (e.g. deleting a datasource) will block. When this happens, a thread dump of the JVM (taken through the embedded console, so it covers both the monitored JBAS instance and the plugin container) shows a large number of threads blocked in org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability(ManagedDeploymentComponent.java:133). I currently see over 300 such threads on my app server, and a new thread is added every few seconds. Regardless of whether our availability check hangs (which it appears to do), we should not be continually creating new threads.

In the logs I'm seeing things like:

2009-03-04 13:56:42,296 DEBUG [org.rhq.core.pc.inventory.AvailabilityExecutor] (InventoryManager.availability-1) Failed to collect availability on resource Resource[id=-21, type=JBoss Service (SAR), key=vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/jsr88-service.xml, name=jsr88-service.xml, version=?] (call timed out)
2009-03-04 13:56:42,296 DEBUG [org.jboss.remoting.InvokerRegistry] (ResourceContainer.invoker.daemon-223) decremented org.jboss.remoting.transport.local.LocalClientInvoker@77075d's count, current count 2
...
2009-03-04 14:28:57,562 DEBUG [org.rhq.core.pc.inventory.AvailabilityExecutor] (InventoryManager.availability-1) Failed to collect availability on resource Resource[id=-46, type=JBoss Service (SAR), key=vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/ejb2-timer-service.xml, name=ejb2-timer-service.xml, version=?] (call timed out)
2009-03-04 14:28:57,562 DEBUG [org.jboss.remoting.InvokerRegistry] (ResourceContainer.invoker.daemon-547) decremented org.jboss.remoting.transport.local.LocalClientInvoker@77075d's count, current count 323

There are 547 - 223 = 324 threads in the dump, so it looks like a new thread is created every time the availability check times out.
In the log snippet above, the ResourceContainer.invoker.daemon-547 thread is not mentioned anywhere earlier in the log.
blocking_thread_dump.txt contains the output of the ServerInfo thread-dump method.
Could be related to RHQ-1720
server.log.zip is the entire server (and embedded console) log over the period when this problem was seen.
More logs of interest:

2009-03-04 13:57:49,062 DEBUG [org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler] (InventoryManager.availability-1) Call to org.rhq.plugins.jbossas5.ManagedComponentComponent.getAvailability() with args [] timed out. Interrupting the invocation thread...
2009-03-04 13:57:49,062 DEBUG [org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler] (InventoryManager.availability-1) Facet lock status for [Resource[id=-12, type=Queue, key=JMSDestination:Queue:queue3, name=queue3, version=?]], is-write-locked=[false], is-write-locked-by-current-thread=[false], read-locks=[0], waiting-for-lock-queue-size=[0]
2009-03-04 13:57:49,062 DEBUG [org.rhq.core.pc.inventory.AvailabilityExecutor] (InventoryManager.availability-1) Failed to collect availability on resource Resource[id=-12, type=Queue, key=JMSDestination:Queue:queue3, name=queue3, version=?] (call timed out)
2009-03-04 14:59:25,046 DEBUG [org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler] (InventoryManager.availability-1) Call to org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability() with args [] failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.jboss.profileservice.spi.NoSuchDeploymentException: Managed deployment: vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/queue3-service.xml not found.
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:215)
    at java.util.concurrent.FutureTask.get(FutureTask.java:85)
    at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:398)
    at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:381)
    at $Proxy225.getAvailability(Unknown Source)
    at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:167)
    at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:210)
    at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:210)
    at org.rhq.core.pc.inventory.AvailabilityExecutor.call(AvailabilityExecutor.java:106)
    at org.rhq.core.pc.inventory.AvailabilityExecutor.run(AvailabilityExecutor.java:70)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:280)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:135)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:65)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:142)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:166)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
    at java.lang.Thread.run(Thread.java:595)
Caused by: java.lang.RuntimeException: org.jboss.profileservice.spi.NoSuchDeploymentException: Managed deployment: vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/queue3-service.xml not found.
    at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability(ManagedDeploymentComponent.java:136)
    at sun.reflect.GeneratedMethodAccessor323.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:466)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    ... 3 more
Caused by: org.jboss.profileservice.spi.NoSuchDeploymentException: Managed deployment: vfsfile:/C:/usr/apps/jboss/jboss-5.1.0.Beta1/server/default/deploy/queue3-service.xml not found.
    at org.jboss.profileservice.management.ManagementViewImpl.getDeployment(ManagementViewImpl.java:710)
    at sun.reflect.GeneratedMethodAccessor316.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.jboss.aop.Dispatcher.invoke(Dispatcher.java:121)
    at org.jboss.aspects.remoting.AOPRemotingInvocationHandler.invoke(AOPRemotingInvocationHandler.java:82)
    at org.jboss.profileservice.remoting.ProfileServiceInvocationHandler.invoke(ProfileServiceInvocationHandler.java:99)
    at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:908)
    at org.jboss.remoting.transport.local.LocalClientInvoker.invoke(LocalClientInvoker.java:106)
    at org.jboss.remoting.Client.invoke(Client.java:1708)
    at org.jboss.remoting.Client.invoke(Client.java:612)
    at org.jboss.aspects.remoting.InvokeRemoteInterceptor.invoke(InvokeRemoteInterceptor.java:60)
    at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
    at org.jboss.aspects.remoting.MergeMetaDataInterceptor.invoke(MergeMetaDataInterceptor.java:74)
    at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
    at org.jboss.aspects.security.SecurityClientInterceptor.invoke(SecurityClientInterceptor.java:65)
    at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
    at AOPProxy$1.getDeployment(AOPProxy$1.java)
    at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getManagedDeployment(ManagedDeploymentComponent.java:338)
    at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability(ManagedDeploymentComponent.java:133)
    at sun.reflect.GeneratedMethodAccessor323.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:466)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
    at java.lang.Thread.run(Thread.java:595)
    at org.jboss.aspects.remoting.InvokeRemoteInterceptor.invoke(InvokeRemoteInterceptor.java:72)
    at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
    at org.jboss.aspects.remoting.MergeMetaDataInterceptor.invoke(MergeMetaDataInterceptor.java:74)
    at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
    at org.jboss.aspects.security.SecurityClientInterceptor.invoke(SecurityClientInterceptor.java:65)
    at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:102)
    at AOPProxy$1.getDeployment(AOPProxy$1.java)
    at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getManagedDeployment(ManagedDeploymentComponent.java:338)
    at org.rhq.plugins.jbossas5.ManagedDeploymentComponent.getAvailability(ManagedDeploymentComponent.java:133)
    ... 9 more
Here's what I think. We put the invocation call on a thread pool created by Executors.newCachedThreadPool(), so the pool is essentially unbounded, though idle threads are killed after 60 seconds. If the JBossAS call blocks indefinitely, those pool threads will never exit.

We use futures to detect timed-out invocations (that's what those debug messages are saying: "Call to ... timed out. Interrupting the invocation thread..."). We call future.cancel(true), which sends an interrupt to the thread:

    } catch (java.util.concurrent.TimeoutException e) {
        ...
        LOG.debug("Call to " + methodName + " with args " + methodArgs
            + " timed out. Interrupting the invocation thread...");
        ...
        future.cancel(true);

If the invocation thread doesn't honor or listen for this interrupt, it will never die and the thread pool will grow without bound. Our thread that is collecting availabilities continues on, but the thread calling the JBossAS API is still running and hung, and there is nothing more we can do about it beyond what we already do (interrupt it).

There are a few things to consider:

1) Have a thread pool per plugin. This would keep each plugin from sharing pools that may have gone haywire. However, it doesn't really solve anything - the rogue plugin will still spawn threads rapidly and degrade the VM, regardless of what the other plugins do.

2) Bound the pool, prohibiting the rogue plugin from spawning too many threads. But then it's possible that this one rogue plugin could use up all the threads in the pool, starving the other resources/plugins that need to do things (and thus breaking everything, since no plugin would be able to do anything).

3) Combine 1 and 2, giving each plugin its own bounded thread pool. But a) do we really know the appropriate high-water mark for the maximum number of threads to allow? Too high and we have the same problem as now; too low and we've artificially crippled plugins that legitimately need more. And b) if we have N plugins, we'll potentially have N * max-threads-allowed threads in the system, which defeats the purpose of trying to limit the number of threads. It's conceivable that we'll have 10 or more plugins in a system; if the per-plugin maximum is 100, that's up to 1000 threads in the system - you've seen problems with about half of those in your thread dump. So I'm not sure what improvement this would have.

In short, I do not see a really good and effective way to solve the problem of a rogue plugin deadlocking and causing threads to spawn wildly (short of fixing the plugin :).
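To illustrate the failure mode described above, here is a minimal self-contained sketch (the class and method names are illustrative, not the actual RHQ code). A task that swallows the interrupt delivered by future.cancel(true) keeps occupying a pool thread even after the caller has timed out and "cancelled" it:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class StuckThreadDemo {
    // Hypothetical stand-in for a hung profile-service call: it swallows
    // the interrupt that Future.cancel(true) delivers, so the worker
    // thread never exits on its own.
    static void uninterruptibleCall(AtomicBoolean stop) {
        while (!stop.get()) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                // interrupt swallowed -- exactly the failure mode above
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newCachedThreadPool();
        AtomicBoolean stop = new AtomicBoolean(false);

        Future<?> future = pool.submit(() -> uninterruptibleCall(stop));
        try {
            // analogous to the availability-check timeout
            future.get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the invocation thread...
        }

        Thread.sleep(100);
        // ...but the worker is still running and still occupies a pool slot;
        // each subsequent timed-out call would leak another thread like it
        System.out.println("still active: " + pool.getActiveCount());

        stop.set(true); // let the demo exit cleanly
        pool.shutdown();
    }
}
```

The printed active count stays at 1 even though the future was cancelled, which is why a cached (unbounded) pool accumulates one stuck thread per timed-out availability check.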
BTW: I've already thought about adding thread checks to the VM Health Check thread:

http://jira.rhq-project.org/browse/RHQ-1035?focusedCommentId=12834#action_12834
"3) (optional) implement a check for thread deadlocks - if a thread is deadlocked, reset agent"

http://support.rhq-project.org/display/RHQ/Design-AgentVMHealthCheck
"In the future, we could have this health check thread look for deadlocked threads and it could either a) attempt to kill those threads if possible or b) perform an agent shutdown/restart like it does for the critically low memory condition."

I did not implement this because of the problems we see in this case. Even if we detect deadlocked threads (or just an abnormal number of threads), we aren't assured of being able to kill them (even if we attempt to interrupt them or shut down the plugin container), because there is no guarantee the threads will kill themselves. I haven't investigated whether the deprecated Thread.stop() method could be used, would be appropriate, and would actually work - but that is an option, analogous to a "kill -9" when a simple "kill" doesn't work.
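For reference, the detection half of that idea is straightforward with the JDK's management API; it's the "kill" half that has no safe answer. A rough sketch of what the VM Health Check could run periodically (class and method names are illustrative, not existing RHQ code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class VmHealthThreadCheck {
    // Returns the ids of deadlocked threads, or an empty array if none.
    // Detecting them is easy; actually killing them is the unsolved part,
    // since interrupt() is only advisory and Thread.stop() is deprecated
    // and unsafe.
    static long[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // returns null if none
        return (ids != null) ? ids : new long[0];
    }

    public static void main(String[] args) {
        long[] ids = findDeadlocks();
        if (ids.length == 0) {
            System.out.println("no deadlocked threads");
        } else {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : mx.getThreadInfo(ids)) {
                System.out.println("deadlocked: " + info.getThreadName());
            }
        }
    }
}
```

Note that findDeadlockedThreads() only catches true monitor/ownable-synchronizer deadlocks; the hung threads in this bug are blocked inside the AS, so a health check would also need a heuristic such as "total thread count above a threshold."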
After reading mazz's analysis, it doesn't look like there is a quick fix for this issue, so it won't make the 1.2 release. Pushing out.
Push to 1.4
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1722 Imported an attachment (id=368627) Imported an attachment (id=368628)
Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs. keyword: new = Tracking + FutureFeature + SubBug
making sure we're not missing any bugs in rhq_triage
mazz, which solution is preferential here?
(In reply to comment #12)
> mazz, which solution is preferential here?

Adding a thread-check-and-destroy mechanism to the VM Health Check thread. That thread already has the infrastructure in place to periodically check the health of the VM - all we have to do is add the code to actually check the threads in the VM and kill them if appropriate.
Asking mazz for an update on this. Is there something we can do here or should this potentially be closed?
I know Lukas recently looked at some threading issues - I don't think he ran into this one, so this problem may have gone away already. We could attempt to play games as my previous suggestion mentions, but I don't know how easy/useful/effective it would be: "Adding a thread-check-and-destroy mechanism to the VM Health Check thread. That thread already has the infrastructure in place to periodically check the health of the VM - all we have to do is add the code to actually check the threads in the VM and kill them if appropriate."