Bug 795835 - Outrageous number of threads created by agent
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent
Version: 4.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: unspecified
Target Milestone: ---
Target Release: JON 3.0.1
Assignee: RHQ Project Maintainer
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On: 794489
Blocks: jon310-sprint11, rhq44-sprint11
 
Reported: 2012-02-21 15:55 UTC by Lukas Krejci
Modified: 2012-02-27 22:14 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 794489
Environment:
Last Closed: 2012-02-27 22:14:41 UTC
Embargoed:



Description Lukas Krejci 2012-02-21 15:55:00 UTC
+++ This bug was initially created as a clone of Bug #794489 +++

Created attachment 563772 [details]
thread dump

Description of problem:

After running for a couple of hours, the agent has 531 active threads and constantly consumes 100% CPU.

The problem seems to be the discovery of EmbeddedJMXServerDiscoveryComponent which fills the agent.log with errors.

Version-Release number of selected component (if applicable):
4.3.0-SNAPSHOT commit hash 11d405f

How reproducible:
always

Steps to Reproduce:
1. Start the agent
2. wait
  
Actual results:
agent eats up 100% CPU after a while, creates huge number of threads

Expected results:
normal operation

Additional info:

--- Additional comment from lkrejci on 2012-02-16 17:59:12 EST ---

Created attachment 563773 [details]
excerpt of agent.log

Adding an excerpt from the agent.log showing the errors being logged - notice the rapid frequency with which these are logged.

--- Additional comment from lkrejci on 2012-02-16 18:01:58 EST ---

It seems to me that this might be caused by the JVM resource under a Tomcat resource that is reported down. The tomcat instance runs as the same user as the agent and was started using bin/startup.sh.

--- Additional comment from lkrejci on 2012-02-16 18:52:01 EST ---

Additional info...

This only happens during the detailed discovery, so it's not as bad as I originally thought, because the CPU intensiveness of detailed discovery is a known problem.

But still. What do we need 576 threads for?

--- Additional comment from ccrouch on 2012-02-17 11:41:03 EST ---

We need to investigate what is going on here.

--- Additional comment from lkrejci on 2012-02-20 10:38:32 EST ---

This is caused by the ResourceContext.getNativeProcess() method, which tries to run discovery if the process info is not available. That is correct behavior, because it lets us transparently span process restarts. However, poor concurrency handling in this method causes it to run the discovery many more times than necessary, with each such discovery spawning a thread of its own.
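To illustrate the concurrency side of this, here is a minimal sketch of how concurrent callers could be made to share one rediscovery instead of each starting their own. It is not the RHQ code: CachedProcessLookup, its Supplier-based discovery hook and invalidate() are hypothetical names made up for the example.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the part of a resource context that caches
// process info; not the actual RHQ ResourceContext.
final class CachedProcessLookup<P> {
    private final Supplier<P> discovery; // the expensive rediscovery (assumed)
    private final Object lock = new Object();
    private volatile P cached;

    CachedProcessLookup(Supplier<P> discovery) {
        this.discovery = discovery;
    }

    P get() {
        P current = cached;
        if (current != null) {
            return current; // fast path: process info already known
        }
        synchronized (lock) {
            // Re-check under the lock: a caller that waited here reuses the
            // result found by the caller that ran the discovery.
            if (cached == null) {
                cached = discovery.get();
            }
            return cached;
        }
    }

    // Call when the process is known to be gone (e.g. after a restart).
    void invalidate() {
        cached = null;
    }
}
```

If the discovery finds nothing (the process is down), callers in this sketch still retry, but only one at a time rather than all at once.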

Note also that this excessive discovery invocation was introduced in May 2011 by the fix for bug 702691 and bug 700461. It is therefore present in both the RHQ4 and JON3 codebases.

--- Additional comment from lkrejci on 2012-02-20 15:37:50 EST ---

In addition to the non-optimal concurrency handling, the discovery that may run inside getNativeProcess() was called with an improperly configured discovery context.

This bad discovery context was the actual cause of the super-high number of threads created.

The discovery context passed to the discovery method in getNativeProcess() carried the resource context of the current resource instead of the resource context of the parent resource. The EmbeddedJMXServerDiscoveryComponent then tried to call discoveryContext.getParentResourceContext().getNativeProcess(). This called getNativeProcess() on the very same resource context as before. That again spawned another discovery, which again called getNativeProcess() on the very same resource context instead of the parent as it should, and so on ad infinitum. The chain is only broken when these threads are interrupted due to their timeout.
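The loop described above can be modelled with a small, single-threaded sketch. Everything in it is hypothetical (ResourceNode, passSelfAsParent, DEPTH_LIMIT); the real agent runs each discovery in its own thread and the chain ends on a timeout, which the depth limit merely stands in for.

```java
// Toy model of the faulty and the corrected wiring; not the RHQ classes.
final class ResourceNode {
    static final int DEPTH_LIMIT = 25; // stand-in for the thread timeout

    final String name;
    final ResourceNode parent;      // null for the top-level resource
    final boolean passSelfAsParent; // true reproduces the faulty wiring
    int discoveries;                // rediscoveries triggered on this node

    ResourceNode(String name, ResourceNode parent, boolean passSelfAsParent) {
        this.name = name;
        this.parent = parent;
        this.passSelfAsParent = passSelfAsParent;
    }

    // The process is never found in this model (the server is stopped),
    // so every call falls through to a rediscovery.
    String getNativeProcess(int depth) {
        if (depth >= DEPTH_LIMIT) {
            return null; // the "timeout" breaks the chain
        }
        discoveries++;
        ResourceNode parentForDiscovery = passSelfAsParent ? this : parent;
        return discover(parentForDiscovery, depth + 1);
    }

    // Models a child-type discovery that asks its parent context for the
    // parent's process, as the embedded JMX server discovery did.
    private static String discover(ResourceNode parentContext, int depth) {
        if (parentContext == null) {
            return null; // top level: process scan finds nothing
        }
        return parentContext.getNativeProcess(depth);
    }
}
```

With the faulty wiring, one getNativeProcess(0) call on a child node triggers DEPTH_LIMIT discoveries on that same node. With the parent passed correctly, the child and the parent each run a single discovery and the call returns.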

--- Additional comment from lkrejci on 2012-02-21 07:54:36 EST ---

Commit 21ce96a implements the minimal changes needed to get rid of the high number of threads being created. Commit 7533cbc further reduces the number of discoveries needed (i.e. the number of threads and, more importantly, the amount of work).

master http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=7533cbccc0d91f2307b6ab858d119a335d42d1ed
Author: Lukas Krejci <lkrejci>
Date:   Tue Feb 21 13:43:19 2012 +0100

    [BZ 794489] - Minimize the number of executed discoveries by sharing
    the discovery results among sibling resources during getNativeProcess()
    call.

master http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=21ce96ac0da01f32f3c57acec608523176db2ceb
Author: Lukas Krejci <lkrejci>
Date:   Tue Feb 21 13:23:47 2012 +0100

    [BZ 794489] - make sure to use the parent resource context when creating
    the discovery context that should rediscover the same resource as
    the resource context represents.

--- Additional comment from lkrejci on 2012-02-21 08:00:02 EST ---

Note also that the above fix not only reduces the number of threads to a bearable number but also makes getNativeProcess() work correctly. This wasn't discovered before because the testing on bug 702691 was done using the Apache plugin, which wasn't affected by the error. Tomcat, on the other hand, is.

Steps to test:
1) Inventory a Tomcat instance into RHQ
2) Stop the Tomcat instance
3) Run discovery -f on the corresponding agent

The number of threads shouldn't rise drastically during the detailed discovery.
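One way to put a number on "rise drastically" is to read the live thread count from the JVM's standard thread MXBean before and after the discovery. The sketch below only measures the JVM it runs in, so it assumes the code is executed inside the agent's JVM or adapted to a remote JMX connection. ThreadCountProbe is a hypothetical helper, not part of RHQ.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper around the standard java.lang.management API.
final class ThreadCountProbe {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    // Number of live threads (daemon and non-daemon) in this JVM.
    static int liveThreads() {
        return THREADS.getThreadCount();
    }

    // Runs the action and reports how much the live thread count changed.
    // Threads that start and finish within the action are not counted.
    static int growthDuring(Runnable action) {
        int before = liveThreads();
        action.run();
        return liveThreads() - before;
    }
}
```

A thread dump, like the one attached to this bug, shows the same picture in more detail.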

--- Additional comment from lkrejci on 2012-02-21 10:53:58 EST ---

This commit fixes the unit tests that broke due to the changed signature of ResourceContext constructor.

This should have been part of commit 21ce96ac0da01f32f3c57acec608523176db2ceb and should be cherry-picked together with it to any other branch if need be.

master http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=48262b1ffcbfc2b4e1e2c80bd0858ca54942aae4
Author: Lukas Krejci <lkrejci>
Date:   Tue Feb 21 16:51:15 2012 +0100

    [BZ 794489] - Fixing the unit tests to work with the changed signature of
    ResourceContext constructor.

Comment 1 Mike Foley 2012-02-27 15:48:08 UTC
triage to JON 3.1 mfoley,crouch

