Bug 794489

Summary: Outrageous number of threads created by agent
Product: [Other] RHQ Project Reporter: Lukas Krejci <lkrejci>
Component: AgentAssignee: Lukas Krejci <lkrejci>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: unspecified Docs Contact:
Priority: urgent    
Version: 4.3CC: ccrouch, hrupp
Target Milestone: ---   
Target Release: RHQ 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 795835 (view as bug list) Environment:
Last Closed: 2013-09-01 10:09:30 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 782579, 795835    
Attachments:
Description Flags
thread dump
none
excerpt of agent.log none

Description Lukas Krejci 2012-02-16 22:56:41 UTC
Created attachment 563772 [details]
thread dump

Description of problem:

After having the agent running for a couple of hours, it currently has 531 active threads and consumes 100% CPU all the time.

The problem seems to be the discovery of EmbeddedJMXServerDiscoveryComponent which fills the agent.log with errors.

Version-Release number of selected component (if applicable):
4.3.0-SNAPSHOT commit hash 11d405f

How reproducible:
always

Steps to Reproduce:
1. Start the agent
2. wait
  
Actual results:
agent eats up 100% CPU after a while, creates huge number of threads

Expected results:
normal operation

Additional info:

Comment 1 Lukas Krejci 2012-02-16 22:59:12 UTC
Created attachment 563773 [details]
excerpt of agent.log

Adding an excerpt from the agent.log showing the errors being logged - notice the rapid frequency with which these are logged.

Comment 2 Lukas Krejci 2012-02-16 23:01:58 UTC
It seems to me that this might be caused by the JVM resource under a Tomcat resource that is reported down. The tomcat instance runs as the same user as the agent and was started using bin/startup.sh.

Comment 3 Lukas Krejci 2012-02-16 23:52:01 UTC
Additional info...

This only happens during the detailed discovery so it's not that bad as I originally thought because the CPU intensiveness of detailed discovery is a known problem.

But still. What do we need 576 threads for?

Comment 4 Charles Crouch 2012-02-17 16:41:03 UTC
We need to investigate what is going on here.

Comment 5 Lukas Krejci 2012-02-20 15:38:32 UTC
This is caused by ResourceContext.getNativeProcess() method that tries to run discovery if the process info is not available - this is correct behavior because this way we are able to transparently span process restarts but poor concurrency handling in this method is causing it to run the discovery many more times than necessary (each such discovery spawning a thread on its own).

Note also that this excessive discovery invocation was introduced in May 2011 by fix to the bug 702691 & bug 700461. This is therefore present in both RHQ4 and RHQ3 codebases.

Comment 6 Lukas Krejci 2012-02-20 20:37:50 UTC
In addition to non-optimal concurrency handling, the potential discovery happening inside getNativeProcess() would get called with improperly configured discovery context.

This bad discovery context was the actual cause of the super-high number of threads created.

The discovery context passed to the discovery method in the getNativeProcess() would pass the resource context of the current resource instead of the resource context of the parent resource. The EmbeddedJMXServerDiscoveryComponent then tried to called discoveryContext.getParentResourceContext().getNativeProcess(). This would call the getNativeProcess of the very same resource context as before. That would again spawn another discovery which would again call the getNativeProcess() on the very same resource context instead of the parent as it should, etc, ad infinitum. The chain gets only broken by the interruption of these threads due to their timeout.

Comment 7 Lukas Krejci 2012-02-21 12:54:36 UTC
Commit 21ce96a implements the minimal changes needed to get rid of the high number of threads being created. Commit 7533cbc further optimizes the amount of needed discoveries (i.e. the number of threads and more importantly the amount of work).

master http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=7533cbccc0d91f2307b6ab858d119a335d42d1ed
Author: Lukas Krejci <lkrejci>
Date:   Tue Feb 21 13:43:19 2012 +0100

    [BZ 794489] - Minimize the number of executed discoveries by sharing
    the discovery results among sibling resources during getNativeProcess()
    call.

master http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=21ce96ac0da01f32f3c57acec608523176db2ceb
Author: Lukas Krejci <lkrejci>
Date:   Tue Feb 21 13:23:47 2012 +0100

    [BZ 794489] - make sure to use the parent resource context when creating
    the discovery context that should rediscover the same resource as
    the resource context represents.

Comment 8 Lukas Krejci 2012-02-21 13:00:02 UTC
note also that the above fix not only reduces the number of threads to a bearable number but also actually makes the getNativeProcess() work correctly. The reason why this wasn't discovered before is because the testing on bug 702691 was done using the Apache plugin which wasn't affected by that error. Tomcat on the other hand is.

Steps to test:
1) Inventory a tomcat instance into RHQ
2) Stop the tomcat instance
3) run discovery -f on the corresponding agent

The number of threads shouldn't raise drastically during the detailed discovery.

Comment 9 Lukas Krejci 2012-02-21 15:53:58 UTC
This commit fixes the unit tests that broke due to the changed signature of ResourceContext constructor.

This should have been part of the commit 21ce96ac0da01f32f3c57acec608523176db2ceb and should be cherry-picked over with it to any other branch if need be.

master http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=48262b1ffcbfc2b4e1e2c80bd0858ca54942aae4
Author: Lukas Krejci <lkrejci>
Date:   Tue Feb 21 16:51:15 2012 +0100

    [BZ 794489] - Fixing the unit tests to work with the changed signature of
    ResourceContext constructor.

Comment 10 Lukas Krejci 2012-02-22 14:41:07 UTC
Setting this back to assigned. We need to come up with some unit tests for this...

Comment 11 Lukas Krejci 2012-03-16 16:42:01 UTC
master http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h= c9c2b7419aa05a39a1dbdb933d292c0e4abf1ea8
Author: Lukas Krejci <lkrejci>
Date:   Fri Mar 16 17:36:26 2012 +0100

    [BZ 794489] - Fixed the corner case of incorrectly reported processinfo
    if there were no more processes found for a resource type.
    
    Added unit tests to check retrieval of native process info.

Comment 12 Lukas Krejci 2012-03-16 16:44:30 UTC
To test, do the following:

1) Inventory a Tomcat resource
2) Run full discovery
3) Check that the thread count of the RHQ agent process doesn't increase drastically during the discovery
4) Stop the tomcat process
5) Run full discovery again
6) Check the thread count during the discovery - no big increase (in hundreds) should occur.

Comment 13 Heiko W. Rupp 2013-09-01 10:09:30 UTC
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.