Bug 812968

Summary: High Agent CPU utilization after enabling certain Metric Collection Templates
Product: [Other] RHQ Project
Reporter: Charles Crouch <ccrouch>
Component: Agent
Assignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.0.0
CC: ahovsepy, dvanbale, hbrock, hrupp, maurizio.antillon
Target Milestone: ---   
Target Release: RHQ 4.4.0   
Hardware: All   
OS: All   
Whiteboard:
Doc Type: Bug Fix
Clone Of: 811696
Last Closed: 2013-08-31 09:55:53 UTC
Type: Bug
Bug Depends On: 811696, 813917    
Bug Blocks: 782579    

Description Charles Crouch 2012-04-16 17:01:12 UTC
+++ This bug was initially created as a clone of Bug #811696 +++

Description of problem: If certain Metric Collection Templates are enabled for Tomcat Web Application (WAR) -- e.g. "Currently Active Sessions", "Processing Errors" or "Requests served" -- the RHQ agent will start displaying very high CPU utilization soon after. Running top on the agent's VM shows "Cpu(s)" at 51.7%us and the RHQ Agent process has a cpu value of 102-103% (even higher with more than one metric template enabled).

This appears to apply only to an agent that is monitoring a Tomcat/EWS server. An agent on a different server with no Tomcat/EWS instances didn't seem to have the same problem. It also doesn't seem to apply to the "Processing Errors per Minute" and "Requests served per Minute" templates: I have both enabled with a collection interval of 20 minutes, and they don't seem to be causing problems.


Version-Release number of selected component (if applicable): JON 3.0.0.GA, JON Agent 4.2.0.JON300.GA, RHEL 5.5


How reproducible: Always


Steps to Reproduce:
1. Install/start JON server and an agent.
2. Install/start Tomcat/EWS (Tomcat 6)
3. Import Agent and Tomcat/EWS into JON server inventory
4. In the JON server UI, navigate to Administration->Metric Collection Templates->Tomcat Server->Tomcat Virtual Host->Tomcat Web Application (WAR)
5. Enable one of the problem metric templates (e.g. "Currently Active Sessions", "Processing Errors" or "Requests served"). The collection interval can be 10, 20 or 40 minutes; the results should be the same.
6. Navigate to the agent in the JON server UI and restart it.
7. Run top on the agent's server/VM
8. Within a minute or two, the agent's CPU utilization should increase substantially.
  
Actual results: RHQ Agent creates very high CPU load


Expected results: RHQ Agent should continue to create reasonable CPU load


Additional info: /proc/cpuinfo on the agent's VM reports two CPUs of type Intel Xeon X5690 @ 3.47GHz.
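
The two figures reported by top are consistent with each other. Assuming top is running in its default (Irix) mode, a process's %CPU is expressed relative to a single CPU, so an agent at roughly 103% on a two-CPU VM accounts for about half of the machine's total capacity, which matches the 51.7%us summary line. A minimal sketch of that arithmetic (the helper name is hypothetical):

```python
def process_share_of_total(process_cpu_percent: float, cpu_count: int) -> float:
    """Convert a per-CPU process figure (as shown by top in Irix mode)
    into a share of the whole machine's CPU capacity."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return process_cpu_percent / cpu_count


# Figures from this report: agent process at ~103% on a VM with two CPUs.
share = process_share_of_total(103.0, 2)
print(f"{share:.1f}% of total CPU")  # prints "51.5% of total CPU"
```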

--- Additional comment from dvanbale on 2012-04-13 19:24:53 EDT ---

I see the same results on a Lenovo laptop with a quad-core i7 (/proc/cpuinfo shows four of the following: Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GHz).

--- Additional comment from ccrouch on 2012-04-16 12:39:27 EDT ---

Hi David
Thanks for the bug report. Can you supply some more info:

-Full version of EWS being monitored
-Java version running JON agent and Java version running EWS
-Can you attach a copy of the inventory.xml from underneath the JON agent install
-How long does the high CPU load last?

Comment 1 Jay Shaughnessy 2012-04-19 03:26:38 UTC
Relevant master commits

commit 3cdb62feb318ec26bac53b87bf32c90915b088f6
commit 98ab742c79b8f878b160269aa7e4607136f6a0dc
commit 80208e99755069d9b004ab2eb3bf8095cdb6b35a

The problem is actually independent of plugin or resource type.  It is a
general problem with enable/disable of metrics at the template level.
- The database is not compromised
- It occurs only when applying the changes to existing inventory
- It affects only running agents that are updated as a result of the
  changes.
- The server code leaks bad collection interval values, used to indicate
  various enable/disable scenarios, to the agent update.

This has been corrected.  The changes are limited to the server jar.
There are additions to the MeasurementScheduleRemote and Local, and
certain methods have been deprecated.  Existing CLI scripts should be
OK, and I've added better validation of intervals set that way.
However, scripts using the deprecated methods should move to the new
methods after the next upgrade.
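
As an illustration of the kind of interval validation described above, here is a minimal sketch in Python. It is not the RHQ server code: the class, the function names and the 30-second lower bound are all assumptions made for the example. The point is only that values such as 0 or -1 are rejected before an update would reach an agent.

```python
from dataclasses import dataclass
from typing import List

# Assumed lower bound for illustration only; the report does not state one.
MIN_INTERVAL_MS = 30_000


@dataclass(frozen=True)
class ScheduleUpdate:
    """Hypothetical shape of one schedule change destined for an agent."""
    schedule_id: int
    enabled: bool
    interval_ms: int


def validate_interval(interval_ms: int) -> int:
    """Reject sentinel-like values (0, -1) and anything below the assumed minimum."""
    if interval_ms < MIN_INTERVAL_MS:
        raise ValueError(f"invalid collection interval: {interval_ms} ms")
    return interval_ms


def build_agent_updates(updates: List[ScheduleUpdate]) -> List[ScheduleUpdate]:
    """Validate every update before it would be sent to an agent."""
    for update in updates:
        validate_interval(update.interval_ms)
    return list(updates)
```

In this sketch the enabled/disabled state travels as its own flag, so no special interval value is needed to encode it.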

Agents do not need to be updated.  However, agents suffering from this
problem should be re-synced, or restarted with --purge.

Test Notes:
To reproduce, I used the AS4 WAR type and, via the GUI, enabled a metric
that is disabled out of the box, without changing the interval.

Again, the type is not relevant; there is a general issue with template metric
manipulation.  Various enable/disable and interval updates should be tried from
the template level.  For good measure, group and resource level should be
sanity checked, although the code path is different.

The Agent prompt command:

  > schedules <resource-id> 

is very useful for looking at the intervals defined for a resource.  You
should never see 0 or -1 in this list.
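
The check itself is simple enough to sketch. The example below assumes the intervals have already been copied by hand from the command's output into a mapping of metric name to interval; the output format of the schedules command is not shown in this report, so no parsing is attempted. The function name, the millisecond unit and the sample values are made up for illustration.

```python
from typing import Dict, List


def find_bad_intervals(intervals_ms: Dict[str, int]) -> List[str]:
    """Return the metric names whose interval is 0 or negative, sorted by name."""
    return sorted(name for name, value in intervals_ms.items() if value <= 0)


# Metric names come from this report; the interval values are invented examples.
observed = {
    "Currently Active Sessions": 0,
    "Processing Errors": -1,
    "Requests served": 600000,
}
print(find_bad_intervals(observed))
```

Any name returned here would indicate an agent that still holds a bad schedule and needs the re-sync or restart described above.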

Comment 2 Mike Foley 2012-04-24 15:23:14 UTC
Countermeasure: Add TCMS Testcase per 4/24/2011 dev/support call.

https://engineering.redhat.com/trac/jon/ticket/116

Comment 3 Armine Hovsepyan 2012-09-05 13:37:06 UTC
The bug is verified. 

High CPU usage no longer occurs or persists after metric collection changes. However, after an agent start or restart, the agent uses about 90% CPU for roughly 10-20 seconds and then settles down to 0.4-1.3%.


An additional task has been created: https://engineering.redhat.com/trac/jon/ticket/289

The task will serve for later investigation and performance testing.