Bug 693232

Summary:

RHQ Server has a slow memory leak

Product:

[Other] RHQ Project

Reporter:

Steve Millidge <smillidge>

Component:

Core Server

Assignee:

Ian Springer <ian.springer>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Corey Welton <cwelton>

Severity:

high

Docs Contact:

Priority:

high

Version:

3.0.1

CC:

ccrouch, hrupp, ian.springer, loleary, mazz, mfoley

Target Milestone:

---

Target Release:

JON 3.0.0

Hardware:

All

OS:

All

Whiteboard:

Fixed In Version:

4.0

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

704339

Environment:

Last Closed:

Type:

Bug

Bug Blocks:

678340, 703268, 704339, 751090

Attachments:

Description	Flags
documenting the verification ... image shows thread count stable as i cycle the rhq-agent	none
This file shows 3 slsbs still leaking.	none
Image of final verification.	none

Description Steve Millidge 2011-04-03 18:56:17 UTC

Description of problem:

The RHQ server has a slow memory leak: EJB stateless session bean instances remain in the pool, and their count slowly increases. Our production RHQ server ran out of memory after 6 months of continuous operation. A heap dump analyzed with Memory Analyzer Toolkit showed the following session beans consuming the following amounts of memory:

ServerManagerBean - 803700 instances, 475 MB
CloudManagerBean - 803243 instances, 462 MB
CacheConsistencyManagerBean - 401829 instances, 228 MB
StatusManagerBean - 401704 instances, 228 MB
SystemManagerBean - 202163 instances, 168 MB

RHQ's monitoring of the heap size of its own JVM showed a straight-line decline in free heap over the 6 months.

Version-Release number of selected component (if applicable):

3.0.0

How reproducible:

Run the RHQ server for 6 months

Steps to Reproduce:
1. Configure the RHQ server
2. Configure a number of agents
3. Leave running for a long period of time (months)
  
Actual results:
Free Heap decreases over time

Expected results:
Free Heap remains constant

Additional info:

Session beans are configured to use the infinite pool. They should be changed to use the strict max pool.

Comment 1 Charles Crouch 2011-04-04 00:21:17 UTC

Sounds like a potentially easy fix; we just need to investigate any other repercussions from changing the pooling strategy.

Comment 2 Ian Springer 2011-04-05 21:34:31 UTC

The following forum post discusses this issue:

http://community.jboss.org/message/586260

Based on that, it sounds like we might want to use the strict pool only for MDBs and EJB timers, since those are the only EJB methods that get called within unbounded thread pools. Presumably, in all other cases, we should stick with the infinite thread pool, since it will be more performant (we should verify this with Carlo).

Comment 3 Ian Springer 2011-04-05 22:12:34 UTC

https://issues.jboss.org/browse/EJBTHREE-1330 describes the issue with EJB timers and infinite thread pools. 3 out of 5 of the session beans Steve lists in the description as consuming the most heap have EJB timer methods (methods annotated with @Timeout). Specifically:

CacheConsistencyManagerBean.handleHeartbeatTimer()  // called every 30s
ServerManagerBean.handleHeartbeatTimer()            // called every 30s
SystemManagerBean.reloadConfigCache()               // called every 60s

So just these three methods will result in 4+ new session bean instances getting created (and never destroyed) per minute. Additional session bean instances will get created by any other EJB calls these 3 methods make, and they do indeed call methods in CloudManagerBean and StatusManagerBean...

So I think annotating these 5 session bean classes with:

@org.jboss.annotation.ejb.PoolClass (value=org.jboss.ejb3.StrictMaxPool.class, maxSize=30, timeout=9223372036854775807L)

may solve 90% of the heap leakage.

Note, I have no idea what good values for maxSize and timeout would be (the values above are the defaults).
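To illustrate the mechanism, here is a minimal toy model in plain Java. It is not the JBoss pool code: both pool classes are hypothetical, and it assumes that every timer firing arrives on a brand-new thread and that a thread-local style pool keeps a strong reference to every instance it creates.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model only: these classes are hypothetical, not the JBoss pool implementations. */
final class PoolGrowthSketch {

    /** One instance per calling thread; instances stay referenced after the thread ends. */
    static final class ThreadLocalStylePool {
        private final Queue<Object> created = new ConcurrentLinkedQueue<>();
        private final ThreadLocal<Object> perThread = ThreadLocal.withInitial(() -> {
            Object bean = new Object();
            created.add(bean); // strong reference kept, so the bean is never collected
            return bean;
        });

        void invoke() { perThread.get(); }

        int instanceCount() { return created.size(); }
    }

    /** At most maxSize instances in use at once; callers borrow one and hand it back. */
    static final class BoundedPool {
        private final Semaphore permits;
        private final Queue<Object> idle = new ConcurrentLinkedQueue<>();
        private final AtomicInteger created = new AtomicInteger();

        BoundedPool(int maxSize) { this.permits = new Semaphore(maxSize); }

        void invoke() {
            permits.acquireUninterruptibly(); // blocks while maxSize instances are busy
            try {
                Object bean = idle.poll();
                if (bean == null) {           // no idle instance yet: create one
                    bean = new Object();
                    created.incrementAndGet();
                }
                // ... the business method would run here ...
                idle.add(bean);               // return the instance for reuse
            } finally {
                permits.release();
            }
        }

        int instanceCount() { return created.get(); }
    }

    /** Simulates `calls` timer firings, each on a brand-new thread, one after another. */
    static int[] simulate(int calls, int maxSize) {
        ThreadLocalStylePool threadLocal = new ThreadLocalStylePool();
        BoundedPool bounded = new BoundedPool(maxSize);
        for (int i = 0; i < calls; i++) {
            Thread t = new Thread(() -> {
                threadLocal.invoke();
                bounded.invoke();
            });
            t.start();
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] {threadLocal.instanceCount(), bounded.instanceCount()};
    }

    public static void main(String[] args) {
        int[] counts = simulate(1000, 30);
        System.out.println("thread-local style: " + counts[0] + ", bounded: " + counts[1]);
    }
}
```

In this model the thread-local style pool ends up with one instance per firing, while the bounded pool never holds more than maxSize instances regardless of how many threads call it.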

Comment 4 Ian Springer 2011-05-06 16:26:15 UTC

Fixed via [master 1ab97b2]:

On the 10 SLSBs containing one or more methods that are invoked directly or indirectly by an MDB or EJB timer, use the PoolClass annotation to tell the EJB container to use the strict max pool, rather than the threadlocal pool, for those SLSBs (as described in my previous comment). I went with a max size of 60 in hopes of reducing the chances of calls on these SLSBs blocking and the queue getting backed up, but I plan to discuss the best values for the maxSize and timeout options with some other devs and possibly tweak the values.
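The maxSize/timeout trade-off can be sketched with a plain semaphore. This is a hypothetical illustration, not the StrictMaxPool implementation: the class and method names are made up, and it only assumes that a caller waits up to the timeout for a free instance and fails otherwise.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of the maxSize/timeout trade-off; not JBoss code. */
final class StrictPoolTimeoutSketch {
    private final Semaphore permits;
    private final long timeoutMillis;

    StrictPoolTimeoutSketch(int maxSize, long timeoutMillis) {
        this.permits = new Semaphore(maxSize, true); // fair: waiting callers are served in order
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns true if an instance became available before the timeout elapsed. */
    boolean tryBorrow() {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Hands an instance back so that a waiting caller can proceed. */
    void giveBack() { permits.release(); }

    public static void main(String[] args) {
        StrictPoolTimeoutSketch pool = new StrictPoolTimeoutSketch(2, 50);
        System.out.println(pool.tryBorrow()); // true: first instance
        System.out.println(pool.tryBorrow()); // true: second instance
        System.out.println(pool.tryBorrow()); // false: pool exhausted, gives up after 50 ms
        pool.giveBack();
        System.out.println(pool.tryBorrow()); // true: an instance is free again
    }
}
```

A small maxSize caps memory but makes the third kind of call above more likely under load. With a very large timeout such as Long.MAX_VALUE, a blocked caller would effectively wait forever instead of failing.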

Comment 5 Larry O'Leary 2011-05-06 16:32:27 UTC

I would suggest that the value become a configurable option with a default which should be ideal for most systems. The property should be exposed as a meaningful configuration option that could potentially be re-used in other places that need such a limit.
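As a rough sketch of what such an option could look like, the limit could be read from a system property with a built-in default. The property name below is purely hypothetical (it is not an existing RHQ setting), and the default of 60 is simply the value from the fix described in comment 4.

```java
/** Sketch of a configurable pool limit; the property name is hypothetical. */
final class PoolSettings {
    /** Assumed name for illustration only; not an existing RHQ configuration option. */
    static final String MAX_SIZE_PROPERTY = "rhq.server.slsb.pool.max-size";

    /** Default taken from the max size used in the fix for this bug. */
    static final int DEFAULT_MAX_SIZE = 60;

    /** Returns the configured limit, or the default if the property is unset, non-numeric, or not positive. */
    static int maxPoolSize() {
        int configured = Integer.getInteger(MAX_SIZE_PROPERTY, DEFAULT_MAX_SIZE);
        return configured > 0 ? configured : DEFAULT_MAX_SIZE;
    }

    public static void main(String[] args) {
        System.out.println("max pool size: " + maxPoolSize());
    }
}
```

Running the JVM with -Drhq.server.slsb.pool.max-size=100 would then change the limit without a rebuild, which is the point of the suggestion.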

Comment 6 Heiko W. Rupp 2011-05-09 08:29:32 UTC

Since the OOM only occurs at several hundred thousand of those beans, why not just raise that limit to 1k instances?

We may also try to provide the pool settings via a jboss.xml file. From the jboss_4_2.dtd:

<!--
  The container-pool-conf element holds configuration data for the
  instance pool.
  jboss does not read directly the subtree for this element: instead,
  it is passed to the instance pool instance (if it implements
  org.jboss.metadata.XmlLoadable) for it to load its parameters.

  The default instance pools, EntityInstancePool and
  StatelessSessionInstancePool, both accept the following configuration.

  Used in: container-configuration
-->
<!ELEMENT container-pool-conf (MinimumSize?, MaximumSize?,
   strictMaximumSize?, strictTimeout?)>

Comment 7 Ian Springer 2011-05-10 03:14:18 UTC

Heiko, I'm pretty sure container-pool-conf is only for EJB2. However, according to http://community.jboss.org/message/355782, it looks like there is a way to configure the pool class and its max size on a per-bean basis for EJB3 via jboss.xml:

<?xml version="1.0" encoding="UTF-8"?>
<jboss
    xmlns="http://java.sun.com/xml/ns/javaee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
                        http://www.jboss.org/j2ee/schema/jboss_5_0.xsd"
    version="3.0">
  <security-domain>java:/jaas/example-domain</security-domain>
  <enterprise-beans>
    <message-driven>
      <ejb-name>sample.MDB</ejb-name>
      <pool-config>
        <pool-class>org.jboss.ejb3.StrictMaxPool</pool-class>
        <pool-max-size>1</pool-max-size>
        <pool-timeout>5000</pool-timeout>
      </pool-config>
    </message-driven>
  </enterprise-beans>
</jboss>

I'll give this a try in the morning, since doing it this way, versus the annotations, would make it possible to tweak the pool settings without having to recompile from source.

Comment 8 Ian Springer 2011-05-10 03:18:22 UTC

For the record, the 10 SLSBs that need to use a strict pool, based on my analysis of a leaking heap dump, are as follows:

* alert related SLSBs called by alert cond consumer MDB:
AlertConditionLogManagerBean, AlertConditionManagerBean, AlertDampeningManagerBean, AlertDefinitionManagerBean, and CachedConditionManagerBean

(I set the max pool size to 100 for these)

* HA related SLSBs called by periodic EJB timers:
CacheConsistencyManagerBean, CloudManagerBean, ServerManagerBean, StatusManagerBean, and SystemManagerBean

(I set the max pool size to 50 for these)
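For reference, the bean list and sizes above can be captured as data and turned into per-bean descriptor fragments. This is only a sketch: the pool-config element names come from the jboss.xml example in the forum post, but the `<session>` wrapper element is an assumption by analogy with `<message-driven>`, and pool-timeout is left out because the value used is not stated here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: renders strict-pool descriptor fragments for the ten SLSBs listed in this comment. */
final class StrictPoolDescriptorSketch {
    static final List<String> ALERT_BEANS = List.of(
            "AlertConditionLogManagerBean", "AlertConditionManagerBean",
            "AlertDampeningManagerBean", "AlertDefinitionManagerBean",
            "CachedConditionManagerBean");

    static final List<String> HA_BEANS = List.of(
            "CacheConsistencyManagerBean", "CloudManagerBean", "ServerManagerBean",
            "StatusManagerBean", "SystemManagerBean");

    /** Bean name to max pool size: 100 for the alert beans, 50 for the HA beans. */
    static Map<String, Integer> maxSizes() {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        for (String bean : ALERT_BEANS) sizes.put(bean, 100);
        for (String bean : HA_BEANS) sizes.put(bean, 50);
        return sizes;
    }

    /** The <session> wrapper is assumed, not confirmed against the jboss.xml schema. */
    static String render(String ejbName, int maxSize) {
        return String.join("\n",
                "<session>",
                "  <ejb-name>" + ejbName + "</ejb-name>",
                "  <pool-config>",
                "    <pool-class>org.jboss.ejb3.StrictMaxPool</pool-class>",
                "    <pool-max-size>" + maxSize + "</pool-max-size>",
                "  </pool-config>",
                "</session>");
    }

    public static void main(String[] args) {
        maxSizes().forEach((bean, size) -> System.out.println(render(bean, size)));
    }
}
```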

Comment 9 Ian Springer 2011-05-10 18:14:14 UTC

[master d18f973] v2 of the SLSB leak fix, which configures the SLSBs to use the strict max pool via the jboss.xml deployment descriptor, rather than PoolClass annotations on the SLSB classes

Comment 10 Mike Foley 2011-05-10 20:11:43 UTC

I tested this with RHQ profiled with YourKit, as follows:
1) initial number of total threads in RHQ = 49
2) defined an alert on RHQ Agent availability change
3) stopped and started the RHQ Agent a number of times
4) observed that the alert fired on availability state change
5) observed that the total number of threads in the server remained stable at 49 (previously, this number grew without bound, resulting in resource exhaustion)

I sign off on this change for a customer-specific patch (not to be included in RHQ 4.0.1).
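For anyone without a profiler at hand, a similar thread-count reading is available from the standard JMX thread bean. This is a minimal sketch, not what was used for the verification above: it reads the JVM it runs in, so checking the RHQ server this way would require running it inside that JVM or connecting over remote JMX.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Minimal probe that reports thread counts for the JVM it runs in. */
final class ThreadCountProbe {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Current number of live threads, daemon and non-daemon. */
    static int liveThreads() { return THREADS.getThreadCount(); }

    /** Highest live thread count since the JVM started (or since the peak was reset). */
    static int peakThreads() { return THREADS.getPeakThreadCount(); }

    public static void main(String[] args) {
        System.out.println("live threads: " + liveThreads() + ", peak: " + peakThreads());
    }
}
```

Sampling liveThreads() before and after cycling the agent gives the same kind of before/after comparison as steps 1 and 5.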

Comment 11 Mike Foley 2011-05-10 20:38:25 UTC

attaching documentation of the verification.  the attached image shows the thread count on the RHQ server as i cycle the RHQ Agent (which is triggering alerts).

Comment 12 Mike Foley 2011-05-10 20:41:02 UTC

Created attachment 498157
documenting the verification ... image shows thread count stable as i cycle the rhq-agent

Comment 13 Mike Foley 2011-05-10 20:42:23 UTC

marking this verified, as described above.

Comment 14 Mike Foley 2011-05-11 14:52:16 UTC

re-opening to perform additional tests, as follows:

1) configure 5 alerts to fire every 30 seconds.  using platform resource, the alerts are defined as firing if free memory > 1.
2) confirm the alerts are firing.
3) attach profiler to RHQ server.
4) perform GC, and mark instance counts in heap.  also record total number of classes, total number of threads, and total heap size.
5) let alerts fire for 30 minutes
6) perform GC.
7) record the total number of classes, total heap size, and total thread count at the end of the test.  
8) look at changes in the heap ... and record the class names of the classes whose instance counts grew unbounded.
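Step 8 amounts to diffing two per-class instance-count snapshots. A minimal sketch of that comparison follows; the snapshot maps would come from whatever profiler or histogram tool is in use, and the class names and numbers in main are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: finds classes whose instance count grew between two heap snapshots. */
final class InstanceCountDiff {

    /** Returns class name to growth, for every class with more instances after than before. */
    static Map<String, Long> grown(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, Long> entry : after.entrySet()) {
            long delta = entry.getValue() - before.getOrDefault(entry.getKey(), 0L);
            if (delta > 0) {
                result.put(entry.getKey(), delta);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical snapshots taken after a GC at the start and end of the test window.
        Map<String, Long> before = Map.of("ExampleStableBean", 50L, "ExampleLeakingBean", 40L);
        Map<String, Long> after = Map.of("ExampleStableBean", 50L, "ExampleLeakingBean", 100L);
        System.out.println(grown(before, after)); // {ExampleLeakingBean=60}
    }
}
```

Growth over a single window does not prove a leak by itself; a class is a suspect when its count keeps rising across repeated GC-then-snapshot cycles.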

Comment 15 Mike Foley 2011-05-11 15:20:10 UTC

Documenting the results of the test defined above:

At a macro level ... the total memory and total instance count are fairly stable.  At a more detailed level, I see 3 slsbs leaking:  

AlertManagerBean
AuthorizationManagerBean
SubjectManagerBean

I am documenting this by attaching a screenshot which shows the classes that are still growing in an unbounded manner as alerts fire.  

I have discussed this with ips ... and he is going to check in some more changes, and I will retest.

Comment 16 Mike Foley 2011-05-11 15:21:01 UTC

Created attachment 498320
This file shows 3 slsbs still leaking.

Comment 17 Mike Foley 2011-05-12 11:27:23 UTC

Changes were made by ips to address the 3 slsbs still leaking.  I retested with the latest build.  The attached image shows the 3 slsbs (AlertManagerBean, AuthorizationManagerBean, and SubjectManagerBean, as well as their associated interceptor classes) no longer leaking.  This is fixed.  Nice work ips!!

Comment 18 Mike Foley 2011-05-12 11:27:58 UTC

Created attachment 498513
Image of final verification.

Comment 19 Ian Springer 2011-05-13 17:18:40 UTC

Fix has been committed to the release-4.0.0 branch - commit 838d9ff.

Comment 20 Corey Welton 2011-05-24 01:13:01 UTC

Bookkeeping - closing bug - fixed in recent release.

Comment 23 Ian Springer 2011-08-03 16:42:45 UTC

*** Bug 676035 has been marked as a duplicate of this bug. ***