Description of problem:
The RHQ server has a slow memory leak related to EJB stateless session beans remaining in the pool and increasing in count slowly. Our Production RHQ server ran out of memory after 6 months of continuous operation. A heap dump and analysis with Memory Analyzer Toolkit indicated that the following Session Beans were consuming the following memory.
ServerManagerBean - 803700 instances, 475 MB
CloudManagerBean - 803243 instances, 462 MB
CacheConsistencyManagerBean - 401829 instances, 228 MB
StatusManagerBean - 401704 instances, 228 MB
SystemManagementBean - 202163 instances, 168 MB
The RHQ monitoring of the heap size of its own JVM showed a straight-line decline in Free Heap over the 6 months.
Version-Release number of selected component (if applicable):

How reproducible:
Run the RHQ server for 6 months

Steps to Reproduce:
1. Configure the RHQ server
2. Configure a number of agents
3. Leave running for a long period of time (months)

Actual results:
Free Heap decreases over time

Expected results:
Free Heap remains constant
Session Beans are configured to use the infinite pool. They should be changed to a strict max pool.
Sounds like a potentially easy fix; we just need to investigate any other repercussions of changing the pooling strategy.
The following forum post discusses this issue:
Based on that, it sounds like we might want to use the strict pool only for MDBs and EJB timers, since those are the only EJB methods that get called within unbounded thread pools. Presumably, in all other cases, we should stick with the infinite thread pool, since it will be more performant (we should verify this with Carlo).
https://issues.jboss.org/browse/EJBTHREE-1330 describes the issue with EJB timers and infinite thread pools. 3 out of 5 of the session beans Steve lists in the description as consuming the most heap have EJB timer methods (methods annotated with @Timeout). Specifically:
CacheConsistencyManagerBean.handleHeartbeatTimer() // called every 30s
ServerManagerBean.handleHeartbeatTimer() // called every 30s
SystemManagerBean.reloadConfigCache() // called every 60s
so just these three methods will result in 4+ new session bean instances getting created (and never destroyed) per minute. Additional session bean instances will get created by any other EJB calls these 3 methods make, and they do indeed call methods in CloudManagerBean and StatusManagerBean...
So I think annotating these 5 session bean classes with:
@org.jboss.annotation.ejb.PoolClass (value=org.jboss.ejb3.StrictMaxPool.class, maxSize=30, timeout=9223372036854775807L)
may solve 90% of the heap leakage.
Note, I have no idea what good values for maxSize and timeout would be (the values above are the defaults).
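To see why a bounded pool stops the leak, here is a JDK-only toy model of strict-max-pool semantics (this is my own sketch, not JBoss's StrictMaxPool implementation): at most maxSize instances ever exist, and a caller blocks waiting for a free instance instead of constructing a new one per thread, which is what the thread-local pool effectively did under the timer threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of strict-max-pool semantics: at most maxSize bean instances
// are ever constructed; callers block (up to a timeout) for a free
// instance rather than creating unbounded new ones.
public class StrictPoolSketch {
    static final AtomicInteger created = new AtomicInteger();

    static class Bean {
        Bean() { created.incrementAndGet(); }
        void invoke() { /* business method would run here */ }
    }

    static class BoundedPool {
        private final BlockingQueue<Bean> idle;
        private final int maxSize;
        private int constructed = 0;

        BoundedPool(int maxSize) {
            this.maxSize = maxSize;
            this.idle = new ArrayBlockingQueue<>(maxSize);
        }

        // Construct a new instance only while under the hard cap.
        private synchronized Bean tryCreate() {
            if (constructed < maxSize) { constructed++; return new Bean(); }
            return null;
        }

        Bean get(long timeoutMs) throws InterruptedException {
            Bean b = idle.poll();
            if (b != null) return b;
            b = tryCreate();
            if (b != null) return b;
            // Cap reached: wait for a release instead of growing the pool.
            b = idle.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (b == null) throw new IllegalStateException("pool exhausted");
            return b;
        }

        void release(Bean b) { idle.offer(b); }
    }

    public static void main(String[] args) throws Exception {
        BoundedPool pool = new BoundedPool(30);
        // Simulate 100k timer firings; the instance count stays bounded
        // instead of growing by one per invocation.
        for (int i = 0; i < 100_000; i++) {
            Bean b = pool.get(1000);
            b.invoke();
            pool.release(b);
        }
        System.out.println("instances created: " + created.get());
    }
}
```

Since the simulated firings here are sequential and each bean is released before the next get, only one instance is ever constructed; with the unbounded thread-local pool, each timer-service thread would have pinned its own instance forever.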
Fixed via [master 1ab97b2]:
On the 10 SLSBs containing one or more methods that are invoked directly or indirectly by an MDB or EJB timer, use the PoolClass annotation to tell the EJB container to use the strict max pool, rather than the thread-local pool, for those SLSBs (as described in my previous comment). I went with a max size of 60 in hopes of reducing the chances of calls on these SLSBs blocking and the queue potentially getting backed up, but I plan to discuss the best values for the maxSize and timeout options with some other devs and possibly tweak the values.
I would suggest that the value become a configurable option with a default which should be ideal for most systems. The property should be exposed as a meaningful configuration option that could potentially be re-used in other places that need such a limit.
When the OOM only occurs at several hundred thousand instances of those beans, why not just up that limit to 1k instances?
We may also try to provide the pool settings via a jboss.xml file. From the jboss_4_2.dtd:

  The container-pool-conf element holds configuration data for the
  instance pool. jboss does not read directly the subtree for this
  element: instead, it is passed to the instance pool instance (if it
  implements org.jboss.metadata.XmlLoadable) for it to load its
  parameters.

  The default instance pools, EntityInstancePool and
  StatelessSessionInstancePool, both accept the following configuration.

  Used in: container-configuration

  <!ELEMENT container-pool-conf (MinimumSize?, MaximumSize?,
      strictMaximumSize?, strictTimeout?)>
Heiko, I'm pretty sure container-pool-conf is only for EJB2. However, according to http://community.jboss.org/message/355782, it looks like there is a way to configure the pool class and its max size on a per-bean basis for EJB3 via jboss.xml:
<?xml version="1.0" encoding="UTF-8"?>
I'll give this a try in the morning, since doing it this way, versus the annotations, would make it possible to tweak the pool settings without having to recompile from source.
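Assuming the per-bean mechanism described in that community post works as advertised, the descriptor would look roughly like this (the element names and the pool size here are my reading of that post, not a verified schema; bean name taken from this bug):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<jboss>
  <enterprise-beans>
    <session>
      <ejb-name>CacheConsistencyManagerBean</ejb-name>
      <!-- Replace the default thread-local pool with a bounded one -->
      <pool-config>
        <pool-class>org.jboss.ejb3.StrictMaxPool</pool-class>
        <pool-max-size>50</pool-max-size>
        <pool-timeout>60000</pool-timeout>
      </pool-config>
    </session>
  </enterprise-beans>
</jboss>
```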
For the record, the 10 SLSBs that need to use a strict pool, based on my analysis of a leaking heap dump, are as follows:
* alert related SLSBs called by alert cond consumer MDB:
AlertConditionLogManagerBean, AlertConditionManagerBean, AlertDampeningManagerBean, AlertDefinitionManagerBean, and CachedConditionManagerBean
(I set the max pool size to 100 for these)
* HA related SLSBs called by periodic EJB timers:
CacheConsistencyManagerBean, CloudManagerBean, ServerManagerBean, StatusManagerBean, and SystemManagerBean
(I set the max pool size to 50 for these)
[master d18f973] v2 of the SLSB leak fix, which configures the SLSBs to use the strict max pool via the jboss.xml deployment descriptor, rather than PoolClass annotations on the SLSB classes
I have tested this as follows. With RHQ profiled with YourKit, I observe the following:
1) initial number of total threads in RHQ = 49
2) defined an alert on RHQ Agent availability change
3) stopped and started the RHQ Agent a number of times
4) observed the alert has fired on availability state change
5) observed the total number of threads in the server remain stable at 49. (previously, this number increased unbounded resulting in resource exhaustion)
I sign off on this change for a customer-specific patch (not to be included in RHQ 4.0.1).
Attaching documentation of the verification. The attached image shows the thread count on the RHQ server as I cycle the RHQ Agent (which is triggering alerts).
Created attachment 498157 [details]
Documenting the verification ... image shows thread count stable as I cycle the rhq-agent
Marking this verified, as described above.
Re-opening to perform additional tests, as follows:
1) configure 5 alerts to fire every 30 seconds. using platform resource, the alerts are defined as firing if free memory > 1.
2) confirm the alerts are firing.
3) attach profiler to RHQ server.
4) perform GC, and mark instance counts in heap. Also record total number of classes, total number of threads, and total heap size.
5) let alerts fire for 30 minutes
6) perform GC.
7) record the total number of classes, total heap size, and total thread count at the end of the test.
8) look at changes in the heap ... and record the class names of the classes whose instance counts grew unbounded.
Documenting the results of the test defined above:
At a macro level, the total memory and total instance count are fairly stable. At a more detailed level, I see 3 SLSBs still leaking.
I am documenting this by attaching a screenshot which shows the classes that are still growing in an unbounded manner as alerts fire.
I have discussed this with ips ... and he is going to check in some more changes, and I will retest.
Created attachment 498320 [details]
This file shows 3 SLSBs still leaking.
Changes made by ips to address the 3 SLSBs still leaking. I retested with the latest build. The attached image shows the 3 SLSBs (AlertManagerBean, AuthorizationManagerBean, and SubjectManagerBean, as well as their associated interceptor classes) no longer leaking. This is fixed. Nice work ips!!
Created attachment 498513 [details]
Image of final verification.
Fix has been committed to the release-4.0.0 branch - commit 838d9ff.
Bookkeeping - closing bug - fixed in recent release.
*** Bug 676035 has been marked as a duplicate of this bug. ***