Red Hat Bugzilla – Bug 788638
as-5 plugin consumes too much memory for managed components
Last modified: 2013-09-01 06:07:35 EDT
Description of problem:
Some of the resources from the AS 5 plugin can wind up consuming far too much memory, which can lead to OutOfMemoryErrors on the agent. We first noticed this with EJB2Component. In my local testing environment, using some simple, dummy EJBs, I have observed EJB2Components consuming 260 KB of heap space per component (i.e., per individual EJB). In some real-world deployments, we have seen them consume as much as 1.5 MB.
After doing analysis with some heap dumps we have determined that the problem is not a memory leak.
Some of the managed objects that the AS 5 plugin gets back from the profile service have some very large properties. The managed object for the EJB2Component for example has a managed property named BeanMetaData that is really big. In my local environment, I have seen it use around 155 KB of heap space. Looking at some heap dumps of real world deployments I have seen the BeanMetaData property consume over 300 KB of heap space.
It turns out that we do not need or use all of the properties on the managed objects we get back from the profile service. We do not use the BeanMetaData property, so we can filter it out to reduce the footprint of each managed object.
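As a rough illustration of the filtering idea (this is not the actual plugin code; the class, the plain property map, and the helper names below are stand-ins for the profile service's managed-object API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class PropertyFilterSketch {
    // Properties the plugin never reads; dropping them shrinks each cached object.
    private static final Set<String> UNUSED_PROPERTIES = Set.of("BeanMetaData");

    // Remove unused entries from a property map before holding onto it.
    static Map<String, Object> filterProperties(Map<String, Object> properties) {
        Map<String, Object> filtered = new HashMap<>(properties);
        filtered.keySet().removeAll(UNUSED_PROPERTIES);
        return filtered;
    }

    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        props.put("BeanMetaData", new byte[155 * 1024]); // stand-in for the ~155 KB blob
        props.put("RunState", "RUNNING");
        Map<String, Object> slim = filterProperties(props);
        System.out.println(slim.containsKey("BeanMetaData")); // false
        System.out.println(slim.containsKey("RunState"));     // true
    }
}
```

The point is simply that the large, unused entries are dropped before the object is retained, so each component's long-lived heap footprint no longer includes them.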
There may be other large properties we do not use that we can filter out to further reduce the memory footprint. Also note that this issue may not be limited to EJB2 resources. We have observed that WebApplicationContextComponent uses even more heap space; however, there will likely be far fewer web components than EJB components in an application.
We are also investigating EJB3Component to determine whether or not it is holding onto any really large managed properties that we do not need/use.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Another big property, relatively speaking, for EJB2Component is one named URL. This is another one we do not use. In my local environment it is about 50 KB. To put it in perspective, if we can filter out the URL and BeanMetaData properties, we would reduce the footprint of the EJB2Component by about 205 KB (in my environment). That would be a reduction in heap usage of over 75%.
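The arithmetic behind that estimate, using the local-environment numbers quoted above (a small sketch, not part of the plugin):

```java
public class FootprintMath {
    // Percent heap reduction from dropping unused properties out of a total footprint.
    static double reductionPercent(int totalKb, int... removedKb) {
        int saved = 0;
        for (int kb : removedKb) {
            saved += kb;
        }
        return 100.0 * saved / totalKb;
    }

    public static void main(String[] args) {
        // 260 KB total per EJB2Component; 155 KB (BeanMetaData) + 50 KB (URL) removable.
        double pct = reductionPercent(260, 155, 50);
        System.out.printf("Reduction: %.1f%%%n", pct); // prints "Reduction: 78.8%"
    }
}
```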
I need to make a slight correction regarding my previous comment: I got the property name wrong. The correct property name is EJBModule; URL is a nested property inside of EJBModule. The 50 KB size is correct, though.
We have seen this issue with EJB 3 resources as well. In one heap dump I saw EJBComponents consuming over 9 MB of heap. The deployment property of the ManagedComponent corresponding to the EJB3Component consumes the bulk of that 9 MB.
This issue is not specific to EJB2 or EJB3 resources; however, it is most likely to manifest with them because an application will typically contain many more EJB resources than WAR, EAR, data source, etc. resource types. The reason the problem is not specific to any one of these types is that EJB2Component, EJB3Component, and a number of other types inherit from a common base class, ManagedComponentComponent. It is in ManagedComponentComponent that we cache the managed objects we get back from the profile service.
The fix for this has been pushed to master.
master commit hashes:
There are two changes in these commits. First, ManagedComponentComponent no longer caches the ManagedComponent it gets back from the profile service. ManagedComponentComponent is the base class for most of the service types in the plugin, including all EJB 2/3 resource types, so this change effectively addresses the high memory usage issues described in previous comments.
Secondly, ManagedComponentComponent.getAvailability has been refactored to avoid the high CPU utilization issues that were addressed in bug 647571. There is now a configurable interval that determines when the data used for availability checks needs to be refreshed, i.e., when the managed component is reloaded from the profile service. That data, which is the RunState property of the ManagedComponent, is updated any time the ManagedComponent is retrieved from the profile service, such as during metric collections, and the interval is reset as well. The interval is configurable from the top-level application server resource via a new plugin configuration property named serviceAvailabilityRefreshInterval, which defaults to 15 minutes.
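A minimal sketch of how such an interval-based refresh can work; the class and method names below are illustrative stand-ins, not the actual ManagedComponentComponent code:

```java
import java.util.concurrent.TimeUnit;

// Sketch: cache the RunState and only reload the (large) managed component
// from the profile service when the refresh interval has elapsed.
public class AvailabilitySketch {
    private final long refreshIntervalMillis;
    private long lastRefreshMillis;
    private String cachedRunState = "UNKNOWN";

    AvailabilitySketch(long refreshIntervalMinutes) {
        this.refreshIntervalMillis = TimeUnit.MINUTES.toMillis(refreshIntervalMinutes);
        this.lastRefreshMillis = -this.refreshIntervalMillis; // force an initial load
    }

    // Called on every availability check; reloads only when the interval is exceeded.
    String getAvailability(long nowMillis) {
        if (nowMillis - lastRefreshMillis >= refreshIntervalMillis) {
            cachedRunState = loadRunStateFromProfileService();
            lastRefreshMillis = nowMillis;
        }
        return "RUNNING".equals(cachedRunState) ? "UP" : "DOWN";
    }

    // Any other retrieval (e.g. a metric collection) refreshes the cached
    // state and resets the interval as well.
    void onManagedComponentRetrieved(String runState, long nowMillis) {
        cachedRunState = runState;
        lastRefreshMillis = nowMillis;
    }

    private String loadRunStateFromProfileService() {
        return "RUNNING"; // stand-in for the real profile-service call
    }
}
```

The trade-off is the one described above: availability checks stay cheap between refreshes, at the cost of the reported state being up to one interval stale unless some other retrieval refreshes it sooner.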
For QE testing, we first want to make sure these changes have not introduced any regressions: we want to verify that metric collections, availability checks, etc. still behave as expected. If QE wants to do any performance testing involving large numbers of EJBs, please see https://fedorahosted.org/pipermail/rhq-devel/2012-February/001481.html and feel free to contact me directly via email or IRC for additional help with getting set up.
QE might also want to test the new plugin configuration property, serviceAvailabilityRefreshInterval, which belongs to the JBossAS Server resource type. The following could be done to test it. Pick some child types of JBossAS Server (e.g., EJB types, data sources, etc.) and disable metric collections for those types. Set serviceAvailabilityRefreshInterval to a really low value, like 1 minute. On your agent, enable debug logging for the class org.rhq.plugins.jbossas5.ManagedComponentComponent. When an availability check is done and the refresh interval is exceeded, as it should be with an interval of 1 minute, you should see log messages of the form, "The availability refresh interval for [resourceKey: <resource_key>, type: <resource_type>, name: <component_name>] has been exceeded by....Reloading managed component."
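Assuming the agent reads a log4j 1.x XML configuration (typically conf/log4j.xml; the exact path and format are an assumption here), debug logging for that class could be enabled with a category entry along these lines:

```xml
<!-- Hypothetical snippet for the agent's log4j.xml: enable DEBUG for the
     availability-refresh log messages described above. -->
<category name="org.rhq.plugins.jbossas5.ManagedComponentComponent">
   <priority value="DEBUG"/>
</category>
```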
This bug is still marked as ON_QA, but fixes appear to be included in RHQ 4.3 and 4.4. Did the status need to be flipped over, or am I mis-reading the git log and the fixes aren't included?
(In reply to comment #6)
> This bug is still marked as ON_QA, but fixes appear to be included in RHQ
> 4.3 and 4.4. Did the status need to be flipped over, or am I mis-reading the
> git log and the fixes aren't included?
The status is correct. This bug has been fixed in RHQ 4.3; it remains ON_QA until it is VERIFIED as fixed. The product version of this bug is Bug 790908, which has been VERIFIED in JON 3.0.1.
I think that the status just did not get updated. Moving to VERIFIED.
(In reply to comment #8)
> I think that the status just did not get updated. Moving to VERIFIED.
QE hasn't VERIFIED this yet. It should remain ON_QA until QE verifies.
I changed it to VERIFIED on the basis that bug 790908 is a clone of this bug and has already been verified by QE. Since this fix has already been included in a community release and has been verified by QE on a product release branch, should I expect QE to go back and verify this one?
For now I will move back to ON_QA until this gets sorted out.
Bulk closing of items that are ON_QA and in old RHQ releases, which have been out for a long time and where the issue has not been re-opened since.