Bug 815979 - Platform plugin memory metrics are not representative of available memory
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Plugin -- Other
Version: JON 3.0.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: JON 3.1.0
Assigned To: John Sanda
QA Contact: Mike Foley
Depends On: 805987
Blocks:
Reported: 2012-04-24 18:53 EDT by Larry O'Leary
Modified: 2013-09-05 14:50 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 805987
Environment:
Last Closed: 2013-09-05 14:50:28 EDT
Type: Enhancement

External Trackers
Tracker                            ID      Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  108083  None      None    None     Never

Description Larry O'Leary 2012-04-24 18:53:37 EDT
+++ This bug was initially created as a clone of upstream RHQ Project Bug #805987 +++

Description of problem:
Different tools might report different numbers for the memory usage of a system, and none of them is necessarily wrong. The tools may simply be reporting different quantities. When I view the memory usage of a platform in RHQ, the numbers are often deceptively high. We report the free and used memory, which on Linux systems can be found in /proc/meminfo. For example, on my box,

bash-4.2$ cat /proc/meminfo
MemTotal:       16424944 kB
MemFree:         1092888 kB
Buffers:          590732 kB
Cached:          6969032 kB

According to these numbers I have about 16 GB of RAM and about 1 GB free, or I am at about 92% memory utilization. This is what RHQ reports. According to the system monitor application, though, I am only at about 45% memory utilization. The numbers we report are not really representative of the memory that is actually free or available because they fail to take into account the buffers and cached values seen above. If a process needs more memory, the kernel can and will reclaim space from the cache or buffers. The system monitor app reports 45% because it factors the buffers and cached values into the equation.
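
For illustration, the two ways of counting can be worked through with the /proc/meminfo values quoted above. This is only a sketch of the arithmetic, not RHQ or System Monitor code, and the variable names are made up for the example. Note that this single snapshot works out to roughly 93% and 47%, slightly different from the rounded 92% and 45% quoted in the description.

```python
# Values copied from the /proc/meminfo output above, in kB.
mem_total = 16424944
mem_free = 1092888
buffers = 590732
cached = 6969032

# Raw view: everything that is not MemFree counts as used.
raw_used = mem_total - mem_free
raw_pct = 100.0 * raw_used / mem_total

# Adjusted view: buffers and page cache are treated as reclaimable,
# so they are not counted as used.
adjusted_used = mem_total - mem_free - buffers - cached
adjusted_pct = 100.0 * adjusted_used / mem_total

print(f"raw used:      {raw_used} kB ({raw_pct:.1f}%)")
print(f"adjusted used: {adjusted_used} kB ({adjusted_pct:.1f}%)")
```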

Suppose my box were a production machine running some EAP servers, and I wanted to use RHQ to monitor overall system memory usage. If I see the 92% utilization, some panic is going to set in. What I really want to see, though, is the 45% utilization as reported by System Monitor.

Lastly, I have looked at the Sigar docs, and I don't think it exposes the cached and buffers data, but parsing /proc/meminfo to get the memory metrics would be easy enough.
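
A rough sketch of what parsing /proc/meminfo could look like is below. It is not taken from RHQ or Sigar; the function names (parse_meminfo, actual_memory, read_meminfo) and the result keys are hypothetical, and it assumes the simple definition used in this report, where buffers and cached memory count as free.

```python
SAMPLE_MEMINFO = """\
MemTotal:       16424944 kB
MemFree:         1092888 kB
Buffers:          590732 kB
Cached:          6969032 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name: <number> [kB]" lines; skip anything malformed.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def actual_memory(values):
    """Derive free/used figures that treat buffers and cache as free (assumption)."""
    actual_free = values["MemFree"] + values["Buffers"] + values["Cached"]
    return {
        "actual_free_kb": actual_free,
        "actual_used_kb": values["MemTotal"] - actual_free,
    }


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file; only meaningful on Linux."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


summary = actual_memory(parse_meminfo(SAMPLE_MEMINFO))
print(summary)
```

On a Linux box, actual_memory(read_meminfo()) would give the same figures for the live system.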

--- Additional comment from jsanda@redhat.com on 2012-03-22 12:01:36 EDT ---

Created attachment 572028
platform memory metrics

Here is a screenshot of the platform memory metrics reported by RHQ. The average free memory reported is about 1.23 GB. I have a total of 16 GB of RAM. That would mean I am at about 92% memory utilization. While these numbers are not in and of themselves wrong, they are not representative of what's really going on. If I were actually at 92% memory utilization, my machine would be near worthless for development, but fortunately it's pretty snappy :)

--- Additional comment from jsanda@redhat.com on 2012-03-22 12:03:53 EDT ---

Created attachment 572029
RHQ platform utilization report

Here is the platform utilization report which shows my system's overall memory usage at about 93.5%.

--- Additional comment from jsanda@redhat.com on 2012-03-22 12:06:21 EDT ---

Created attachment 572030
Gnome System Monitor app

Here is a screenshot of System Monitor running on my box. Note that it reports about 47% memory usage which is in stark contrast to the 92% or 93% reported by RHQ.

--- Additional comment from jsanda@redhat.com on 2012-03-22 12:09:34 EDT ---

Created attachment 572031
memory reported by htop

This screenshot shows memory usage reported by htop. It reports roughly 7 GB in use which works out to about 44% overall memory usage.

--- Additional comment from ccrouch@redhat.com on 2012-03-22 14:13:53 EDT ---

Note: 
(12:32:29 PM) ccrouch: so we're reporting MemUsed=MemTotal-MemFree ?
(12:32:38 PM) jsanda: yeah


Interesting analysis John. I agree that nothing appears broken here, but we could be doing a better job of collecting more representative metrics.

My suggestion for a next step would be to raise an RFE on Sigar to add support for Buffers and Cached metrics. There may very well be Windows equivalents that we should be picking up too. For right now I really prefer to keep as many of our platform-specific metrics as possible going through Sigar, rather than doing our own scanning of /proc/meminfo. The step after that would, I think, be predicated on enhancements to the underlying alerts subsystem, e.g. letting you compare the relative size of two metrics.

--- Additional comment from jsanda@redhat.com on 2012-03-22 15:29:44 EDT ---

Looks like this feature has been in Sigar already for some time. See https://jira.hyperic.com/browse/SIGAR-188. We are collecting metrics for Native.MemoryInfo.free and Native.MemoryInfo.used, but the more representative metrics are Native.MemoryInfo.actualFree and Native.MemoryInfo.actualUsed, both of which are available in the Mem class in the version of Sigar that we currently use.

I am not sure that I entirely understand the part about comparing the relative size of the two metrics. I propose the following. We collect both sets of metrics, and provide better, more accurate descriptions for them. The description for the used memory metric is, "The total used system memory". That simply is not accurate. And for the platform utilization report, I propose that we use the actualUsed metric.

As it stands right now, I don't see how anyone can reliably use the free and used memory metrics for alerting.
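
To make the alerting concern concrete, here is a minimal sketch of a threshold check applied to both kinds of "used" value. It does not use the RHQ alert API; the breaches function and the 90% threshold are hypothetical, and the input figures are derived from the /proc/meminfo sample in the description.

```python
def breaches(used_kb, total_kb, threshold=0.90):
    """Return True when used/total exceeds the (hypothetical) alert threshold."""
    return used_kb / total_kb > threshold


total_kb = 16424944
raw_used_kb = 15332056      # MemTotal - MemFree
actual_used_kb = 7772292    # MemTotal - MemFree - Buffers - Cached

# The raw metric trips the alert even though the box has plenty of
# reclaimable memory; the buffers/cache-adjusted metric does not.
print("raw used fires alert:   ", breaches(raw_used_kb, total_kb))
print("actual used fires alert:", breaches(actual_used_kb, total_kb))
```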

--- Additional comment from mfoley@redhat.com on 2012-03-26 11:36:36 EDT ---

per BZ triage (crouch, loleary, asantos)

--- Additional comment from ccrouch@redhat.com on 2012-03-26 12:02:11 EDT ---

If this is a small amount of work we should try to add those metrics for rhq4.4

--- Additional comment from jsanda@redhat.com on 2012-03-26 12:20:08 EDT ---

This is a small amount of work. I can definitely knock it out for RHQ 4.4.

--- Additional comment from jsanda@redhat.com on 2012-04-19 13:33:27 EDT ---

The actual free and actual used metrics have been added to the platform plugins. The descriptions for the metrics have been updated as well to reflect which metrics do and do not take into account caches and buffers. Lastly, the platform utilization report has been updated to use the new, more representative metrics for memory consumption.

master commit hash: 5420259201d92a13da1c24b752410a1c853ade46
Comment 1 John Sanda 2012-05-08 09:35:07 EDT
I set this to ON_QA by accident thinking this bug was targeted for RHQ 4.4.0. Moving back to MODIFIED.
Comment 2 Charles Crouch 2012-05-14 16:52:16 EDT
Pushing to ON_QA; any recent JON build will have this fix.
Comment 3 Mike Foley 2013-09-05 14:50:28 EDT
bugzilla clean up of old issues.
