Bug 805987 - Platform plugin memory metrics are not representative of available memory
Summary: Platform plugin memory metrics are not representative of available memory
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RHQ Project
Classification: Other
Component: Plugins
Version: 4.4
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: RHQ 4.4.0
Assignee: John Sanda
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks: jon310-sprint11, rhq44-sprint11 815979
Reported: 2012-03-22 15:56 UTC by John Sanda
Modified: 2013-09-01 10:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 815979 (view as bug list)
Environment:
Last Closed: 2013-09-01 10:11:15 UTC
Embargoed:


Attachments (Terms of Use)
platform memory metrics (32.14 KB, image/png)
2012-03-22 16:01 UTC, John Sanda
RHQ platform utilization report (13.80 KB, image/png)
2012-03-22 16:03 UTC, John Sanda
Gnome System Monitor app (93.85 KB, image/png)
2012-03-22 16:06 UTC, John Sanda
memory reported by htop (56.52 KB, image/png)
2012-03-22 16:09 UTC, John Sanda

Description John Sanda 2012-03-22 15:56:31 UTC
Description of problem:
Different tools may report different numbers for the memory usage of a system without any of them being wrong; they may simply be measuring different things. When I view the memory usage of a platform in RHQ, the numbers are often deceptively high. We report the free and used memory, which on Linux systems can be found in /proc/meminfo. For example, on my box,

bash-4.2$ cat /proc/meminfo
MemTotal:       16424944 kB
MemFree:         1092888 kB
Buffers:          590732 kB
Cached:          6969032 kB

According to these numbers I have about 16 GB of RAM and only about 1 GB free, which puts me at roughly 93% memory utilization. This is what RHQ reports. According to the system monitor application, though, I am only at about 45% memory utilization. The numbers we report are not really representative of the memory that is actually free or available, because they fail to take into account the buffers and cached values seen above. If a process needs more memory, the kernel can and will reclaim space from the buffers and cache. The system monitor app reports 45% because it factors the buffers and cached memory into the equation.
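To make the two views concrete, here is a minimal sketch of the arithmetic using the four values from the /proc/meminfo excerpt above. The variable names are illustrative only, and treating all of Buffers and Cached as reclaimable is a simplifying assumption, not necessarily what any particular tool does.

```python
# Values copied from the /proc/meminfo excerpt above (all in kB).
mem_total = 16424944
mem_free = 1092888
buffers = 590732
cached = 6969032

# Naive view: everything that is not MemFree counts as used.
naive_used = mem_total - mem_free
naive_pct = 100.0 * naive_used / mem_total

# Adjusted view (assumption): buffers and page cache are reclaimable,
# so they are subtracted from the used figure.
adjusted_used = naive_used - buffers - cached
adjusted_pct = 100.0 * adjusted_used / mem_total

print(f"naive:    {naive_pct:.1f}% used")     # prints "naive:    93.3% used"
print(f"adjusted: {adjusted_pct:.1f}% used")  # prints "adjusted: 47.3% used"
```

The gap between the two results is entirely the Buffers and Cached lines, which together account for a little over 7 GB on this box.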

Suppose my box were a production machine running some EAP servers, and I wanted to use RHQ to monitor overall system memory usage. If I saw utilization above 90%, some panic would set in. What I really want to see, though, is the 45% utilization reported by System Monitor.

Lastly, I have looked at the Sigar docs, and I don't think it exposes the cached and buffers data, but parsing /proc/meminfo to get the memory metrics would be easy enough.
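As a rough illustration of how little parsing is involved, the sketch below reads "Name: value kB" lines into a dictionary and derives a utilization figure that excludes buffers and cache. The function names `parse_meminfo` and `actual_used_percent` are hypothetical and not part of RHQ or Sigar, and the sample text is just the excerpt from this report.

```python
SAMPLE = """\
MemTotal:       16424944 kB
MemFree:         1092888 kB
Buffers:          590732 kB
Cached:          6969032 kB
"""


def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to integer values."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def actual_used_percent(fields):
    """Percent of memory in use, not counting buffers and page cache."""
    total = fields["MemTotal"]
    # Assumption: all of Buffers and Cached can be reclaimed on demand.
    reclaimable = fields.get("Buffers", 0) + fields.get("Cached", 0)
    used = total - fields["MemFree"] - reclaimable
    return 100.0 * used / total


fields = parse_meminfo(SAMPLE)
print(f"{actual_used_percent(fields):.1f}% used")
```

On a Linux host the same functions could be fed the contents of /proc/meminfo instead of the sample string. Other platforms have no such file, which is one argument for going through a portability layer rather than parsing it directly.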


Comment 1 John Sanda 2012-03-22 16:01:36 UTC
Created attachment 572028 [details]
platform memory metrics

Here is a screenshot of the platform memory metrics reported by RHQ. The average free memory reported is about 1.23 GB. I have a total of 16 GB of RAM. That would mean I am at about 92% memory utilization. While these numbers are not in and of themselves wrong, they are not representative of what's really going on. If I were actually at 92% memory utilization, my machine would be near worthless for development, but fortunately it's pretty snappy :)

Comment 2 John Sanda 2012-03-22 16:03:53 UTC
Created attachment 572029 [details]
RHQ platform utilization report

Here is the platform utilization report which shows my system's overall memory usage at about 93.5%.

Comment 3 John Sanda 2012-03-22 16:06:21 UTC
Created attachment 572030 [details]
Gnome System Monitor app

Here is a screenshot of System Monitor running on my box. Note that it reports about 47% memory usage which is in stark contrast to the 92% or 93% reported by RHQ.

Comment 4 John Sanda 2012-03-22 16:09:34 UTC
Created attachment 572031 [details]
memory reported by htop

This screenshot shows memory usage reported by htop. It reports roughly 7 GB in use which works out to about 44% overall memory usage.

Comment 5 Charles Crouch 2012-03-22 18:13:53 UTC
Note: 
(12:32:29 PM) ccrouch: so we're reporting MemUsed=MemTotal-MemFree ?
(12:32:38 PM) jsanda: yeah


Interesting analysis John. I agree that nothing appears broken here, but we could be doing a better job of collecting more representative metrics.

My suggestion for a next step would be to raise an RFE on Sigar to add support for Buffers and Cached metrics. There may very well be Windows equivalents that we should be picking up too. For right now, I would really prefer to keep as many of our platform-specific metrics as possible going through Sigar rather than doing our own scanning of /proc/meminfo. The step after that would, I think, be predicated on enhancements to the underlying alerts subsystem, e.g. letting you compare the relative size of two metrics.

Comment 6 John Sanda 2012-03-22 19:29:44 UTC
Looks like this feature has been in Sigar already for some time. See https://jira.hyperic.com/browse/SIGAR-188. We are collecting metrics for Native.MemoryInfo.free and Native.MemoryInfo.used, but the more representative metrics are Native.MemoryInfo.actualFree and Native.MemoryInfo.actualUsed, both of which are available in the Mem class in the version of Sigar that we currently use.

I am not sure that I entirely understand the part about comparing the relative size of the two metrics. I propose the following: we collect both sets of metrics and provide better, more accurate descriptions for them. The description for the used memory metric is, "The total used system memory". That simply is not accurate. And for the platform utilization report, I propose that we use the actualUsed metric.

As it stands right now, I don't see how anyone can reliably use the free and used memory metrics for alerting.

Comment 7 Mike Foley 2012-03-26 15:36:36 UTC
per BZ triage (crouch, loleary, asantos)

Comment 8 Charles Crouch 2012-03-26 16:02:11 UTC
If this is a small amount of work we should try to add those metrics for rhq4.4

Comment 9 John Sanda 2012-03-26 16:20:08 UTC
This is a small amount of work. I can definitely knock it out for RHQ 4.4.

Comment 10 John Sanda 2012-04-19 17:33:27 UTC
The actual free and actual used metrics have been added to the platform plugins. The descriptions for the metrics have been updated as well to reflect which metrics do and do not take into account caches and buffers. Lastly, the platform utilization report has been updated to use the new, more representative metrics for memory consumption.

master commit hash: 5420259201d92a13da1c24b752410a1c853ade46

Comment 11 Heiko W. Rupp 2013-09-01 10:11:15 UTC
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.

