Bug 805987

Summary: Platform plugin memory metrics are not representative of available memory
Product: [Other] RHQ Project
Reporter: John Sanda <jsanda>
Component: Plugins
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: medium
Version: 4.4
CC: hrupp
Target Release: RHQ 4.4.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-09-01 06:11:15 EDT
Bug Blocks: 782579, 815979
Attachments (description, flags):
platform memory metrics, none
RHQ platform utilization report, none
Gnome System Monitor app, none
memory reported by htop, none

Description John Sanda 2012-03-22 11:56:31 EDT
Description of problem:
Different tools can report different numbers for a system's memory usage without any of them being wrong; they may simply be measuring different things. When I view the memory usage of a platform in RHQ, the numbers are often deceptively high. We report free and used memory, which on Linux systems can be found in /proc/meminfo. For example, on my box,

bash-4.2$ cat /proc/meminfo
MemTotal:       16424944 kB
MemFree:         1092888 kB
Buffers:          590732 kB
Cached:          6969032 kB

According to these numbers I have about 16 GB of RAM and about 1 GB free, which puts me at about 92% memory utilization. This is what RHQ reports. According to the system monitor application, though, I am only at about 45% memory utilization. The numbers we report are not really representative of the memory that is actually free or available because they fail to take into account the Buffers and Cached values seen above. If a process needs more memory, the kernel can and will reclaim space from the cache or buffers. The system monitor app reports 45% because it factors buffers and cache into the equation.
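
As a rough check of the arithmetic, here is a small sketch that plugs the four /proc/meminfo values quoted above into both calculations. The variable names are illustrative only, and treating all of Buffers and Cached as reclaimable is a simplifying assumption. With these exact numbers the results come out at roughly 93% and 47%, in line with the rounded figures quoted in this report.

```python
# Values copied from the /proc/meminfo output above, in kB.
mem_total = 16424944
mem_free = 1092888
buffers = 590732
cached = 6969032

# Raw view: everything that is not MemFree counts as used.
raw_used = mem_total - mem_free
raw_pct = 100.0 * raw_used / mem_total

# Adjusted view: assume buffers and page cache can be reclaimed on demand,
# so they are not counted as used.
adjusted_used = raw_used - buffers - cached
adjusted_pct = 100.0 * adjusted_used / mem_total

print(f"raw used:      {raw_used} kB ({raw_pct:.1f}%)")
print(f"adjusted used: {adjusted_used} kB ({adjusted_pct:.1f}%)")
```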

Suppose my box was a production machine running some EAP servers, and I want to use RHQ to monitor overall system memory usage. If I see the 92% utilization, some panic is going to set in. What I really want to see, though, is the 45% utilization as reported by System Monitor.

Lastly, I have looked at the Sigar docs, and I don't think it exposes the cached and buffers data, but parsing /proc/meminfo to get the memory metrics would be easy enough.
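
To illustrate how little parsing is involved, here is a minimal sketch of reading /proc/meminfo. It is not RHQ or Sigar code; the function names `parse_meminfo` and `read_meminfo` are hypothetical, and the sample text reuses the values quoted earlier in this description.

```python
def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to integer values.

    Values are returned as printed in the file (kB for the fields used here).
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not in "Name: value" form
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


sample = """\
MemTotal:       16424944 kB
MemFree:         1092888 kB
Buffers:          590732 kB
Cached:          6969032 kB
"""
info = parse_meminfo(sample)
reclaimable = info["Buffers"] + info["Cached"]
# Free memory plus buffers and cache, in kB.
print(info["MemFree"] + reclaimable)
```

On a Linux host, `read_meminfo()` would return the same kind of dict for the running system.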

Comment 1 John Sanda 2012-03-22 12:01:36 EDT
Created attachment 572028
platform memory metrics

Here is a screenshot of the platform memory metrics reported by RHQ. The average free memory reported is about 1.23 GB. I have a total of 16 GB of RAM. That would mean I am at about 92% memory utilization. While these numbers are not in and of themselves wrong, they are not representative of what's really going on. If I were actually at 92% memory utilization, my machine would be near worthless for development, but fortunately it's pretty snappy :)
Comment 2 John Sanda 2012-03-22 12:03:53 EDT
Created attachment 572029
RHQ platform utilization report

Here is the platform utilization report which shows my system's overall memory usage at about 93.5%.
Comment 3 John Sanda 2012-03-22 12:06:21 EDT
Created attachment 572030
Gnome System Monitor app

Here is a screenshot of System Monitor running on my box. Note that it reports about 47% memory usage which is in stark contrast to the 92% or 93% reported by RHQ.
Comment 4 John Sanda 2012-03-22 12:09:34 EDT
Created attachment 572031
memory reported by htop

This screenshot shows memory usage reported by htop. It reports roughly 7 GB in use which works out to about 44% overall memory usage.
Comment 5 Charles Crouch 2012-03-22 14:13:53 EDT
Note: 
(12:32:29 PM) ccrouch: so we're reporting MemUsed=MemTotal-MemFree ?
(12:32:38 PM) jsanda: yeah


Interesting analysis John. I agree that nothing appears broken here, but we could be doing a better job of collecting more representative metrics.

My suggestion for a next step would be to raise an RFE on Sigar to add support for Buffers and Cached metrics. There may very well be Windows equivalents that we should be picking up too. For now I would really prefer to keep as many of our platform-specific metrics as possible going through Sigar rather than doing our own scanning of /proc/meminfo. The step after that, I think, would be predicated on enhancements to the underlying alerts subsystem, e.g. letting you compare the relative size of two metrics.
Comment 6 John Sanda 2012-03-22 15:29:44 EDT
Looks like this feature has been in Sigar already for some time. See https://jira.hyperic.com/browse/SIGAR-188. We are collecting metrics for Native.MemoryInfo.free and Native.MemoryInfo.used, but the more representative metrics are Native.MemoryInfo.actualFree and Native.MemoryInfo.actualUsed, both of which are available in the Mem class in the version of Sigar that we currently use.

I am not sure that I entirely understand the part about comparing the relative size of the two metrics. I propose the following: we collect both sets of metrics and provide better, more accurate descriptions for them. The description for the used memory metric is "The total used system memory". That simply is not accurate. And for the platform utilization report, I propose that we use the actualUsed metric.

As it stands right now, I don't see how anyone can reliably use the free and used memory metrics for alerting.
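
To make the alerting concern concrete, here is a hedged sketch of a simple threshold check applied to both views of used memory. The 90% threshold and the `should_alert` helper are hypothetical and are not part of RHQ's alert subsystem. The adjusted figure is computed as used minus Buffers and Cached from the numbers in the description, which is assumed here to approximate what actualUsed reports.

```python
def utilization(used_kb, total_kb):
    """Fraction of total memory counted as used."""
    return used_kb / total_kb


def should_alert(used_kb, total_kb, threshold=0.90):
    """Hypothetical alert condition: fire when utilization exceeds threshold."""
    return utilization(used_kb, total_kb) > threshold


total_kb = 16424944
used_kb = total_kb - 1092888                   # MemTotal - MemFree
actual_used_kb = used_kb - (590732 + 6969032)  # minus Buffers and Cached

print(should_alert(used_kb, total_kb))         # raw used metric
print(should_alert(actual_used_kb, total_kb))  # buffer/cache-adjusted metric
```

With the raw metric the check fires on a machine that is in fact healthy, while the adjusted metric stays below the threshold.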
Comment 7 Mike Foley 2012-03-26 11:36:36 EDT
per BZ triage (crouch, loleary, asantos)
Comment 8 Charles Crouch 2012-03-26 12:02:11 EDT
If this is a small amount of work, we should try to add those metrics for RHQ 4.4.
Comment 9 John Sanda 2012-03-26 12:20:08 EDT
This is a small amount of work. I can definitely knock it out for RHQ 4.4.
Comment 10 John Sanda 2012-04-19 13:33:27 EDT
The actual free and actual used metrics have been added to the platform plugins. The descriptions for the metrics have been updated as well to reflect which metrics do and do not take caches and buffers into account. Lastly, the platform utilization report has been updated to use the new, more representative metrics for memory consumption.

master commit hash: 5420259201d92a13da1c24b752410a1c853ade46
Comment 11 Heiko W. Rupp 2013-09-01 06:11:15 EDT
Bulk closing of items that are on_qa and in old RHQ releases that have been out for a long time, where the issue has not been re-opened since.