Red Hat Bugzilla – Bug 805987
Platform plugin memory metrics are not representative of available memory
Last modified: 2013-09-01 06:11:15 EDT
Description of problem:
Different tools might report different numbers for a system's memory usage without any of them being wrong; they may simply be reporting different things. When I view the memory usage of a platform in RHQ, the numbers are often deceptively high. We report the free and used memory, which on Linux systems can be found in /proc/meminfo. For example, on my box,
bash-4.2$ cat /proc/meminfo
MemTotal: 16424944 kB
MemFree: 1092888 kB
Buffers: 590732 kB
Cached: 6969032 kB
According to these numbers I have about 16 GB of RAM and about 1 GB free, which puts me at about 92% memory utilization. This is what RHQ reports. According to the system monitor application, though, I am only at about 45% memory utilization. The numbers we report are not really representative of the memory that is actually free or available because they fail to take into account the Buffers and Cached values seen above. If a process needs more memory, the kernel can and will reclaim space from the cache or buffers. The system monitor app reports 45% because it counts buffers and cache as available memory.
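To make the arithmetic concrete, here is a minimal sketch that computes both figures from the four values quoted above. The function and variable names are made up for illustration, and treating MemFree + Buffers + Cached as the available memory is a simplifying assumption; it is not necessarily the exact formula that System Monitor, htop, or Sigar uses.

```python
# Values in kB, copied from the /proc/meminfo excerpt above.
mem_total = 16424944
mem_free = 1092888
buffers = 590732
cached = 6969032


def percent_used(total_kb, available_kb):
    """Return memory utilization as a percentage of total memory."""
    return 100.0 * (total_kb - available_kb) / total_kb


# What RHQ currently reports: only MemFree counts as available.
naive_pct = percent_used(mem_total, mem_free)

# Assumption: buffers and page cache are reclaimable, so count them as available.
adjusted_pct = percent_used(mem_total, mem_free + buffers + cached)

print(f"MemTotal - MemFree:                    {naive_pct:.1f}% used")
print(f"MemTotal - MemFree - Buffers - Cached: {adjusted_pct:.1f}% used")
```

For this snapshot the first figure comes out at about 93% and the second at about 47%, which lines up with the gap between the RHQ numbers and the ones from System Monitor.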
Suppose my box were a production machine running some EAP servers, and I wanted to use RHQ to monitor overall system memory usage. If I see the 92% utilization, some panic is going to set in. What I really want to see, though, is the 45% utilization as reported by System Monitor.
Lastly, I have looked at the Sigar docs, and I don't think Sigar exposes the cached and buffers data, but parsing /proc/meminfo to get the memory metrics would be easy enough.
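As a rough idea of what parsing /proc/meminfo could look like, here is a small sketch. It is not RHQ or Sigar code; `parse_meminfo`, `memory_summary`, and the result keys are hypothetical names, and subtracting Buffers and Cached from used memory is an assumption about what "actually used" should mean.

```python
SAMPLE = """\
MemTotal: 16424944 kB
MemFree: 1092888 kB
Buffers: 590732 kB
Cached: 6969032 kB
"""


def parse_meminfo(text):
    """Parse 'Key: value kB' lines into a dict mapping key to integer kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(values):
    """Derive used memory with and without buffers/cache (assumed reclaimable)."""
    total = values["MemTotal"]
    free = values["MemFree"]
    reclaimable = values.get("Buffers", 0) + values.get("Cached", 0)
    return {
        "used_kb": total - free,
        "actual_used_kb": total - free - reclaimable,
        "actual_free_kb": free + reclaimable,
    }


summary = memory_summary(parse_meminfo(SAMPLE))
print(summary)
```

On a live Linux box the sample text would be replaced with the contents of /proc/meminfo, for example `open("/proc/meminfo").read()`.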
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Created attachment 572028
platform memory metrics
Here is a screenshot of the platform memory metrics reported by RHQ. The average free memory reported is about 1.23 GB. I have a total of 16 GB of RAM. That would mean I am at about 92% memory utilization. While these numbers are not in and of themselves wrong, they are not representative of what's really going on. If I were actually at 92% memory utilization, my machine would be near worthless for development, but fortunately it's pretty snappy :)
Created attachment 572029
RHQ platform utilization report
Here is the platform utilization report which shows my system's overall memory usage at about 93.5%.
Created attachment 572030
Gnome System Monitor app
Here is a screenshot of System Monitor running on my box. Note that it reports about 47% memory usage, which is in stark contrast to the 92% or 93% reported by RHQ.
Created attachment 572031
memory reported by htop
This screenshot shows memory usage reported by htop. It reports roughly 7 GB in use, which works out to about 44% overall memory usage.
(12:32:29 PM) ccrouch: so we're reporting MemUsed=MemTotal-MemFree ?
(12:32:38 PM) jsanda: yeah
Interesting analysis, John. I agree that nothing appears broken here, but we could be doing a better job of collecting more representative metrics.
My suggestion for a next step would be to raise an RFE on Sigar to add support for Buffers and Cached metrics. There may very well be Windows equivalents that we should be picking up too. For now I would really prefer to keep as many of our platform-specific metrics as possible going through Sigar rather than doing our own scanning of /proc/meminfo. The step after that, I think, would be predicated on enhancements to the underlying alerts subsystem, e.g. letting you compare the relative size of two metrics.
Looks like this feature has been in Sigar for some time already. See https://jira.hyperic.com/browse/SIGAR-188. We are collecting metrics for Native.MemoryInfo.free and Native.MemoryInfo.used, but the more representative metrics are Native.MemoryInfo.actualFree and Native.MemoryInfo.actualUsed, both of which are available in the Mem class in the version of Sigar that we currently use.
I am not sure that I entirely understand the part about comparing the relative size of the two metrics. I propose the following: we collect both sets of metrics and provide better, more accurate descriptions for them. The description for the used memory metric is "The total used system memory". That simply is not accurate. And for the platform utilization report, I propose that we use the actualUsed metric.
As it stands right now, I don't see how anyone can reliably use the free and used memory metrics for alerting.
per BZ triage (crouch, loleary, asantos)
If this is a small amount of work, we should try to add those metrics for RHQ 4.4.
This is a small amount of work. I can definitely knock it out for RHQ 4.4.
The actual free and actual used metrics have been added to the platform plugins. The descriptions for the metrics have been updated as well to reflect which metrics do and do not take caches and buffers into account. Lastly, the platform utilization report has been updated to use the new, more representative metrics for memory consumption.
master commit hash: 5420259201d92a13da1c24b752410a1c853ade46
Bulk closing of items that are ON_QA in old RHQ releases that have been out for a long time, where the issue has not been reopened since.