Description of problem:
The code to collect metrics on memory utilization looks for "used" and "total" but does not consider "available". This results in the eventual climb to near 100% utilization in the graph/report/alert functions, when the system is perfectly healthy and simply has a high amount of cached pages.
Version-Release number of selected component (if applicable):
CFME 4.2 / OSP 9 (ceilometer)
Steps to Reproduce:
1. Enable metrics gathering
2. Observe utilization statistics
3. Compare with actual usage on the system
Actual results:
High memory utilization reports/alerts
Expected results:
Actual usage, which accounts for available memory.
Located in this file:
What CloudForms collects:
CPU_METERS = %w(hardware.cpu.util)
MEMORY_METERS = %w(hardware.memory.used
SWAP_METERS = %w(hardware.memory.swap.avail
DISK_METERS = %w(hardware.system_stats.io.outgoing.blocks
NETWORK_METERS = %w(hardware.network.ip.incoming.datagrams
What is actually used on the node:
[heat-admin@dh-rhosp-controller-1 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           125G         67G        9.4G         60M         48G         56G
Swap:            0B          0B          0B
Node sees 118/128 GB used
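For reference, the gap between the two views can be reproduced from the free output above. This is only a sketch of the arithmetic: the variable names are mine, the inputs are the rounded figures printed by free -h, and treating the collected "used" value as total minus free is an assumption about how the meter is derived.

```ruby
# Rounded values from the `free -h` output above, in GiB.
total     = 125.0
free      = 9.4
available = 56.0

# Assumption: the collected "used" figure counts everything that is not free,
# so page cache and buffers are included.
used_including_cache = total - free

# Usage based on what the kernel reports as available to applications.
used_excluding_available = total - available

naive_percent    = (100.0 * used_including_cache / total).round(1)
adjusted_percent = (100.0 * used_excluding_available / total).round(1)

puts "used incl. cache:  #{naive_percent}%"
puts "total - available: #{adjusted_percent}%"
```

With these inputs the first figure lands above 90% and the second near 55%, which is the discrepancy this report is about.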
In the case of the OSP 9 provider, I see there are SNMP-based metrics being gathered, which CFME could potentially reference to do the math and report on available memory.
Name                    Type   Unit  Resource  Origin    Note
hardware.memory.total   Gauge  KB    host ID   Pollster  Total physical memory size
hardware.memory.used    Gauge  KB    host ID   Pollster  Used physical memory size
hardware.memory.buffer  Gauge  KB    host ID   Pollster  Physical memory buffer size
hardware.memory.cached  Gauge  KB    host ID   Pollster  Cached physical memory size
Looking at something like this, as an example of what I had in mind.
# diff metrics_capture.rb-bkup-2017-02-22 metrics_capture.rb
< stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * stats['hardware.memory.used'] : 0
> stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * (stats['hardware.memory.used'] - stats['hardware.memory.cached']) : 0
Ladislav, any thoughts on the appropriate way to express what we're looking for from the available metrics?
I see, the 'used' SNMP metric does include buffers and cache, which can be considered free on Linux machines.
@Mainn, Alex is providing code snippets from:
@Alex, it seems hardware.memory.buffer can also be considered free? So it should be
stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * (stats['hardware.memory.used'] - stats['hardware.memory.cached'] - stats['hardware.memory.buffer']) : 0
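A slightly more defensive version of that expression, as a sketch only. The method name memory_util_percent and the sample values are hypothetical and not part of metrics_capture.rb; the stats keys are the meter names listed earlier. Whether the buffer value should be subtracted is still an open question in this thread, so it is optional here.

```ruby
# Hypothetical helper, not the actual ManageIQ code.
# stats is a Hash keyed by Ceilometer meter name, values in KB.
def memory_util_percent(stats, subtract_buffer: true)
  total = stats['hardware.memory.total'].to_f
  return 0.0 unless total > 0

  used = stats['hardware.memory.used'].to_f
  # Missing meters become 0.0 via nil.to_f, so they simply subtract nothing.
  reclaimable = stats['hardware.memory.cached'].to_f
  reclaimable += stats['hardware.memory.buffer'].to_f if subtract_buffer

  # Clamp so an inconsistent sample cannot yield a value outside 0..100.
  in_use = (used - reclaimable).clamp(0.0, total)
  100.0 * in_use / total
end

# Made-up sample values, chosen only to keep the arithmetic easy to follow.
sample = {
  'hardware.memory.total'  => 1000,
  'hardware.memory.used'   => 900,
  'hardware.memory.cached' => 400,
  'hardware.memory.buffer' => 100
}
puts memory_util_percent(sample)                         # 40.0
puts memory_util_percent(sample, subtract_buffer: false) # 50.0
```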
My example was purely meant to illustrate my point. I'm not actually sure which values are being collected under those names. It was my assumption that the maintainer would see my point and determine which values to use.
When I saw "hardware.memory.buffer" I assumed that value would be the total amount of RAM installed. If it is actually another type of cache, I wouldn't know offhand whether that particular chunk of memory is handled the same way that system cache is (i.e. cache is always "available" for use). If the buffer is memory space that is used by applications, it is *not* immediately "available", so it would be incorrect to subtract that value from the total used.
You can check the actual SNMP OIDs here:
Now, in the free man page (http://man7.org/linux/man-pages/man1/free.1.html),
used is defined as: used == Used memory (calculated as total - free - buffers - cache)
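As a sanity check, that definition can be applied to the free -h output from the original report. Sketch only: the variable names are mine and the inputs are the rounded figures that free printed, so the result only approximately matches the "used" column.

```ruby
# Rounded GiB values from the free -h output earlier in this report.
total      = 125.0
free       = 9.4
buff_cache = 48.0 # free -h prints buffers and cache as a single column

# used = total - free - buffers - cache, per the man page definition quoted above.
used_per_man_page = total - free - buff_cache

puts used_per_man_page.round(1) # close to the 67G shown in the "used" column
```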
Now I would assume that 1.3.6.1.4.1.2021.4.14.0 is the <Memory used by kernel buffers (Buffers in /proc/meminfo)>, but I can't find it in the SNMP docs, so I'm not 100% sure.
@Mainn can you investigate and then change the computation accordingly?
https://review.openstack.org/#/c/157257/ describes how used memory is calculated.
hardware.memory.used = total memory - total avail (free) memory
So as Alex observed, it would include cache memory. We will need to adjust how we calculate memory used in CloudForms.
Fix posted for review: https://github.com/ManageIQ/manageiq/pull/14470