Description of problem:
The code to collect metrics on memory utilization looks for "used" and "total" but does not consider "available". This results in the eventual climb to near 100% utilization in the graph/report/alert functions, when the system is perfectly healthy and simply has a high amount of cached pages.

Version-Release number of selected component (if applicable):
CFME 4.2 / OSP 9 (ceilometer)

How reproducible:
Always

Steps to Reproduce:
1. Enable metrics gathering
2. Observe utilization statistics
3. Compare with actual usage on the system

Actual results:
High memory utilization reports/alerts

Expected results:
Actual usage, which accounts for available memory.

Additional info:
Located in this file:
app/models/manageiq/providers/openstack/infra_manager/metrics_capture.rb

What CloudForms collects:
CPU_METERS = %w(hardware.cpu.util)
MEMORY_METERS = %w(hardware.memory.used hardware.memory.total)
SWAP_METERS = %w(hardware.memory.swap.avail hardware.memory.swap.total)
DISK_METERS = %w(hardware.system_stats.io.outgoing.blocks hardware.system_stats.io.incoming.blocks)
NETWORK_METERS = %w(hardware.network.ip.incoming.datagrams

What is actually used on the node:
[heat-admin@dh-rhosp-controller-1 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           125G         67G        9.4G         60M         48G         56G
Swap:            0B          0B          0B

Node sees 118/128 GB used
In the case of the OSP9 provider, I see there are SNMP-based metrics being gathered, which CFME could potentially reference to do the math and report on available memory.

https://docs.openstack.org/admin-guide/telemetry-measurements.html

hardware.memory.total    Gauge  KB  host ID  Pollster  Total physical memory size
hardware.memory.used     Gauge  KB  host ID  Pollster  Used physical memory size
hardware.memory.buffer   Gauge  KB  host ID  Pollster  Physical memory buffer size
hardware.memory.cached   Gauge  KB  host ID  Pollster  Cached physical memory size
Looking at something like this, as an example of what I had in mind.

# diff metrics_capture.rb-bkup-2017-02-22 metrics_capture.rb
4c4,5
< hardware.memory.total)
---
> hardware.memory.total
> hardware.memory.cached)
24c25
< stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * stats['hardware.memory.used'] : 0
---
> stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * (stats['hardware.memory.used'] - stats['hardware.memory.cached']) : 0
Ladislav, any thoughts on the appropriate way to express what we're looking for from the available metrics?
I see, the 'used' SNMP metric does include buffers and cache, even though these can be considered free on Linux machines.

@Mainn, Alex is providing code snippets from:
https://github.com/Ladas/manageiq/blob/ac0c964897481ab42cabc947f2c2dcb803da2d35/app/models/manageiq/providers/openstack/infra_manager/metrics_capture.rb#L3-L3
and
https://github.com/Ladas/manageiq/blob/ac0c964897481ab42cabc947f2c2dcb803da2d35/app/models/manageiq/providers/openstack/infra_manager/metrics_capture.rb#L24

@Alex it seems like hardware.memory.buffer can also be considered free? So it should be:

stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * (stats['hardware.memory.used'] - stats['hardware.memory.cached'] - stats['hardware.memory.buffer']) : 0

right?
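For illustration, the calculation proposed above could be sketched as a standalone Ruby method. This is only a sketch under assumptions: the method name memory_util_percent and the subtract_buffer option are hypothetical and are not part of metrics_capture.rb, and stats is assumed to be a plain Hash keyed by meter name with all values in the same unit. Whether hardware.memory.buffer should be subtracted is still an open question at this point, so it is left as a switch.

```ruby
# Hypothetical helper for discussion; not the actual ManageIQ implementation.
# stats: Hash of meter name => numeric value (assumed to share one unit, e.g. KB).
def memory_util_percent(stats, subtract_buffer: true)
  total = stats.fetch('hardware.memory.total', 0).to_f
  return 0.0 unless total > 0 # avoid dividing by zero when total is missing

  # Memory the kernel can hand back on demand, so it should not count as "used".
  reclaimable = stats.fetch('hardware.memory.cached', 0).to_f
  reclaimable += stats.fetch('hardware.memory.buffer', 0).to_f if subtract_buffer

  effective_used = stats.fetch('hardware.memory.used', 0).to_f - reclaimable
  effective_used = 0.0 if effective_used < 0 # guard against inconsistent samples

  100.0 * effective_used / total
end

# Made-up sample values, chosen only to keep the arithmetic easy to follow.
sample = {
  'hardware.memory.total'  => 1000,
  'hardware.memory.used'   => 900,
  'hardware.memory.cached' => 300,
  'hardware.memory.buffer' => 100
}

puts memory_util_percent(sample)
puts memory_util_percent(sample, subtract_buffer: false)
```

With the made-up sample, subtracting both cached and buffer gives 50%, while subtracting only cached gives 60%; the unadjusted formula would report 90%.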
My example was purely meant to illustrate my point. I'm not actually sure which values are being collected under those names. It was my assumption that the maintainer would see my point and determine which values to use. When I saw "hardware.memory.buffer" I assumed that value would be the total amount of RAM installed. If it is actually another type of cache, I wouldn't know offhand whether that particular chunk of memory is handled the same way that system cache is, i.e. whether it is always "available" for use. If the buffer is memory space that is used by applications, it is *not* immediately "available", so it would be incorrect to remove that value from the total used.
You can check the actual SNMP OIDs here:
https://github.com/openstack/ceilometer/blob/ffc9ee99c10ede988769907fdb0594a512c890cd/ceilometer/hardware/pollsters/data/snmp.yaml#L76
https://github.com/openstack/ceilometer/blob/ffc9ee99c10ede988769907fdb0594a512c890cd/ceilometer/hardware/pollsters/data/snmp.yaml#L101
https://github.com/openstack/ceilometer/blob/ffc9ee99c10ede988769907fdb0594a512c890cd/ceilometer/hardware/pollsters/data/snmp.yaml#L109

Now, in the free man page (http://man7.org/linux/man-pages/man1/free.1.html), used is defined as:
used == Used memory (calculated as total - free - buffers - cache)

I would assume that 1.3.6.1.4.1.2021.4.14.0 is the <Memory used by kernel buffers (Buffers in /proc/meminfo)>, but I can't find it in the SNMP docs, so I'm not 100% sure.

@Mainn can you investigate and then change the computation accordingly?
https://review.openstack.org/#/c/157257/ describes how used memory is calculated:

hardware.memory.used = total memory - total avail (free) memory

So, as Alex observed, it includes cache memory. We will need to adjust how we calculate memory used in CloudForms.
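As a rough check, the free -h figures from the original report can be run through that definition. This is a sketch only: the variable names are illustrative, it assumes the meter really follows the total minus free definition quoted above, and it treats the buff/cache column from free as a stand-in for cached plus buffer.

```ruby
# Figures copied from the free -h output in the report (units as printed by free).
total_g      = 125.0
free_g       = 9.4
buff_cache_g = 48.0

# Assumed meter definition: used = total - free, so cache and buffers are included.
snmp_used_g = total_g - free_g

raw_pct      = (100.0 * snmp_used_g / total_g).round(1)
adjusted_pct = (100.0 * (snmp_used_g - buff_cache_g) / total_g).round(1)

puts "unadjusted utilization: #{raw_pct}%"
puts "minus buff/cache:       #{adjusted_pct}%"
```

Under these assumptions the unadjusted figure is about 92.5%, while removing buff/cache brings it to about 54.1%, which is close to the 67G that free itself reports as used.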
Fix posted for review: https://github.com/ManageIQ/manageiq/pull/14470
PR: https://github.com/ManageIQ/manageiq-providers-openstack/pull/17
Verified
========
5.9.0.22