Bug 1425951

Summary: Memory utilization metrics fail to account for system cache
Product: Red Hat CloudForms Management Engine Reporter: Alex Mayberry <amayberr>
Component: C&U Capacity and Utilization Assignee: Richard Su <rwsu>
Status: CLOSED CURRENTRELEASE QA Contact: Ido Ovadia <iovadia>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.7.0 CC: amayberr, brant.evans, dscott, ikaur, iovadia, jhajyahy, jhardy, lsmola, maufart, obarenbo, rwsu, simaishi, tzumainn
Target Milestone: GA Keywords: TestOnly
Target Release: 5.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: c&u:openstack
Fixed In Version: 5.9.0.1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1444174 (view as bug list) Environment:
Last Closed: 2018-03-06 15:17:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Openstack Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1444174    

Description Alex Mayberry 2017-02-22 20:43:33 UTC
Description of problem:
The code that collects memory utilization metrics looks at "used" and "total" but does not consider "available".  As a result, reported utilization eventually climbs to nearly 100% in the graph/report/alert functions, even though the system is perfectly healthy and simply has a large amount of cached pages.

Version-Release number of selected component (if applicable):
CFME 4.2 / OSP 9  (ceilometer)

How reproducible:
Always

Steps to Reproduce:
1. Enable metrics gathering
2. Observe utilization statistics
3. Compare with actual usage on the system

Actual results:
High memory utilization reports/alerts

Expected results:
Actual usage, which accounts for available memory.

Additional info:

Located in this file:

app/models/manageiq/providers/openstack/infra_manager/metrics_capture.rb

What CloudForms collects:



  CPU_METERS     = %w(hardware.cpu.util)
  MEMORY_METERS  = %w(hardware.memory.used
                      hardware.memory.total)
  SWAP_METERS    = %w(hardware.memory.swap.avail
                      hardware.memory.swap.total)
  DISK_METERS    = %w(hardware.system_stats.io.outgoing.blocks
                      hardware.system_stats.io.incoming.blocks)
  NETWORK_METERS = %w(hardware.network.ip.incoming.datagrams

What is actually used on the node:
[heat-admin@dh-rhosp-controller-1 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           125G         67G        9.4G         60M         48G         56G
Swap:            0B          0B          0B

Node sees 118/128 GB used
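
For illustration, the gap between the two readings can be worked out from the `free -h` figures above. This is a minimal sketch rather than CloudForms code: the method names are hypothetical, and the inputs are the rounded GiB values from that output.

```ruby
# Rounded values (in GiB) copied from the `free -h` output above.
TOTAL_GB     = 125.0
FREE_GB      = 9.4
AVAILABLE_GB = 56.0

# Hypothetical helper: utilization when everything that is not "free"
# counts as used (buffers and page cache included).
def percent_used_counting_cache(total, free)
  return 0.0 unless total.positive?
  (total - free) / total * 100.0
end

# Hypothetical helper: utilization based on the "available" column,
# which treats reclaimable cache as usable memory.
def percent_used_from_available(total, available)
  return 0.0 unless total.positive?
  (total - available) / total * 100.0
end

puts percent_used_counting_cache(TOTAL_GB, FREE_GB).round(1)      # 92.5
puts percent_used_from_available(TOTAL_GB, AVAILABLE_GB).round(1) # 55.2
```

The first figure is the kind of near-100% reading described in this report; the second is closer to what the node is actually using.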

Comment 2 Alex Mayberry 2017-02-22 22:08:20 UTC
In the case of the OSP9 provider, I see there are SNMP-based metrics being gathered that CFME could potentially reference to do the math and report on available memory.

https://docs.openstack.org/admin-guide/telemetry-measurements.html

hardware.memory.total	Gauge	KB	host ID	Pollster	Total physical memory size
hardware.memory.used	Gauge	KB	host ID	Pollster	Used physical memory size
hardware.memory.buffer	Gauge	KB	host ID	Pollster	Physical memory buffer size
hardware.memory.cached	Gauge	KB	host ID	Pollster	Cached physical memory size

Comment 3 Alex Mayberry 2017-02-23 16:40:41 UTC
I'm looking at something like this, as an example of what I had in mind:

# diff  metrics_capture.rb-bkup-2017-02-22 metrics_capture.rb
4c4,5
<                       hardware.memory.total)
---
>                       hardware.memory.total
>                       hardware.memory.cached)
24c25
<     stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * stats['hardware.memory.used'] : 0
---
>     stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * (stats['hardware.memory.used'] - stats['hardware.memory.cached']) : 0

Comment 4 Tzu-Mainn Chen 2017-03-06 17:55:42 UTC
Ladislav, any thoughts on the appropriate way to express what we're looking for from the available metrics?

Comment 5 Ladislav Smola 2017-03-07 09:46:19 UTC
I see, the 'used' SNMP metric does indeed include buffers and cache, while these can be considered free on Linux machines.

@Mainn, Alex is providing code snippets from:

https://github.com/Ladas/manageiq/blob/ac0c964897481ab42cabc947f2c2dcb803da2d35/app/models/manageiq/providers/openstack/infra_manager/metrics_capture.rb#L3-L3

and

https://github.com/Ladas/manageiq/blob/ac0c964897481ab42cabc947f2c2dcb803da2d35/app/models/manageiq/providers/openstack/infra_manager/metrics_capture.rb#L24


@Alex, it seems hardware.memory.buffer can also be considered free? So it should be

stats['hardware.memory.total'] > 0 ? 100.0 / stats['hardware.memory.total'] * (stats['hardware.memory.used'] - stats['hardware.memory.cached'] - stats['hardware.memory.buffer']) : 0

right?
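
A sketch of the calculation proposed above, with guards for meters that may be missing from a sample. The method name, the `subtract_buffer` option, and the sample hash are hypothetical; only the meter names come from this report. Whether `hardware.memory.buffer` should be subtracted is still an open question in this thread, which is why it is a switch here.

```ruby
# Hypothetical helper, not the ManageIQ implementation.
# `stats` maps Ceilometer meter names to values in KB.
def memory_percent_used(stats, subtract_buffer: true)
  total = stats['hardware.memory.total'].to_f
  return 0.0 unless total.positive?

  # nil.to_f is 0.0, so a missing meter simply contributes nothing.
  used = stats['hardware.memory.used'].to_f
  used -= stats['hardware.memory.cached'].to_f
  used -= stats['hardware.memory.buffer'].to_f if subtract_buffer

  # Clamp so an inconsistent sample cannot yield a negative or >100% value.
  (100.0 / total * used).clamp(0.0, 100.0)
end

# Made-up sample values in KB, for illustration only.
sample = {
  'hardware.memory.total'  => 1000,
  'hardware.memory.used'   => 900,
  'hardware.memory.cached' => 300,
  'hardware.memory.buffer' => 100
}

puts memory_percent_used(sample)                         # cache and buffer subtracted
puts memory_percent_used(sample, subtract_buffer: false) # only cache subtracted
```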

Comment 6 Alex Mayberry 2017-03-07 15:08:05 UTC
My example was purely meant to illustrate my point.  I'm not actually sure which values are being collected under those names.  It was my assumption that the maintainer would see my point and determine which values to use.

When I saw "hardware.memory.buffer" I assumed that value would be the total amount of RAM installed.  If it is actually another type of cache, I wouldn't know offhand whether that particular chunk of memory is handled the same way that system cache is, i.e. always "available" for use.  If the buffer is memory space that is used by applications, it is *not* immediately "available", so it would be incorrect to remove that value from the total used.

Comment 7 Ladislav Smola 2017-03-07 15:51:29 UTC
You can check the actual SNMP OIDs here:

https://github.com/openstack/ceilometer/blob/ffc9ee99c10ede988769907fdb0594a512c890cd/ceilometer/hardware/pollsters/data/snmp.yaml#L76

https://github.com/openstack/ceilometer/blob/ffc9ee99c10ede988769907fdb0594a512c890cd/ceilometer/hardware/pollsters/data/snmp.yaml#L101

https://github.com/openstack/ceilometer/blob/ffc9ee99c10ede988769907fdb0594a512c890cd/ceilometer/hardware/pollsters/data/snmp.yaml#L109


Now, in the free man page (http://man7.org/linux/man-pages/man1/free.1.html),

used is defined as: Used memory (calculated as total - free - buffers - cache)

I would assume that 1.3.6.1.4.1.2021.4.14.0 is the <Memory used by kernel buffers (Buffers in /proc/meminfo)>, but I can't find it in the SNMP docs, so I'm not 100% sure.
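
To make the man-page definition concrete, here is a small sketch that applies `used = total - free - buffers - cache` to `/proc/meminfo`-style text. The sample text and its numbers are invented for illustration, the method names are hypothetical, and "cache" is taken here as just the `Cached` field, which is a simplification.

```ruby
# Invented sample in /proc/meminfo format (values in kB), for illustration only.
SAMPLE_MEMINFO = <<~MEMINFO
  MemTotal:       1000000 kB
  MemFree:         100000 kB
  Buffers:          50000 kB
  Cached:          400000 kB
MEMINFO

# Parse "Key:   value kB" lines into a hash of integer kB values.
def parse_meminfo(text)
  text.each_line.with_object({}) do |line, fields|
    match = line.match(/\A(\w+):\s+(\d+)/)
    fields[match[1]] = match[2].to_i if match
  end
end

# "used" as defined in the free(1) man page quoted above.
def used_kb(fields)
  fields['MemTotal'] - fields['MemFree'] - fields['Buffers'] - fields['Cached']
end

fields = parse_meminfo(SAMPLE_MEMINFO)
puts used_kb(fields)                        # 450000
puts fields['MemTotal'] - fields['MemFree'] # 900000 (buffers and cache counted as used)
```

With these invented numbers, counting buffers and cache as used doubles the apparent usage, which is the effect described in this bug.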

@Mainn can you investigate and then change the computation accordingly?

Comment 8 Richard Su 2017-03-13 22:56:50 UTC
https://review.openstack.org/#/c/157257/ describes how used memory is calculated.

hardware.memory.used = total memory - total avail (free) memory

So as Alex observed, it would include cache memory. We will need to adjust how we calculate memory used in CloudForms.

Comment 9 Richard Su 2017-03-27 21:44:16 UTC
Fix posted for review: https://github.com/ManageIQ/manageiq/pull/14470

Comment 13 Ido Ovadia 2018-02-26 17:05:52 UTC
Verified
========
5.9.0.22