Bug 1469243

Summary: [RFE] C&U rollups / NOR / Right-Size values don't accurately reflect realtime data
Product: Red Hat CloudForms Management Engine
Reporter: Tasos Papaioannou <tpapaioa>
Component: C&U Capacity and Utilization
Assignee: Gregg Tanzillo <gtanzill>
Status: CLOSED WONTFIX
QA Contact: Tasos Papaioannou <tpapaioa>
Severity: medium
Docs Contact:
Priority: high
Version: 5.8.0
CC: bsorota, dajohnso, gtanzill, jhardy, lavenel, obarenbo, yrudman
Target Milestone: GA
Keywords: FutureFeature
Target Release: cfme-future
Hardware: All
OS: All
Whiteboard: c&u:NOR
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-18 02:04:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Realtime memory usage histogram (flags: none)

Description Tasos Papaioannou 2017-07-10 17:23:35 UTC
Created attachment 1295899
Realtime memory usage histogram

Description of problem:

Hourly and daily rollups store the arithmetic mean of memory usage (mem_usage_absolute_average) and CPU usage (cpu_usagemhz_rate_average and cpu_usage_rate_average). The Normal Operating Range (NOR) high and low values are calculated as the mean of the daily rollup averages plus or minus their sample standard deviation.

These values aren't accurate measures of the hourly or daily usage, especially when the data are not symmetrically distributed. For example, the attached metrics-realtime-hist-20170703-20170710.png shows the distribution of a week's worth of realtime memory usage captured for a VM. The distribution is bounded below by the minimum value of 0 and has a long tail towards high values, so the mean (1.67) is pulled well above the median and mode (both 0.99).

The daily rollups have the following distribution for mem_usage_absolute_average:

daily avg min = 1.55
daily avg max = 1.72
daily avg avg = 1.65
daily avg stddev_samp = 0.05
low  = avg - stddev_samp = 1.60
high = avg + stddev_samp = 1.70
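
The low/high arithmetic above can be reproduced with a short sketch. The function name nor_band is hypothetical and is not taken from the CloudForms code base; the sketch only assumes what this report states, namely that the band is the mean of the daily rollup averages plus or minus their sample standard deviation.

```python
import statistics


def nor_band(daily_averages):
    """Return (low, avg, high) as mean -/+ sample standard deviation.

    Hypothetical helper for illustration; not the product's implementation.
    """
    avg = statistics.mean(daily_averages)
    # statistics.stdev is the sample (n - 1) standard deviation,
    # i.e. the same quantity as SQL stddev_samp.
    spread = statistics.stdev(daily_averages)
    return avg - spread, avg, avg + spread


# Rounded figures reported above: avg = 1.65, stddev_samp = 0.05.
avg, spread = 1.65, 0.05
print(f"low  = {avg - spread:.2f}")
print(f"high = {avg + spread:.2f}")
```

Because the inputs are themselves daily means, the resulting band is narrow no matter how widely the underlying realtime samples vary.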

Compare these values to the percentiles calculated below from the realtime values:

  mean median    min    10%    20%    30%    40%    50%    60%    70%    80%    90%    max
  1.67   0.99      0      0   0.99   0.99   0.99   0.99   1.99   1.99   2.99   2.99  14.99

median = 0.99
60th percentile = 1.99
70th percentile = 1.99
80th percentile = 2.99

Both the calculated 'low' value (1.60) and the 'high' value (1.70) fall between the 50th percentile (0.99) and the 60th percentile (1.99) of actual realtime memory usage. A 'conservative' right-size recommendation based on the 'high' value would therefore be quite aggressive: realtime usage exceeds that value more than 40% of the time, so the available memory would fall below the actual memory requirement that often. Similarly skewed estimates can be seen in the calculations for CPU usage.
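
As a rough check of that comparison, the sketch below counts how often a set of samples exceeds a threshold. The helper fraction_above is hypothetical, and the input is only the eleven min/decile/max points from the percentile table above, used as a coarse stand-in for the full realtime sample, which is not reproduced in this report.

```python
def fraction_above(samples, threshold):
    """Fraction of samples strictly greater than threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for s in samples if s > threshold) / len(samples)


# min, 10% ... 90%, max from the percentile table above. This is a coarse
# stand-in for the realtime data, not the actual attachment contents.
percentile_points = [0, 0, 0.99, 0.99, 0.99, 0.99, 1.99, 1.99, 2.99, 2.99, 14.99]

high = 1.70  # NOR 'high' value from the daily rollups
print(f"share of points above high: {fraction_above(percentile_points, high):.0%}")
```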

Instead of the mean and standard deviation, something like the median (50th percentile) together with high and low percentile values (the 85th and 15th percentiles, for example) would be more representative of the actual usage.
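
A minimal sketch of such a percentile-based band follows. The function names and the nearest-rank method are assumptions made for illustration; this report does not prescribe a particular percentile definition, and interpolating methods would give slightly different values.

```python
def percentile(samples, p):
    """Nearest-rank percentile of samples for an integer p in 0..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(samples)
    # ceil(p * n / 100) in integer arithmetic, clamped to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def percentile_band(samples, low_p=15, high_p=85):
    """Return (low, median, high) taken directly from the realtime samples."""
    return (
        percentile(samples, low_p),
        percentile(samples, 50),
        percentile(samples, high_p),
    )


# Usage with placeholder data (the integers 1..100, not real C&U metrics).
low, median, high = percentile_band(list(range(1, 101)))
print(low, median, high)
```

Unlike mean plus or minus standard deviation over daily averages, this band is computed from the realtime samples themselves, so a long tail of high values shifts the 'high' bound rather than being averaged away.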

Version-Release number of selected component (if applicable):

5.8.1.0.

How reproducible:

100%

Steps to Reproduce:
1.) Gather VM C&U data for several days.
2.) Compare realtime C&U to the NOR / Right-size data.

Actual results:

Avg/Max/High/Low values shown for NOR / Right-size do not reflect actual realtime usage values.

Expected results:

Avg/Max/High/Low values shown for NOR / Right-size reflect actual realtime usage values.

Additional info: