Bug 1469243 - C&U rollups / NOR / Right-Size values don't accurately reflect realtime data
Status: ON_DEV
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: C&U Capacity and Utilization
All All
Priority: high, Severity: medium
: GA
: cfme-future
Assigned To: Gregg Tanzillo
Tasos Papaioannou
Depends On:
Reported: 2017-07-10 13:23 EDT by Tasos Papaioannou
Modified: 2018-03-08 02:14 EST

Type: Bug

Attachments
Realtime memory usage histogram (11.26 KB, image/png)
2017-07-10 13:23 EDT, Tasos Papaioannou

Description Tasos Papaioannou 2017-07-10 13:23:35 EDT
Created attachment 1295899
Realtime memory usage histogram

Description of problem:

Hourly and daily rollups store the arithmetic mean of memory usage (mem_usage_absolute_average) and CPU usage (cpu_usagemhz_rate_average and cpu_usage_rate_average). The NOR (normal operating range) high and low values are calculated as the mean +/- the sample standard deviation of the daily rollup averages.

These values aren't accurate measures of hourly or daily usage, especially when the data don't follow a symmetric distribution. For example, the attached metrics-realtime-hist-20170703-20170710.png shows the distribution of a week's worth of realtime memory usage captured for a VM. The distribution is bounded below by the minimum value of 0 and has a long tail of high values, so the mean (1.67) is pulled above the median and mode (both 0.99).

The daily rollups have the following distribution for mem_usage_absolute_average:

daily avg min = 1.55
daily avg max = 1.72
daily avg avg = 1.65
daily avg stddev_samp = 0.05
low  = avg - stddev_samp = 1.60
high = avg + stddev_samp = 1.70
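
The calculation above can be sketched in a few lines of Python. This is only an illustration of the mean +/- sample standard deviation approach, not the product's actual code: the function name and the seven daily averages are hypothetical, chosen so that they fall within the min/max reported above.

```python
import statistics

def nor_from_daily_averages(daily_avgs):
    """Return (low, avg, high) as mean -/+ one sample standard deviation."""
    avg = statistics.mean(daily_avgs)
    sd = statistics.stdev(daily_avgs)  # sample stddev (n - 1 denominator)
    return avg - sd, avg, avg + sd

# Hypothetical daily rollup averages (NOT the real data behind this report).
daily_avgs = [1.55, 1.62, 1.65, 1.66, 1.67, 1.68, 1.72]

low, avg, high = nor_from_daily_averages(daily_avgs)
print(f"low={low:.2f} avg={avg:.2f} high={high:.2f}")
```

Because the inputs are already daily averages, the spread between 'low' and 'high' only reflects how much the daily means vary from day to day, not how much realtime usage varies within a day.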

Compare these values to the percentiles calculated below from the realtime values:

mean  median   min   10%   20%   30%   40%   50%   60%   70%   80%   90%    max
1.67    0.99     0     0  0.99  0.99  0.99  0.99  1.99  1.99  2.99  2.99  14.99

median = 0.99
60th percentile = 1.99
70th percentile = 1.99
80th percentile = 2.99

The calculated 'low' value of 1.60 and the 'high' value of 1.70 both fall between the 50th percentile (0.99) and the 60th percentile (1.99) of actual realtime memory usage. The 'conservative' right-size recommendations based on the 'high' value would actually be quite aggressive: because the 60th percentile already exceeds 1.70, they would bring the available memory below the observed memory requirements >40% of the time. Similarly skewed estimates can be seen in calculations for CPU usage.
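
One way to check how aggressive a given threshold is would be to count the share of realtime samples that exceed it. The sketch below assumes a hypothetical helper, fraction_above, and ten made-up samples shaped like the percentile row above (one value per decile); it is not the real dataset.

```python
def fraction_above(samples, threshold):
    """Share of samples strictly greater than the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)

# Hypothetical realtime samples, one per decile, mirroring the percentile row.
realtime = [0.0, 0.0, 0.99, 0.99, 0.99, 0.99, 1.99, 1.99, 2.99, 2.99]

nor_high = 1.70  # the 'high' value computed from the daily rollups
print(fraction_above(realtime, nor_high))
```

With these made-up samples, 4 of 10 values lie above 1.70, so a VM sized to the 'high' value would be short of memory in roughly that share of intervals.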

Instead of using the mean and standard deviation, something like the median (50th percentile) and other high/low percentile values (the 85th and 15th percentiles, for example) would be more representative of the actual usage.
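
A minimal sketch of the percentile-based alternative, assuming the nearest-rank definition of a percentile and the 15th/50th/85th cut-offs suggested above. The function names and the 20 sample values are hypothetical and only meant to show the shape of the calculation.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def nor_from_percentiles(samples, low_p=15, high_p=85):
    """Return (low, typical, high) taken directly from the realtime samples."""
    return percentile(samples, low_p), percentile(samples, 50), percentile(samples, high_p)

# Hypothetical realtime memory samples (NOT the real data behind this report).
realtime = [0.0] * 4 + [0.99] * 6 + [1.99] * 4 + [2.99] * 4 + [3.99, 14.99]

low, typical, high = nor_from_percentiles(realtime)
print(low, typical, high)
```

Unlike the mean, these values are not dragged upward by a single outlier such as 14.99, and the 'high' value sits above most of the observed samples by construction.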

Version-Release number of selected component (if applicable):

How reproducible:


Steps to Reproduce:
1.) Gather VM C&U data for several days.
2.) Compare realtime C&U to the NOR / Right-size data.

Actual results:

Avg/Max/High/Low values shown for NOR / Right-size do not reflect actual realtime usage values.

Expected results:

Avg/Max/High/Low values shown for NOR / Right-size reflect actual realtime usage values.

Additional info:
