Red Hat Bugzilla – Bug 1469243
C&U rollups / NOR / Right-Size values don't accurately reflect realtime data
Last modified: 2018-03-08 02:14:12 EST
Created attachment 1295899
Realtime memory usage histogram
Description of problem:
Hourly and daily rollups store the arithmetic mean of memory usage (mem_usage_absolute_average) and CPU usage (cpu_usagemhz_rate_average and cpu_usage_rate_average). The NOR high and low values are calculated as the mean +/- the sample standard deviation from the daily rollup averages.
These values don't accurately represent hourly or daily usage, especially when the data are not symmetrically distributed. For example, the attached metrics-realtime-hist-20170703-20170710.png shows the distribution of a week's worth of realtime memory usage captured for a VM. The distribution is bounded below by the minimum value of 0 and has a long tail of high values (max 14.99), so the mean (1.67) is pulled above the median and mode (both 0.99).
The daily rollups have the following distribution for mem_usage_absolute_average:
daily avg min = 1.55
daily avg max = 1.72
daily avg avg = 1.65
daily avg stddev_samp = 0.05
low = avg - stddev_samp = 1.60
high = avg + stddev_samp = 1.70
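The low/high calculation above can be sketched as follows. This is a minimal illustration, not the product's actual code: the function name nor_bounds and its input (a list of daily rollup averages) are assumptions made for the example.

```python
import statistics


def nor_bounds(daily_averages):
    """Return (avg, low, high) for a list of daily rollup averages.

    low and high are the mean minus/plus one sample standard deviation,
    matching the 'avg -/+ stddev_samp' calculation described in this report.
    """
    avg = statistics.mean(daily_averages)
    # statistics.stdev is the sample (n - 1) standard deviation.
    stddev_samp = statistics.stdev(daily_averages)
    return avg, avg - stddev_samp, avg + stddev_samp
```

Because the inputs are already daily averages, the resulting band only reflects how much the averages vary from day to day (here 1.65 -/+ 0.05), not how widely the realtime samples are spread.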
Compare these values to the percentiles calculated below from the realtime values:
mean  median  min  10%  20%   30%   40%   50%   60%   70%   80%   90%   max
1.67  0.99    0    0    0.99  0.99  0.99  0.99  1.99  1.99  2.99  2.99  14.99
median = 0.99
60th percentile = 1.99
70th percentile = 1.99
80th percentile = 2.99
The calculated 'low' value of 1.60 and the 'high' value of 1.70 both fall between the 50th and 60th percentiles of actual realtime memory usage. Because the 60th percentile (1.99) already exceeds the 'high' value, the 'conservative' right-size recommendations based on 'high' would actually be quite aggressive, bringing the available memory below the expected memory requirements more than 40% of the time. Similarly skewed estimates can be seen in the calculations for CPU usage.
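One way to check a claim like this against realtime data is to count how often the samples exceed a proposed right-size value. The sketch below is only an illustration; fraction_above is a hypothetical helper, and samples stands for whatever list of realtime values is under test.

```python
def fraction_above(samples, threshold):
    """Return the fraction of realtime samples strictly above threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for value in samples if value > threshold) / len(samples)
```

Calling fraction_above(samples, 1.70) on the week of realtime memory values would give the share of time the 'high' recommendation falls short of actual usage.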
Instead of using the mean and standard deviation, something like the median (50th percentile) and other high/low percentile values (the 85th and 15th percentiles, for example) would be more representative of the actual usage.
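A percentile-based alternative could look roughly like the sketch below. It assumes the realtime samples are available as a list and uses the nearest-rank percentile definition, which is only one of several possible definitions. The function names and the 15/85 defaults are illustrative, taken from the suggestion above rather than from any existing implementation.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct, 1..100) by nearest rank."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("samples must not be empty")
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_bounds(samples, low_pct=15, high_pct=85):
    """Return (low, median, high) taken directly from the realtime samples."""
    return (
        percentile_nearest_rank(samples, low_pct),
        percentile_nearest_rank(samples, 50),
        percentile_nearest_rank(samples, high_pct),
    )
```

Unlike the mean and standard deviation, these values are always actual observed usage levels, so a skewed distribution or a long tail does not shift them away from what the VM typically uses.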
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1.) Gather VM C&U data for several days.
2.) Compare realtime C&U to the NOR / Right-size data.
Actual results:
Avg/Max/High/Low values shown for NOR / Right-size do not reflect actual realtime usage values.
Expected results:
Avg/Max/High/Low values shown for NOR / Right-size reflect actual realtime usage values.