Red Hat Bugzilla – Bug 620596
enhancement: add "delay" to baseline calculation, to improve detection of problem metrics
Last modified: 2014-09-26 05:19:11 EDT
How things work today:
* out-of-bounds metrics / problem metrics are used to determine whether incoming metrics may be deviating from "normal" values
* "normal" values have historically been the corresponding measurement baseline data, which for all intents and purposes represents the *most recent* trailing average at the time it is calculated
Suggested incremental improvement:
* add a global / system-wide configuration parameter called "delay", which indicates that you want measurement baselines to represent a *slightly stale* (as opposed to most recent) trailing average for each metric
* this staleness will allow the out-of-bounds metrics to find "problems" sooner rather than later, because if a metric is deviating from its "normal" values, the delay parameter will make that deviation more prominent
* reset the system-wide default values for baseline calculation --> dataset, frequency, and delay become 7 days, 24 hours, and 24 hours, respectively
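As a rough illustration of the proposal above, the following Python sketch computes a baseline over a window that ends "delay" before the calculation time instead of at the calculation time. Every name in it (Baseline, baseline_window, compute_baseline, is_out_of_bounds) is hypothetical and chosen only for this example; none of it is taken from the actual server code, and the min/max bounds check is an assumed simplification of how out-of-bounds detection works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Baseline:
    """Hypothetical baseline record: min, max and mean over the window."""
    minimum: float
    maximum: float
    mean: float


def baseline_window(now: datetime,
                    dataset: timedelta = timedelta(days=7),
                    delay: timedelta = timedelta(hours=24)) -> Tuple[datetime, datetime]:
    """Return (start, end) of the data window used for the baseline.

    With delay == 0 the window ends at `now` (the behaviour described
    under "How things work today"); with a positive delay the window is
    shifted into the past, giving a slightly stale baseline.
    """
    end = now - delay
    return end - dataset, end


def compute_baseline(samples: Iterable[Tuple[datetime, float]],
                     now: datetime,
                     dataset: timedelta = timedelta(days=7),
                     delay: timedelta = timedelta(hours=24)) -> Optional[Baseline]:
    """Compute a baseline from (timestamp, value) samples inside the window."""
    start, end = baseline_window(now, dataset, delay)
    values = [v for t, v in samples if start <= t < end]
    if not values:
        return None  # not enough data yet
    return Baseline(min(values), max(values), mean(values))


def is_out_of_bounds(value: float, baseline: Baseline) -> bool:
    """Assumed OOB rule: the value lies outside the baseline's [min, max]."""
    return value < baseline.minimum or value > baseline.maximum


# Usage: an hourly metric that grows by 1 every hour for 10 days.
t0 = datetime(2010, 8, 1)
samples = [(t0 + timedelta(hours=h), float(h)) for h in range(240)]
now = t0 + timedelta(hours=240)

fresh = compute_baseline(samples, now, delay=timedelta(0))
stale = compute_baseline(samples, now, delay=timedelta(hours=24))
```

In this toy data set, a sample taken a few hours before the recalculation still falls inside the bounds of the fresh baseline, while the delayed baseline continues to flag it.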
Background from Greg Hinkle:
The best example I can think of is a metric that is, on average, growing each day but growing more at peak times. In this example it is a dynamic metric, not trendsup (cumulative), and therefore is not expected to grow continuously. Our system would show a saw-tooth on the baseline factor and, worse, the OOB would disappear entirely each time the baseline was calculated and might not show up again until late in the recalculation cycle. Giving it a delay period equal to the recalculation cycle would likely cause new incoming data to still be "out of bounds" rather than just barely greater than previous max values.
Since this system is designed to find expected steady-state metrics that are not behaving properly, the delay would give you a better chance of seeing them at any given time. Otherwise, you'd have to look at the OOB system at the right time of day for a given metric's baseline recalculation period.
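The saw-tooth described above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: the metric grows by exactly 1 per hour (no peaks), the baseline is reduced to the maximum over the dataset window, and the function name oob_margins is made up for the example. It reports how far each new sample sits above the current baseline maximum.

```python
def oob_margins(delay_hours, dataset_hours=7 * 24, frequency_hours=24,
                total_hours=21 * 24):
    """Return, per hour, how far the sample exceeds the baseline maximum.

    Assumptions (illustrative only): one sample per hour, value == hour
    index, baseline recalculated every `frequency_hours`, and the baseline
    window covers `dataset_hours` ending `delay_hours` before recalculation.
    """
    values = [float(h) for h in range(total_hours)]
    first = dataset_hours + delay_hours  # first hour with a full window
    baseline_max = None
    margins = []
    for h in range(first, total_hours):
        if (h - first) % frequency_hours == 0:
            # Recalculate the baseline over the (possibly delayed) window.
            window = values[h - delay_hours - dataset_hours:h - delay_hours]
            baseline_max = max(window)
        margins.append(values[h] - baseline_max)
    return margins


no_delay = oob_margins(delay_hours=0)
with_delay = oob_margins(delay_hours=24)

print(min(no_delay), max(no_delay))      # margin range without delay
print(min(with_delay), max(with_delay))  # margin range with a 24-hour delay
```

Without a delay, the margin drops back to almost nothing right after every recalculation and only builds up toward the end of the cycle. With a delay equal to the recalculation cycle, the margin never falls below one full cycle's worth of growth, so the deviation stays visible at any time of day.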
Another way to see this issue is to do a basic install and inventory a whole server all in one shot (and nothing else). You'll see the list of OOBs (aka Problem Metrics) grow over the baseline recalc period and then get zeroed before the list starts growing again. This happens because the recalculation period for any metric starts when it is scheduled... and in this example all schedules would have the same initial time.
Like I said, this doesn't give us the TOD/DOW cycle comparison stuff that would be really nice, but I think it does increase the chance of there being useful data in the OOB system at any given time. Which, I must say, has surprisingly useful data at times. I've found interesting memory impact data when I turned on some event tracking and later saw the GC collections and timings shoot up quite a bit.
I would make it three configurable values (adding the delay to the other two existing settings) and then set the defaults as mentioned previously.
heiko - didn't you file a bug like this recently - what are your thoughts?
I think, while being a good request, we should not follow up on this further, as I think there are much better algorithms than our baselines (e.g. Holt-Winters), which we should look at for the future.
So I am closing this one.