Created attachment 1486747
linear regression does not rapidly correct after space freed

Description of problem:
The prometheus alert:

    predict_linear(node_filesystem_free{job="node-exporter",mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24) < 0 and on(instance) up{job="node-exporter"}

performs linear regression over 6 days of data to determine whether free space is trending downward to 0. After receiving this alert, SRE would be responsible for freeing space on the node. However, after this procedure is performed and free space is available again, a pure linear regression can take days to recover from the downward trend, because it is weighted heavily by the past 6 days. See attached image for example.

Version-Release number of selected component (if applicable):
v3.11.0-0.21.0

How reproducible:
100%

Steps to Reproduce:
1. Allow space on a monitored mount to be exhausted over the course of 6 days
2. Correct the issue by clearing space on the mount
3. Note that the prometheus alert is not cleared

Actual results:
The linear regression can still point negative for days.

Expected results:
Clearing the mount should clear the alert in short order. I believe a sanity check for the extrapolation could provide this. Options:
1) Ensure that R^2 is > .8
2) linear_predict[6d] < 0 & current_free < avg(free[6d])
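To illustrate the two options above, here is a rough Python sketch rather than anything Prometheus ships. It assumes one sample per minute over a 6h window, approximates predict_linear with an ordinary least-squares fit extrapolated from the last sample, and uses made-up free-space figures in GiB. All function names and numbers are hypothetical.

```python
def fit_line(samples):
    """Least-squares fit over (minute, free_gib) pairs.

    Returns (slope, intercept, r_squared).
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in samples)
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    s_vv = sum((v - mean_v) ** 2 for _, v in samples)
    slope = s_tv / s_tt
    intercept = mean_v - slope * mean_t
    r_squared = 1.0 if s_vv == 0 else s_tv ** 2 / (s_tt * s_vv)
    return slope, intercept, r_squared


def predicted_free(samples, horizon):
    """Extrapolate the fitted line `horizon` minutes past the last sample."""
    slope, intercept, _ = fit_line(samples)
    return intercept + slope * (samples[-1][0] + horizon)


def alert_plain(samples, horizon):
    """Current behaviour: fire whenever the extrapolation goes below zero."""
    return predicted_free(samples, horizon) < 0


def alert_with_r2(samples, horizon, min_r2=0.8):
    """Option 1: only trust the extrapolation when the fit is good."""
    _, _, r_squared = fit_line(samples)
    return r_squared > min_r2 and predicted_free(samples, horizon) < 0


def alert_with_avg(samples, horizon):
    """Option 2: also require current free space to be below the window average."""
    average = sum(v for _, v in samples) / len(samples)
    return predicted_free(samples, horizon) < 0 and samples[-1][1] < average


HORIZON = 24 * 60  # predict 24h ahead, in minutes

# Hypothetical data: free space falls steadily from 100 GiB to 10 GiB.
declining = [(t, 100 - 90 * t / 359) for t in range(360)]

# Same decline, but space was freed 30 minutes ago (back to 90 GiB).
after_cleanup = [(t, 100 - 90 * t / 329) for t in range(330)]
after_cleanup += [(t, 90.0) for t in range(330, 360)]

for name, data in (("declining", declining), ("after_cleanup", after_cleanup)):
    print(name,
          alert_plain(data, HORIZON),
          alert_with_r2(data, HORIZON),
          alert_with_avg(data, HORIZON))
```

With this data, all three variants fire during the steady decline. After the cleanup the plain extrapolation still fires, while both guarded variants clear, which is the behaviour requested under Expected results.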
I made an error in the original description. The alert actually uses a 6h window, not 6 days. The options described should still provide a rapid 'all clear' for the alert.
Not sure on which cluster this is, but we've done an improvement of this rule that I can see on free-stg and free-int, where the alert is immediately resolved as soon as the disk usage is below 85% again. Do you think that's sufficient?
Initial BZ was authored when working on starter-ca-central-1 (v3.11.0-0.21.0). This cluster has now been upgraded to 3.11.16 (close to GA), so the new rules should be applied. I see the 85% filtering being applied for NodeDisk.

>>>>
alert: NodeDiskRunningFull
expr: '(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)'
<<<<

This is a significant improvement. However, I see no such filter for KubePersistentVolumeFullInFourDays.

>>>>
alert: KubePersistentVolumeFullInFourDays
expr: kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"} and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}[6h], 4 * 24 * 3600) < 0
<<<<

Adding a similar filter for KubePersistentVolumeFullInFourDays would probably satisfy the immediate need. I would still recommend having something to accommodate low R^2 values where the linear prediction is not accurate, but this may not happen often in practice.
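As a rough illustration of what a usage filter would do for the persistent volume alert, the Python sketch below gates a four-day extrapolation on current usage. It is only a model of the idea, not the actual rule: the 0.85 threshold is borrowed from NodeDiskRunningFull as an assumption, the 100 GiB capacity and the sample values are made up, and the regression is a plain least-squares fit over one sample per minute.

```python
def predict_linear(samples, horizon):
    """Least-squares extrapolation `horizon` minutes past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    return mean_v + slope * (samples[-1][0] + horizon - mean_t)


FOUR_DAYS = 4 * 24 * 60  # minutes
CAPACITY = 100.0         # GiB, hypothetical volume size


def full_in_four_days(available, capacity=CAPACITY, min_usage=0.85):
    """Fire only if the volume is currently mostly full AND trending to zero."""
    usage = 1 - available[-1][1] / capacity
    return usage > min_usage and predict_linear(available, FOUR_DAYS) < 0


# Hypothetical 6h window: available space drops from 90 GiB to 5 GiB.
filling = [(t, 90 - 85 * t / 359) for t in range(360)]

# Same drop, then a cleanup 30 minutes ago brings it back to 50 GiB.
cleaned = [(t, 90 - 85 * t / 329) for t in range(330)]
cleaned += [(t, 50.0) for t in range(330, 360)]

print(full_in_four_days(filling))              # usage 95%, trend negative
print(predict_linear(cleaned, FOUR_DAYS) < 0)  # trend alone is still negative
print(full_in_four_days(cleaned))              # usage 50%, so the gate clears it
```

In this model the trend on its own stays negative after the cleanup, but the usage gate resolves the alert as soon as current usage drops below the threshold.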
Thanks for the input Justin. Future work is tracked here: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/99.
Sure, we can prepare a PR for the release-3.11 branch.
The kubernetes-mixin has had too many changes since this was merged for us to be able to pull in all the changes in 3.11. It is fixed in 4.0 so moving target release to that and marking as modified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758