Bug 1632762

Summary: linear prediction provides slow alert removal for disk exhaustion
Product: OpenShift Container Platform
Component: Monitoring
Version: 3.11.0
Target Release: 4.1.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Type: Bug
Reporter: Justin Pierce <jupierce>
Assignee: Frederic Branczyk <fbranczy>
QA Contact: Junqi Zhao <juzhao>
CC: scuppett
Last Closed: 2019-06-04 10:40:35 UTC
Attachments: linear regression does not rapidly correct after space freed (flags: none)

Description Justin Pierce 2018-09-25 13:52:02 UTC
Created attachment 1486747
linear regression does not rapidly correct after space freed

Description of problem:
The Prometheus alert:

predict_linear(node_filesystem_free{job="node-exporter",mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h],
  3600 * 24) < 0 and on(instance) up{job="node-exporter"}

performs linear regression over 6 days of data to determine if free space is trending downward to 0. After receiving this alert, SRE would be responsible for freeing space on the node. However, after this procedure is performed and free space is available again, a pure linear regression can take days to recover from the downward trend, because it is weighted heavily by the past 6 days. See attached image for example.
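
The lag can be reproduced offline with an ordinary least-squares fit. The sketch below is illustrative only: the hourly free-space figures are invented, and the helper is a simplified stand-in for PromQL's predict_linear (it extrapolates from the last sample rather than from the evaluation time). It uses the 6h window and 24h horizon that appear in the alert expression.

```python
def predict_linear(samples, horizon):
    """Least-squares extrapolation over (t, v) samples.

    Simplified stand-in for PromQL's predict_linear: predicts the value
    `horizon` time units after the last sample in the window.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var_t
    return mean_v + slope * (samples[-1][0] + horizon - mean_t)


# Invented data: free space in GiB, one sample per hour.
# Space drains by 10 GiB/h until hour 6; a cleanup before hour 7 restores it to 40 GiB.
FREE = [70, 60, 50, 40, 30, 20, 10, 40, 40, 40, 40, 40, 40]
WINDOW = 6    # hours of history, as in [6h]
HORIZON = 24  # hours ahead, as in 3600 * 24


def firing_hours(free, window=WINDOW, horizon=HORIZON):
    """Return the hours at which the prediction is below zero."""
    firing = []
    for now in range(window, len(free)):
        samples = [(t, free[t]) for t in range(now - window, now + 1)]
        if predict_linear(samples, horizon) < 0:
            firing.append(now)
    return firing


if __name__ == "__main__":
    print(firing_hours(FREE))
```

With these invented numbers the condition still holds at hours 7 and 8, after free space has already recovered to 40 GiB, and clears from hour 9 onward. The length of the lag depends on the window, the horizon, and how much space the cleanup frees.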

Version-Release number of selected component (if applicable):
v3.11.0-0.21.0

How reproducible:
100%

Steps to Reproduce:
1. Allow space on a monitored mount to be exhausted over the course of 6 days
2. Correct the issue by clearing space on the mount
3. Note that the Prometheus alert is not cleared

Actual results:
The linear regression can still point negative for days. 

Expected results:
Clearing the mount should clear the alert in short order. I believe a sanity check for the extrapolation could provide this.
Options:
1) Ensure that R^2 is > .8
2) linear_predict[6d] < 0 & current_free < avg(free[6d])
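
The two options above can be sketched as guards on the same least-squares fit. This is a rough illustration under stated assumptions: the sample data is invented, fit and the two guard functions are hypothetical helpers rather than Prometheus functions, and 0.8 is simply the threshold proposed in option 1.

```python
def fit(samples):
    """Least-squares fit over (t, v) samples.

    Returns (slope, mean_t, mean_v, r_squared).
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    var_v = sum((v - mean_v) ** 2 for _, v in samples)
    slope = cov / var_t
    # R^2 of a simple linear regression is the squared correlation.
    r_squared = cov * cov / (var_t * var_v) if var_v > 0 else 0.0
    return slope, mean_t, mean_v, r_squared


def predicted(samples, horizon):
    """Extrapolate `horizon` time units past the last sample."""
    slope, mean_t, mean_v, _ = fit(samples)
    return mean_v + slope * (samples[-1][0] + horizon - mean_t)


def fires_with_r2_guard(samples, horizon, min_r2=0.8):
    """Option 1: only trust the prediction when the fit is good."""
    r_squared = fit(samples)[3]
    return predicted(samples, horizon) < 0 and r_squared > min_r2


def fires_with_average_guard(samples, horizon):
    """Option 2: also require current free space below the window average."""
    mean_v = fit(samples)[2]
    current = samples[-1][1]
    return predicted(samples, horizon) < 0 and current < mean_v


# Invented data: free space in GiB per hour, with a cleanup before hour 7.
FREE = [70, 60, 50, 40, 30, 20, 10, 40, 40, 40, 40, 40, 40]


def firing_hours(guard, free=FREE, window=6, horizon=24):
    """Return the hours at which the guarded condition holds."""
    firing = []
    for now in range(window, len(free)):
        samples = [(t, free[t]) for t in range(now - window, now + 1)]
        if guard(samples, horizon):
            firing.append(now)
    return firing


if __name__ == "__main__":
    print(firing_hours(fires_with_r2_guard))
    print(firing_hours(fires_with_average_guard))
```

On this data both guards fire only at hour 6, the last sample before the cleanup. The unguarded prediction stays negative for two more hours.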

Comment 1 Justin Pierce 2018-09-26 21:13:04 UTC
I made an error in the original description: the alert uses a 6h window, not 6 days. The options described should still provide a rapid 'all clear' for the alert.

Comment 2 Frederic Branczyk 2018-09-28 10:03:34 UTC
I'm not sure which cluster this is, but we've improved this rule, as I can see on free-stg and free-int: the alert now resolves as soon as disk usage drops below 85% again. Do you think that's sufficient?

Comment 3 Justin Pierce 2018-09-29 15:33:44 UTC
Initial BZ was authored when working on starter-ca-central-1 (v3.11.0-0.21.0). This cluster has now been upgraded to 3.11.16 (close to GA), so the new rules should be applied. I see the 85% filtering being applied for NodeDisk. 

>>>>
alert: NodeDiskRunningFull
expr: '(node:node_filesystem_usage:
  > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) <
  0)'
<<<<

This is a significant improvement.

However, I see no such filter for KubePersistentVolumeFullInFourDays.

>>>>
alert: KubePersistentVolumeFullInFourDays
expr: kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}[6h],
  4 * 24 * 3600) < 0
<<<<

Adding a similar filter for KubePersistentVolumeFullInFourDays would probably satisfy the immediate need.
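
One possible shape for such a filter, mirroring the 85% guard on NodeDiskRunningFull, is sketched below. The function name and the reuse of 0.85 as the volume threshold are assumptions for illustration, not the rule that eventually shipped. The inputs stand for the volume's available bytes, its capacity in bytes, and the result of the existing predict_linear term.

```python
def pv_full_in_four_days(available_bytes, capacity_bytes, predicted_bytes,
                         max_usage=0.85):
    """Hypothetical guarded condition for a persistent volume.

    Fires only when the volume is already more than `max_usage` full
    AND the linear prediction says it will run out of space.
    """
    usage = 1 - available_bytes / capacity_bytes
    return usage > max_usage and predicted_bytes < 0


if __name__ == "__main__":
    # 90% used with a negative prediction: fires.
    print(pv_full_in_four_days(10, 100, -5))
    # 60% used after a cleanup: suppressed even if the prediction lags.
    print(pv_full_in_four_days(40, 100, -5))
```

As with the node alert, the usage term makes the condition false as soon as space is freed, regardless of how long the regression takes to turn positive.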


I would still recommend having something to accommodate low R^2 values where the linear prediction is not accurate, but this may not happen often in practice.

Comment 4 minden 2018-10-01 12:08:05 UTC
Thanks for the input Justin. Future work is tracked here: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/99.

Comment 7 minden 2018-10-04 10:01:15 UTC
Sure, we can prepare a PR for the release-3.11 branch.

Comment 8 Frederic Branczyk 2019-02-27 14:45:54 UTC
The kubernetes-mixin has had too many changes since this was merged for us to be able to pull in all the changes in 3.11. It is fixed in 4.0 so moving target release to that and marking as modified.

Comment 13 errata-xmlrpc 2019-06-04 10:40:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758