Bug 1632762 - linear prediction provides slow alert removal for disk exhaustion
Summary: linear prediction provides slow alert removal for disk exhaustion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-25 13:52 UTC by Justin Pierce
Modified: 2019-06-04 10:40 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:40:35 UTC
Target Upstream Version:


Attachments (Terms of Use)
linear regression does not rapidly correct after space freed (12.53 KB, image/png)
2018-09-25 13:52 UTC, Justin Pierce


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:40:45 UTC

Description Justin Pierce 2018-09-25 13:52:02 UTC
Created attachment 1486747 [details]
linear regression does not rapidly correct after space freed

Description of problem:
The prometheus alert:

  predict_linear(node_filesystem_free{job="node-exporter",mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24) < 0 and on(instance) up{job="node-exporter"}

performs a linear regression over 6 days of data to determine whether free space is trending toward 0. After receiving this alert, SRE is responsible for freeing space on the node. However, even after this procedure is performed and free space is available, a pure linear regression can take days to recover from the downward trend, since it is weighted heavily by the past 6 days of data. See the attached image for an example.
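The slow recovery can be reproduced with a small simulation. This is a sketch with illustrative numbers (not the cluster data from the attachment); `predict_linear` below mirrors the least-squares extrapolation Prometheus performs over a range vector.

```python
# Sketch: why a trailing-window linear regression keeps an alert firing
# after disk space is freed. All values and windows are illustrative.

HOUR = 3600

def predict_linear(samples, horizon):
    """Least-squares fit over (t, v) samples, extrapolated `horizon`
    seconds past the last sample (mirrors Prometheus predict_linear)."""
    n = len(samples)
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in samples)
    var = sum((t - t_mean) ** 2 for t in ts)
    slope = cov / var
    intercept = v_mean - slope * t_mean
    return slope * (ts[-1] + horizon) + intercept

window = 6 * HOUR       # the [6h] range in the alert
horizon = 24 * HOUR     # the 3600 * 24 horizon in the alert

# Free space drains steadily for 6 hours, then cleanup frees 28 GB.
series = [(t * HOUR, 50e9 - 8e9 * t) for t in range(7)]   # draining
series += [(t * HOUR, 30e9) for t in range(7, 13)]        # after cleanup

def alert_firing(series, now_idx):
    """Evaluate the alert condition at sample `now_idx`."""
    t_now = series[now_idx][0]
    win = [(t, v) for t, v in series[: now_idx + 1] if t >= t_now - window]
    return predict_linear(win, horizon) < 0

# Right after cleanup the window still contains the draining samples,
# so the extrapolation stays negative and the alert keeps firing.
print(alert_firing(series, 7))   # True  - fires even though space was freed
print(alert_firing(series, 12))  # False - clears once old samples age out
```

With the 6h window of this sketch the alert clears within a couple of samples; with the 6-day window originally described, the same mechanism keeps the extrapolation negative for days.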

Version-Release number of selected component (if applicable):
v3.11.0-0.21.0

How reproducible:
100%

Steps to Reproduce:
1. Allow space on a monitored mount to be exhausted over the course of 6 days
2. Correct the issue by clearing space on the mount
3. Note that the prometheus alert is not cleared

Actual results:
The linear regression can still point negative for days. 

Expected results:
Clearing the mount should clear the alert in short order. I believe a sanity check on the extrapolation could provide this.
Options:
1) Ensure that R^2 is > .8
2) linear_predict[6d] < 0 & current_free < avg(free[6d])
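Option 1 could be evaluated offline as follows. Prometheus has no built-in R^2 function, so this is an illustrative sketch of the goodness-of-fit check, not something expressible directly in the alert rule:

```python
# Sketch of option 1: gate the prediction on the goodness of fit (R^2).
# A steady decline fits a line well; a decline followed by a cleanup
# does not, so a low R^2 could suppress the stale extrapolation.

def r_squared(samples):
    """Coefficient of determination for a least-squares line fit
    over (t, v) samples."""
    n = len(samples)
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in samples)
    var_t = sum((t - t_mean) ** 2 for t in ts)
    slope = cov / var_t
    intercept = v_mean - slope * t_mean
    ss_res = sum((v - (slope * t + intercept)) ** 2 for t, v in samples)
    ss_tot = sum((v - v_mean) ** 2 for v in vs)
    return 1.0 - ss_res / ss_tot

steady_decline = [(t, 100.0 - 2.0 * t) for t in range(10)]
decline_then_free = [(t, 100.0 - 10.0 * t) for t in range(5)] + \
                    [(t, 90.0) for t in range(5, 10)]

print(r_squared(steady_decline))     # 1.0: the extrapolation is trustworthy
print(r_squared(decline_then_free))  # far below 0.8: suppress the alert
```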

Comment 1 Justin Pierce 2018-09-26 21:13:04 UTC
I made an error in the original description. The alert actually uses a 6h window, not 6 days. The options described should still provide a rapid 'all clear' for the alert.

Comment 2 Frederic Branczyk 2018-09-28 10:03:34 UTC
Not sure which cluster this is on, but we've improved this rule, as I can see on free-stg and free-int: the alert resolves immediately once disk usage drops below 85% again. Do you think that's sufficient?

Comment 3 Justin Pierce 2018-09-29 15:33:44 UTC
Initial BZ was authored while working on starter-ca-central-1 (v3.11.0-0.21.0). This cluster has since been upgraded to 3.11.16 (close to GA), so the new rules should be applied. I see the 85% filter being applied for NodeDiskRunningFull.

>>>>
alert: NodeDiskRunningFull
expr: '(node:node_filesystem_usage:
  > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) <
  0)'
<<<<

This is a significant improvement.

However, I see no such filter for KubePersistentVolumeFullInFourDays.

>>>>
alert: KubePersistentVolumeFullInFourDays
expr: kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}
  and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",namespace=~"(openshift.*|kube.*|default|logging)"}[6h],
  4 * 24 * 3600) < 0
<<<<

Adding a similar filter for KubePersistentVolumeFullInFourDays would probably satisfy the immediate need.
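The suggested guard would mirror NodeDiskRunningFull: fire only when the volume is already mostly used AND the extrapolation runs out within the horizon. A sketch of the combined condition (the `pv_alert` helper and the 0.85 threshold for PVs are illustrative assumptions borrowed from the node rule, not the shipped rule):

```python
# Sketch: usage-threshold guard for a KubePersistentVolumeFullInFourDays-
# style alert. Names and the 0.85 threshold are illustrative assumptions.

def pv_alert(available_bytes, capacity_bytes, predicted_available):
    """Fire only when the volume is already mostly used AND the linear
    extrapolation of available bytes goes negative within the horizon."""
    usage = 1.0 - available_bytes / capacity_bytes
    return usage > 0.85 and predicted_available < 0

# Mostly empty volume with a stale negative prediction: suppressed.
print(pv_alert(available_bytes=80e9, capacity_bytes=100e9,
               predicted_available=-1e9))   # False
# Nearly full and trending toward zero: fires.
print(pv_alert(available_bytes=10e9, capacity_bytes=100e9,
               predicted_available=-1e9))   # True
```

As soon as cleanup brings usage back under the threshold, the first clause turns false and the alert resolves regardless of how long the regression window takes to catch up.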


I would still recommend having something to accommodate low R^2 values where the linear prediction is not accurate, but this may not happen often in practice.

Comment 4 minden 2018-10-01 12:08:05 UTC
Thanks for the input Justin. Future work is tracked here: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/99.

Comment 7 minden 2018-10-04 10:01:15 UTC
Sure, we can prepare a PR for the release-3.11 branch.

Comment 8 Frederic Branczyk 2019-02-27 14:45:54 UTC
The kubernetes-mixin has changed too much since this fix was merged for us to pull all the changes into 3.11. It is fixed in 4.0, so I am moving the target release there and marking this as MODIFIED.

Comment 13 errata-xmlrpc 2019-06-04 10:40:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

