2077722 – NodeFilesystemSpaceFillingUp alert fires even before kubelet GC kicks in

Summary: NodeFilesystemSpaceFillingUp alert fires even before kubelet GC kicks in

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.10
Hardware:	All
OS:	All
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	4.10.z
Assignee:	Arunprasad Rajkumar
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	2074807
Blocks:

Reported:	2022-04-22 02:44 UTC by Arunprasad Rajkumar
Modified:	2022-05-23 13:25 UTC (History)
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2074807
Environment:
Last Closed:	2022-05-23 13:25:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1665	None	open	Bug 2077722: Adjust NodeFilesystemSpaceFillingUp thresholds according default kubelet GC behavior	2022-05-10 10:20:17 UTC
Github	prometheus-operator kube-prometheus pull 1740	None	open	Adjust NodeFilesystemSpaceFillingUp thresholds according default kubelet GC behavior	2022-04-27 06:50:35 UTC
Red Hat Product Errata	RHBA-2022:2258	None	None	None	2022-05-23 13:25:34 UTC

Description Arunprasad Rajkumar 2022-04-22 02:44:35 UTC

+++ This bug was initially created as a clone of Bug #2074807 +++

Description of problem:

A previous attempt[1] to adjust the alert thresholds to kubelet GC
behavior was based on a misunderstanding of how GC works, so the alert
still fired before GC came into play.

According to [2][3], kubelet image GC kicks in only when disk usage reaches `imageGCHighThresholdPercent`, which is set to 85% by default. However, `NodeFilesystemSpaceFillingUp` is set to fire as soon as usage reaches 80%.

[1] https://github.com/prometheus-operator/kube-prometheus/pull/1357
[2] https://docs.openshift.com/container-platform/4.10/nodes/nodes/nodes-nodes-garbage-collection.html#nodes-nodes-garbage-collection-images_nodes-nodes-configuring
[3] https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/ 
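
The mismatch described above comes down to comparing two percentages: the alert is written in terms of available space, while the kubelet threshold is written in terms of used space. The following is a minimal sketch of that comparison, not part of any fix; the function name `fires_before_gc` is hypothetical, and the 85% default is taken from the description above.

```python
# Hypothetical helper for illustration only; not part of kubelet or Prometheus.
# The 85% default comes from the bug description (imageGCHighThresholdPercent).
IMAGE_GC_HIGH_THRESHOLD_PERCENT = 85


def fires_before_gc(alert_avail_below_percent,
                    gc_high_threshold_percent=IMAGE_GC_HIGH_THRESHOLD_PERCENT):
    """Return True if the alert can fire at a usage level where GC has not started.

    The alert condition is "available % < X", which is the same as
    "used % > 100 - X". GC starts once used % reaches the high threshold.
    """
    alert_usage_percent = 100 - alert_avail_below_percent
    return alert_usage_percent < gc_high_threshold_percent


# An alert on "< 20% available" corresponds to 80% usage, below the GC threshold.
print(fires_before_gc(20))
# An alert on "< 15% available" corresponds to 85% usage, matching the GC threshold.
print(fires_before_gc(15))
```

Under these assumptions, any alert threshold of 15% available space or lower cannot be reached before GC has had a chance to run.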

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
NodeFilesystemSpaceFillingUp fires before kubelet GC kicks in

Expected results:

NodeFilesystemSpaceFillingUp shouldn't fire before kubelet GC kicks in


Additional info:

--- Additional comment from OpenShift Automated Release Tooling on 2022-04-22 03:25:45 IST ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release created.

Comment 2 Junqi Zhao 2022-05-12 03:04:35 UTC

Tested with the PR; the expr for NodeFilesystemSpaceFillingUp is updated as follows:
        - alert: NodeFilesystemSpaceFillingUp
          annotations:
            description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
              only {{ printf "%.2f" $value }}% available space left and is filling up.
            runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
            summary: Filesystem is predicted to run out of space within the next 24 hours.
          expr: |
            (
              node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
            and
              predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
            and
              node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
            )
          for: 1h
          labels:
            severity: warning
        - alert: NodeFilesystemSpaceFillingUp
          annotations:
            description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
              only {{ printf "%.2f" $value }}% available space left and is filling up fast.
            runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
            summary: Filesystem is predicted to run out of space within the next 4 hours.
          expr: |
            (
              node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 10
            and
              predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
            and
              node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
            )
          for: 1h
          labels:
            severity: critical
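
To make the three conditions in each rule concrete, here is a rough Python sketch that evaluates them on made-up samples. It is an approximation under stated assumptions: `predict_linear` below is a plain least-squares extrapolation from the last sample and only approximates the Prometheus function of the same name, `filling_up` and the sample data are hypothetical, and the `for: 1h` clause is not modeled.

```python
def predict_linear(samples, horizon_seconds):
    """Least-squares extrapolation over (timestamp_seconds, value) samples.

    Approximates PromQL predict_linear(); extrapolates from the last sample time.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return mean_v  # a single point in time: no trend to extrapolate
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_seconds)


def filling_up(samples, size_bytes, read_only, avail_below_percent, horizon_seconds):
    """Mirror the three 'and'-ed conditions of the alert expression."""
    avail_now = samples[-1][1]
    return (
        avail_now / size_bytes * 100 < avail_below_percent
        and predict_linear(samples, horizon_seconds) < 0
        and not read_only
    )


# Made-up data: a 1000-byte filesystem losing 10 bytes per hour over 6 hours,
# ending at 140 bytes (14%) available.
samples = [(i * 3600, 200 - 10 * i) for i in range(7)]

warning = filling_up(samples, 1000, False, 15, 24 * 60 * 60)
critical = filling_up(samples, 1000, False, 10, 4 * 60 * 60)
print("warning:", warning, "critical:", critical)
```

With this sample data the warning condition holds (14% available, predicted to run out within 24 hours) while the critical condition does not, since more than 10% is still available.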

Comment 5 Junqi Zhao 2022-05-16 01:02:36 UTC

Based on comment 2 and comment 4, setting to VERIFIED.

Comment 8 errata-xmlrpc 2022-05-23 13:25:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2258
