Bug 2077722 - NodeFilesystemSpaceFillingUp alert fires even before kubelet GC kicks in
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: All
OS: All
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.10.z
Assignee: Arunprasad Rajkumar
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 2074807
Blocks:
Reported: 2022-04-22 02:44 UTC by Arunprasad Rajkumar
Modified: 2022-05-23 13:25 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2074807
Environment:
Last Closed: 2022-05-23 13:25:10 UTC
Target Upstream Version:
Embargoed:

Links
System ID: Github openshift cluster-monitoring-operator pull 1665
  Private: 0, Priority: None, Status: open, Last Updated: 2022-05-10 10:20:17 UTC
  Summary: Bug 2077722: Adjust NodeFilesystemSpaceFillingUp thresholds according default kubelet GC behavior
System ID: Github prometheus-operator kube-prometheus pull 1740
  Private: 0, Priority: None, Status: open, Last Updated: 2022-04-27 06:50:35 UTC
  Summary: Adjust NodeFilesystemSpaceFillingUp thresholds according default kubelet GC behavior
System ID: Red Hat Product Errata RHBA-2022:2258
  Private: 0, Priority: None, Status: None, Last Updated: 2022-05-23 13:25:34 UTC
  Summary: None

Description Arunprasad Rajkumar 2022-04-22 02:44:35 UTC
+++ This bug was initially created as a clone of Bug #2074807 +++

Description of problem:

Previously[1] we attempted to address the same problem, but there was a
misunderstanding about the GC behavior, so the alert still fired
before GC came into play.

According to [2][3], kubelet GC kicks in only when `imageGCHighThresholdPercent` is reached, which is set to 85% by default. However, `NodeFilesystemSpaceFillingUp` is set to fire as soon as usage reaches 80%.

[1] https://github.com/prometheus-operator/kube-prometheus/pull/1357
[2] https://docs.openshift.com/container-platform/4.10/nodes/nodes/nodes-nodes-garbage-collection.html#nodes-nodes-garbage-collection-images_nodes-nodes-configuring
[3] https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/ 
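
To make the mismatch concrete, here is a minimal sketch of the arithmetic, not taken from the PR. It assumes the 85% default for `imageGCHighThresholdPercent` quoted above and treats the alert threshold as "percent of space still available", so 80% usage corresponds to 20% available. The function names are hypothetical and exist only for illustration.

```python
from typing import Optional, Tuple

# Default quoted in the description above; a cluster may override it.
IMAGE_GC_HIGH_THRESHOLD_PERCENT = 85.0


def premature_window(
    avail_threshold_percent: float,
    gc_high_percent: float = IMAGE_GC_HIGH_THRESHOLD_PERCENT,
) -> Optional[Tuple[float, float]]:
    """Return the usage range (low, high) in percent where the alert's space
    condition already holds but image GC has not started, or None if there
    is no such range."""
    usage_at_alert = 100.0 - avail_threshold_percent
    if usage_at_alert < gc_high_percent:
        return (usage_at_alert, gc_high_percent)
    return None


def alert_precedes_gc(
    avail_threshold_percent: float,
    gc_high_percent: float = IMAGE_GC_HIGH_THRESHOLD_PERCENT,
) -> bool:
    """True if the alert's space condition can be met before GC kicks in."""
    return premature_window(avail_threshold_percent, gc_high_percent) is not None


if __name__ == "__main__":
    # 20% available == 80% used: the alert can match between 80% and 85% usage.
    print(premature_window(20))
    # 15% available == 85% used: no gap relative to the GC threshold.
    print(premature_window(15))
```

Under these assumptions, any "available" threshold above 15% leaves a usage range in which the alert can match while GC has not yet had a chance to reclaim space.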

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
NodeFilesystemSpaceFillingUp fires before kubelet GC kicks in

Expected results:

NodeFilesystemSpaceFillingUp shouldn't fire before kubelet GC kicks in


Additional info:

--- Additional comment from OpenShift Automated Release Tooling on 2022-04-22 03:25:45 IST ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release created.

Comment 2 Junqi Zhao 2022-05-12 03:04:35 UTC
Tested with the PR; the expr for NodeFilesystemSpaceFillingUp is updated as follows:
        - alert: NodeFilesystemSpaceFillingUp
          annotations:
            description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
              only {{ printf "%.2f" $value }}% available space left and is filling up.
            runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
            summary: Filesystem is predicted to run out of space within the next 24 hours.
          expr: |
            (
              node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
            and
              predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
            and
              node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
            )
          for: 1h
          labels:
            severity: warning
        - alert: NodeFilesystemSpaceFillingUp
          annotations:
            description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
              only {{ printf "%.2f" $value }}% available space left and is filling up fast.
            runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
            summary: Filesystem is predicted to run out of space within the next 4 hours.
          expr: |
            (
              node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 10
            and
              predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
            and
              node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
            )
          for: 1h
          labels:
            severity: critical
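
The three conditions in each `expr` above can be sketched in plain Python to show how they combine. This is an illustrative approximation, not Prometheus code: the `predict_linear` here is a simple least-squares fit that extrapolates from the timestamp of the last sample, the `for: 1h` clause is not modeled, and the helper `would_match` and its parameters are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, available bytes)


def predict_linear(samples: List[Sample], horizon_s: float) -> float:
    """Least-squares linear fit over the samples, extrapolated horizon_s
    seconds past the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def would_match(
    samples: List[Sample],
    size_bytes: float,
    read_only: bool,
    avail_threshold_percent: float,
    horizon_s: float,
) -> bool:
    """Combine the three conditions of the alert expression with AND."""
    avail_percent = samples[-1][1] / size_bytes * 100
    return (
        avail_percent < avail_threshold_percent
        and predict_linear(samples, horizon_s) < 0
        and not read_only
    )


if __name__ == "__main__":
    # Hypothetical 100 GB filesystem losing 1 GB per hour over 6 hours,
    # going from 20 GB to 14 GB available.
    gb = 1e9
    samples = [(h * 3600.0, (20 - h) * gb) for h in range(7)]
    print(would_match(samples, 100 * gb, False, 15, 24 * 60 * 60))  # warning rule
    print(would_match(samples, 100 * gb, False, 10, 4 * 60 * 60))   # critical rule
```

In this made-up series the warning rule's conditions hold (14% available, projected to reach zero within 24 hours), while the critical rule's do not, because 14% available is still above its 10% threshold.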

Comment 5 Junqi Zhao 2022-05-16 01:02:36 UTC
Based on comment 2 and comment 4, set to VERIFIED.

Comment 8 errata-xmlrpc 2022-05-23 13:25:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2258
