Bug 2077722

Summary: NodeFilesystemSpaceFillingUp alert fires even before kubelet GC kicks in
Product: OpenShift Container Platform Reporter: Arunprasad Rajkumar <arajkuma>
Component: MonitoringAssignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: low Docs Contact:
Priority: medium    
Version: 4.10CC: amuller, anpicker, aos-bugs, erooth, hongyli, juzhao, wking
Target Milestone: ---   
Target Release: 4.10.z   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2074807 Environment:
Last Closed: 2022-05-23 13:25:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2074807    
Bug Blocks:    

Description Arunprasad Rajkumar 2022-04-22 02:44:35 UTC
+++ This bug was initially created as a clone of Bug #2074807 +++

Description of problem:

Previously[1] we attempted to do the same, but there was a
misunderstanding about the GC behavior and it caused the alert to be
fired even before GC comes into play.

According to[2][3] kubelet GC kicks in only when `imageGCHighThresholdPercent` is hit which is set to 85% by default. However `NodeFilesystemSpaceFillingUp` is set to fire as soon as 80% usage is hit.

[1] https://github.com/prometheus-operator/kube-prometheus/pull/1357
[2] https://docs.openshift.com/container-platform/4.10/nodes/nodes/nodes-nodes-garbage-collection.html#nodes-nodes-garbage-collection-images_nodes-nodes-configuring
[3] https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/ 

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
NodeFilesystemSpaceFillingUp fires before kubelet GC kicks in

Expected results:

NodeFilesystemSpaceFillingUp shouldn't fire before kubelet GC kicks in


Additional info:

--- Additional comment from OpenShift Automated Release Tooling on 2022-04-22 03:25:45 IST ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release created.

Comment 2 Junqi Zhao 2022-05-12 03:04:35 UTC
tested with PR, the expr for NodeFilesystemSpaceFillingUp is updated to below:
        - alert: NodeFilesystemSpaceFillingUp
          annotations:
            description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
              only {{ printf "%.2f" $value }}% available space left and is filling up.
            runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
            summary: Filesystem is predicted to run out of space within the next 24 hours.
          expr: |
            (
              node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
            and
              predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
            and
              node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
            )
          for: 1h
          labels:
            severity: warning
        - alert: NodeFilesystemSpaceFillingUp
          annotations:
            description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
              only {{ printf "%.2f" $value }}% available space left and is filling up fast.
            runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
            summary: Filesystem is predicted to run out of space within the next 4 hours.
          expr: |
            (
              node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 10
            and
              predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
            and
              node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
            )
          for: 1h
          labels:
            severity: critical

Comment 5 Junqi Zhao 2022-05-16 01:02:36 UTC
based on comment 2 and comment 4, set to VERIFIED

Comment 8 errata-xmlrpc 2022-05-23 13:25:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2258