Bug 1987263 - fsSpaceFillingUpWarningThreshold not aligned to Kubernetes Garbage Collection Threshold
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Arunprasad Rajkumar
QA Contact: Junqi Zhao
Docs Contact: Claire Bremble
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-29 11:12 UTC by Simon Reber
Modified: 2022-04-12 19:05 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, a false positive `NodeFilesystemSpaceFillingUp` alert was triggered when filesystem space was occupied by many Docker images. For this release, the threshold to fire the `NodeFilesystemSpaceFillingUp` warning alert is now reduced to 20% space available, instead of 40%, which stops the false positive alert from firing.
Clone Of:
Environment:
Last Closed: 2022-03-12 04:36:27 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1364 | 0 | None | Merged | Bug 1987263: fsSpaceFillingUpWarningThreshold not aligned to Kubernetes Garbage Collection Threshold | 2022-04-12 19:05:31 UTC
Github | prometheus-operator kube-prometheus pull 1357 | 0 | None | Merged | Adjust node filesystem space filling up warning threshold to 20% | 2022-04-12 19:05:29 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-12 04:37:05 UTC

Description Simon Reber 2021-07-29 11:12:24 UTC
Description of problem:

Based on https://github.com/prometheus-operator/kube-prometheus/issues/294, fsSpaceFillingUpCriticalThreshold was adjusted as it was in conflict with the Kubernetes Garbage Collection default values (nicely documented in https://www.openshift.com/blog/image-garbage-collection-in-openshift and https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/).

However, with the default value for imageGCLowThresholdPercent, NodeFilesystemSpaceFillingUp can still fire at warning level, because the warning condition already applies once less than 40% of the space is available.

Hence the alert may keep firing even after Garbage Collection has run and done the necessary clean-up.

This is why fsSpaceFillingUpWarningThreshold should be adjusted to a reasonable value. If lowering it to 20% is not possible, we need to talk to the kubelet Engineering group to understand whether the Garbage Collection default values need to be adjusted.
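
To illustrate the mismatch, here is a minimal Python sketch of the arithmetic. It assumes the upstream kubelet defaults (imageGCHighThresholdPercent=85, imageGCLowThresholdPercent=80, both expressed as percent of disk used), that images are the main consumer of space, and that Garbage Collection frees space exactly down to the low threshold. The function names are hypothetical and only cover the space condition of the alert, not the trend prediction.

```python
# Assumed kubelet image GC defaults, expressed as percent of disk *used*.
IMAGE_GC_HIGH_THRESHOLD_PERCENT = 85  # GC starts once usage exceeds this
IMAGE_GC_LOW_THRESHOLD_PERCENT = 80   # GC frees images until usage is back here


def available_percent(used_percent):
    """Convert a 'percent used' value into 'percent available'."""
    return 100 - used_percent


def space_condition_met_after_gc(warning_threshold_available_percent):
    """Return True if the alert's space condition still holds after GC.

    Assumption: after a GC run, usage sits at the low threshold, so the
    available space is 100 - imageGCLowThresholdPercent.
    """
    available_after_gc = available_percent(IMAGE_GC_LOW_THRESHOLD_PERCENT)
    return available_after_gc < warning_threshold_available_percent


if __name__ == "__main__":
    for threshold in (40, 20):
        print(threshold, space_condition_met_after_gc(threshold))
```

Under these assumptions, a 40% threshold leaves the space condition satisfied even right after clean-up (20% available is below 40%), while a 20% threshold does not.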

Version-Release number of selected component (if applicable):

- OpenShift Container Platform 4.x

How reproducible:

- Always

Steps to Reproduce:
1. N/A

Actual results:

The NodeFilesystemSpaceFillingUp alert fires before Kubernetes Garbage Collection has kicked in, resulting in a false-positive alert.

Expected results:

NodeFilesystemSpaceFillingUp should only fire when Kubernetes Garbage Collection has happened and the expected threshold could not be reached, so that manual action may be required.

Additional info:

Comment 36 Junqi Zhao 2021-09-10 06:54:02 UTC
Checked with 4.10.0-0.nightly-2021-09-09-225032; the warning alert now uses a 20% threshold:
      - alert: NodeFilesystemSpaceFillingUp
        annotations:
          description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
            only {{ printf "%.2f" $value }}% available space left and is filling up.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
          summary: Filesystem is predicted to run out of space within the next 24 hours.
        expr: |
          (
            node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20
          and
            predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
          and
            node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
          )
        for: 1h
        labels:
          severity: warning
      - alert: NodeFilesystemSpaceFillingUp
        annotations:
          description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
            only {{ printf "%.2f" $value }}% available space left and is filling up fast.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
          summary: Filesystem is predicted to run out of space within the next 4 hours.
        expr: |
          (
            node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
          and
            predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
          and
            node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
          )
        for: 1h
        labels:
          severity: critical
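
For readers less familiar with PromQL, the warning rule above can be approximated in Python as follows. This is only a sketch under stated assumptions: `predict_linear` is modelled as a plain least-squares fit extrapolated from the timestamp of the last sample, the `for: 1h` hold period is ignored, and the function and parameter names are hypothetical.

```python
def predict_linear(samples, horizon_seconds):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Assumption: the prediction starts from the last sample's timestamp.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    target_t = samples[-1][0] + horizon_seconds
    return mean_v + slope * (target_t - mean_t)


def warning_fires(avail_samples, size_bytes, readonly,
                  threshold_percent=20, horizon_seconds=24 * 60 * 60):
    """Approximate the three conditions of the warning rule.

    avail_samples: (timestamp_seconds, available_bytes) pairs from the
    last 6 hours, oldest first.
    """
    current_avail = avail_samples[-1][1]
    low_on_space = current_avail / size_bytes * 100 < threshold_percent
    running_out = predict_linear(avail_samples, horizon_seconds) < 0
    return low_on_space and running_out and not readonly


if __name__ == "__main__":
    # Hypothetical filesystem of size 100 losing 1 unit of space per hour.
    shrinking = [(h * 3600, 25.0 - h) for h in range(7)]
    print(warning_fires(shrinking, 100.0, readonly=False))
```

All three conditions must hold at once, so a filesystem that is below 20% available but not shrinking does not satisfy the rule.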

Comment 41 errata-xmlrpc 2022-03-12 04:36:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
