1987263 – fsSpaceFillingUpWarningThreshold not aligned to Kubernetes Garbage Collection Threshold

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.8
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Arunprasad Rajkumar
QA Contact:	Junqi Zhao
Docs Contact:	Claire Bremble
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-07-29 11:12 UTC by Simon Reber
Modified:	2024-10-01 19:05 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Previously, a false positive `NodeFilesystemSpaceFillingUp` alert was triggered when filesystem space was occupied by many Docker images. For this release, the threshold to fire the `NodeFilesystemSpaceFillingUp` warning alert is now reduced to 20% space available, instead of 40%, which stops the false positive alert from firing.
Clone Of:
Environment:
Last Closed:	2022-03-12 04:36:27 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1364	None	Merged	Bug 1987263: fsSpaceFillingUpWarningThreshold not aligned to Kubernetes Garbage Collection Threshold	2022-04-12 19:05:31 UTC
Github	prometheus-operator kube-prometheus pull 1357	None	Merged	Adjust node filesystem space filling up warning threshold to 20%	2022-04-12 19:05:29 UTC
Red Hat Product Errata	RHSA-2022:0056	None	None	None	2022-03-12 04:37:05 UTC

Description Simon Reber 2021-07-29 11:12:24 UTC

Description of problem:

Based on https://github.com/prometheus-operator/kube-prometheus/issues/294, fsSpaceFillingUpCriticalThreshold was adjusted as it was in conflict with the Kubernetes Garbage Collection default values (nicely documented in https://www.openshift.com/blog/image-garbage-collection-in-openshift and https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/).

However, with the default value of imageGCLowThresholdPercent, NodeFilesystemSpaceFillingUp can still fire at warning level, because the warning condition is already met when 40% of the space is still available.

Hence, even with Garbage Collection running, the alert may continue to fire even though the necessary clean-up has been done.

This is why fsSpaceFillingUpWarningThreshold should be adjusted to a reasonable value. If lowering it to 20% is not possible, the kubelet Engineering group should be consulted to determine whether the Garbage Collection default values need to be adjusted.
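
The mismatch can be illustrated with a small sketch. It assumes the upstream kubelet defaults from the KubeletConfiguration reference linked above (imageGCHighThresholdPercent 85, imageGCLowThresholdPercent 80); the function names are hypothetical and only serve the illustration.

```python
# Assumed kubelet defaults (disk *usage* percentages), not values from this cluster.
IMAGE_GC_HIGH_THRESHOLD_PERCENT = 85  # image GC starts above this usage
IMAGE_GC_LOW_THRESHOLD_PERCENT = 80   # image GC frees space down to this usage


def available_after_gc(low_threshold_percent=IMAGE_GC_LOW_THRESHOLD_PERCENT):
    """Percent of space available once image GC has reached its low threshold."""
    return 100 - low_threshold_percent


def warning_can_fire_despite_gc(warning_available_percent,
                                low_threshold_percent=IMAGE_GC_LOW_THRESHOLD_PERCENT):
    """True if the alert's space condition (available < threshold) can still
    hold after a successful image GC run."""
    return available_after_gc(low_threshold_percent) < warning_available_percent


if __name__ == "__main__":
    for threshold in (40, 20):
        print(threshold, warning_can_fire_despite_gc(threshold))
```

Under these assumptions, a warning threshold of 40% available can stay satisfied after Garbage Collection has finished, while a threshold of 20% cannot, which is the alignment requested here.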

Version-Release number of selected component (if applicable):

- OpenShift Container Platform 4.x

How reproducible:

- Always

Steps to Reproduce:
1. N/A

Actual results:

The NodeFilesystemSpaceFillingUp alert fires before Kubernetes Garbage Collection kicks in, resulting in a false-positive alert.

Expected results:

NodeFilesystemSpaceFillingUp should fire only after Kubernetes Garbage Collection has run and could not reach the expected threshold, so that manual intervention may be required.

Additional info:

Comment 36 Junqi Zhao 2021-09-10 06:54:02 UTC

Checked with 4.10.0-0.nightly-2021-09-09-225032:
      - alert: NodeFilesystemSpaceFillingUp
        annotations:
          description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
            only {{ printf "%.2f" $value }}% available space left and is filling up.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
          summary: Filesystem is predicted to run out of space within the next 24 hours.
        expr: |
          (
            node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20
          and
            predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
          and
            node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
          )
        for: 1h
        labels:
          severity: warning
      - alert: NodeFilesystemSpaceFillingUp
        annotations:
          description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has
            only {{ printf "%.2f" $value }}% available space left and is filling up fast.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md
          summary: Filesystem is predicted to run out of space within the next 4 hours.
        expr: |
          (
            node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
          and
            predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
          and
            node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
          )
        for: 1h
        labels:
          severity: critical
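
For readers less familiar with PromQL, the logic of the two expressions above can be approximated as follows. This is a rough sketch, not how Prometheus evaluates rules: `predict_linear` here is a plain least-squares fit that treats the last sample's timestamp as "now", the `for: 1h` clause is not modelled, and all function and parameter names are hypothetical.

```python
def predict_linear(samples, horizon_seconds):
    """Least-squares extrapolation of (timestamp_seconds, value) samples.

    Simplification: the newest sample's timestamp is used as the evaluation time.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    now = samples[-1][0]
    xs = [t - now for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x  # fitted value at "now" (x == 0)
    return intercept + slope * horizon_seconds


def space_filling_up(samples, size_bytes, threshold_percent, horizon_hours,
                     readonly=False):
    """Mirror the three 'and'-ed conditions of the alert expression."""
    avail_percent = samples[-1][1] / size_bytes * 100
    return (avail_percent < threshold_percent
            and predict_linear(samples, horizon_hours * 60 * 60) < 0
            and not readonly)


if __name__ == "__main__":
    hour = 60 * 60
    # Filesystem of size 100 losing 2 units per hour over the last 6 hours.
    samples = [(h * hour, 50 - 2 * h) for h in range(7)]
    print(space_filling_up(samples, 100, 40, 24))  # old warning threshold
    print(space_filling_up(samples, 100, 20, 24))  # new warning threshold
```

In this made-up series, 38% of the space is still available at the last sample, so the old 40% threshold would satisfy the warning condition while the new 20% threshold does not.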

Comment 41 errata-xmlrpc 2022-03-12 04:36:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
