Bug 2074807 - NodeFilesystemSpaceFillingUp alert fires even before kubelet GC kicks in
Summary: NodeFilesystemSpaceFillingUp alert fires even before kubelet GC kicks in
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: All
OS: All
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Simon Pasquier
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks: 2077722
 
Reported: 2022-04-13 06:59 UTC by Arunprasad Rajkumar
Modified: 2022-08-10 11:07 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2077722
Environment:
Last Closed: 2022-08-10 11:07:02 UTC
Target Upstream Version:
Embargoed:




Links
  Github: openshift/cluster-monitoring-operator pull 1643 (open): "Bug 2074807: [bot] Update jsonnet dependencies", last updated 2022-04-21 18:06:50 UTC
  Github: prometheus-operator/kube-prometheus pull 1729 (merged): "Adjust NodeFilesystemSpaceFillingUp thresholds according default kubelet GC behavior", last updated 2022-04-21 18:06:50 UTC
  Red Hat Product Errata: RHSA-2022:5069, last updated 2022-08-10 11:07:26 UTC

Description Arunprasad Rajkumar 2022-04-13 06:59:49 UTC
Description of problem:

Previously [1] we attempted to align the alert thresholds with kubelet
garbage collection, but there was a misunderstanding about the GC
behavior and the alert still fired before GC came into play.

According to [2] and [3], kubelet image GC kicks in only when `imageGCHighThresholdPercent` is reached, which defaults to 85%. However, `NodeFilesystemSpaceFillingUp` is set to fire as soon as 80% usage is hit, so the alert goes off before the kubelet has any chance to reclaim space (see the sketch after the references below).

[1] https://github.com/prometheus-operator/kube-prometheus/pull/1357
[2] https://docs.openshift.com/container-platform/4.10/nodes/nodes/nodes-nodes-garbage-collection.html#nodes-nodes-garbage-collection-images_nodes-nodes-configuring
[3] https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/ 
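
To make the mismatch concrete, here is a minimal sketch of the warning condition before and after the adjustment, with the predict_linear and readonly clauses omitted for brevity. The pre-fix 20% figure is inferred from the "80% usage" statement above; the post-fix 15% figure matches the verified rule output in comment 4.

  # Pre-fix warning condition: fires below 20% available space (80% used),
  # i.e. before kubelet image GC starts at imageGCHighThresholdPercent=85.
  node_filesystem_avail_bytes{job="node-exporter",fstype!=""}
    / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20

  # Post-fix warning condition: fires below 15% available space (85% used),
  # i.e. only once kubelet GC has had the opportunity to reclaim space.
  node_filesystem_avail_bytes{job="node-exporter",fstype!=""}
    / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15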

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

Actual results:
NodeFilesystemSpaceFillingUp fires before kubelet GC kicks in

Expected results:

NodeFilesystemSpaceFillingUp shouldn't fire before kubelet GC kicks in


Additional info:

Comment 2 hongyan li 2022-04-22 08:19:57 UTC
Waiting for the PR to be included in a payload.

Comment 4 hongyan li 2022-04-24 02:17:34 UTC
Tested with payload 4.11.0-0.nightly-2022-04-23-153426:
% host=$(oc -n openshift-monitoring get route thanos-querier -ojsonpath={.spec.host})
% token=$(oc sa get-token prometheus-k8s -n openshift-monitoring)
% curl -H "Authorization: Bearer $token" -k "https://$host/api/v1/rules" | jq | grep -A10 NodeFilesystemSpaceFillingUp
            "name": "NodeFilesystemSpaceFillingUp",
            "query": "(node_filesystem_avail_bytes{fstype!=\"\",job=\"node-exporter\"} / node_filesystem_size_bytes{fstype!=\"\",job=\"node-exporter\"} * 100 < 10 and predict_linear(node_filesystem_avail_bytes{fstype!=\"\",job=\"node-exporter\"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!=\"\",job=\"node-exporter\"} == 0)",
            "duration": 3600,
            "labels": {
              "prometheus": "openshift-monitoring/k8s",
              "severity": "critical"
            },
            "annotations": {
              "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available space left and is filling up fast.",
              "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md",
              "summary": "Filesystem is predicted to run out of space within the next 4 hours."
            },
            "alerts": [],
            "health": "ok",
            "evaluationTime": 0.00215177,
            "lastEvaluation": "2022-04-24T02:15:36.216317682Z",
            "type": "alerting"
          },
          {
            "state": "inactive",
            "name": "NodeFilesystemSpaceFillingUp",
            "query": "(node_filesystem_avail_bytes{fstype!=\"\",job=\"node-exporter\"} / node_filesystem_size_bytes{fstype!=\"\",job=\"node-exporter\"} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{fstype!=\"\",job=\"node-exporter\"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!=\"\",job=\"node-exporter\"} == 0)",
            "duration": 3600,
            "labels": {
              "prometheus": "openshift-monitoring/k8s",
              "severity": "warning"
            },
            "annotations": {
              "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available space left and is filling up.",
              "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md",
              "summary": "Filesystem is predicted to run out of space within the next 24 hours."
            },
            "alerts": [],
            "health": "ok",
            "evaluationTime": 0.002492956,
            "lastEvaluation": "2022-04-24T02:15:36.21382262Z",
            "type": "alerting"
          },
          {
            "state": "inactive",

Comment 9 errata-xmlrpc 2022-08-10 11:07:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

