Bug 2074807 - NodeFilesystemSpaceFillingUp alert fires even before kubelet GC kicks in
Summary: NodeFilesystemSpaceFillingUp alert fires even before kubelet GC kicks in
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: All
OS: All
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Simon Pasquier
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks: 2077722
 
Reported: 2022-04-13 06:59 UTC by Arunprasad Rajkumar
Modified: 2022-08-10 11:07 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2077722
Environment:
Last Closed: 2022-08-10 11:07:02 UTC
Target Upstream Version:
Embargoed:




Links
  Github: openshift/cluster-monitoring-operator pull 1643 (open): "Bug 2074807: [bot] Update jsonnet dependencies", last updated 2022-04-21 18:06:50 UTC
  Github: prometheus-operator/kube-prometheus pull 1729 (merged): "Adjust NodeFilesystemSpaceFillingUp thresholds according default kubelet GC behavior", last updated 2022-04-21 18:06:50 UTC
  Red Hat Product Errata: RHSA-2022:5069, last updated 2022-08-10 11:07:26 UTC

Description Arunprasad Rajkumar 2022-04-13 06:59:49 UTC
Description of problem:

Previously [1] we attempted to align the alert thresholds with kubelet
garbage collection, but there was a misunderstanding about the GC
behavior and the alert still fired before GC came into play.

According to [2] and [3], kubelet image GC kicks in only when `imageGCHighThresholdPercent` is reached, which defaults to 85%. However, `NodeFilesystemSpaceFillingUp` is set to fire as soon as 80% usage is hit, so the alert goes off before the kubelet has any chance to reclaim space (see the sketch after the references below).

[1] https://github.com/prometheus-operator/kube-prometheus/pull/1357
[2] https://docs.openshift.com/container-platform/4.10/nodes/nodes/nodes-nodes-garbage-collection.html#nodes-nodes-garbage-collection-images_nodes-nodes-configuring
[3] https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/ 
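
To make the mismatch concrete, here is a minimal sketch of the warning condition before and after the adjustment, with the predict_linear and readonly clauses omitted for brevity. The pre-fix 20% figure is inferred from the "80% usage" statement above; the post-fix 15% figure matches the verified rule output in comment 4.

  # Pre-fix warning condition: fires below 20% available space (80% used),
  # i.e. before kubelet image GC starts at imageGCHighThresholdPercent=85.
  node_filesystem_avail_bytes{job="node-exporter",fstype!=""}
    / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20

  # Post-fix warning condition: fires below 15% available space (85% used),
  # i.e. only once kubelet GC has had the opportunity to reclaim space.
  node_filesystem_avail_bytes{job="node-exporter",fstype!=""}
    / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15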

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

Actual results:
NodeFilesystemSpaceFillingUp fires before kubelet GC kicks in

Expected results:

NodeFilesystemSpaceFillingUp shouldn't fire before kubelet GC kicks in


Additional info:

Comment 2 hongyan li 2022-04-22 08:19:57 UTC
Waiting for the PR to be included in a payload.

Comment 4 hongyan li 2022-04-24 02:17:34 UTC
Tested with payload 4.11.0-0.nightly-2022-04-23-153426:
% host=$(oc -n openshift-monitoring get route thanos-querier -ojsonpath={.spec.host})
% token=$(oc sa get-token prometheus-k8s -n openshift-monitoring)
% curl -H "Authorization: Bearer $token" -k "https://$host/api/v1/rules" | jq | grep -A10 NodeFilesystemSpaceFillingUp
            "name": "NodeFilesystemSpaceFillingUp",
            "query": "(node_filesystem_avail_bytes{fstype!=\"\",job=\"node-exporter\"} / node_filesystem_size_bytes{fstype!=\"\",job=\"node-exporter\"} * 100 < 10 and predict_linear(node_filesystem_avail_bytes{fstype!=\"\",job=\"node-exporter\"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!=\"\",job=\"node-exporter\"} == 0)",
            "duration": 3600,
            "labels": {
              "prometheus": "openshift-monitoring/k8s",
              "severity": "critical"
            },
            "annotations": {
              "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available space left and is filling up fast.",
              "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md",
              "summary": "Filesystem is predicted to run out of space within the next 4 hours."
            },
            "alerts": [],
            "health": "ok",
            "evaluationTime": 0.00215177,
            "lastEvaluation": "2022-04-24T02:15:36.216317682Z",
            "type": "alerting"
          },
          {
            "state": "inactive",
            "name": "NodeFilesystemSpaceFillingUp",
            "query": "(node_filesystem_avail_bytes{fstype!=\"\",job=\"node-exporter\"} / node_filesystem_size_bytes{fstype!=\"\",job=\"node-exporter\"} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{fstype!=\"\",job=\"node-exporter\"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!=\"\",job=\"node-exporter\"} == 0)",
            "duration": 3600,
            "labels": {
              "prometheus": "openshift-monitoring/k8s",
              "severity": "warning"
            },
            "annotations": {
              "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available space left and is filling up.",
              "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeFilesystemSpaceFillingUp.md",
              "summary": "Filesystem is predicted to run out of space within the next 24 hours."
            },
            "alerts": [],
            "health": "ok",
            "evaluationTime": 0.002492956,
            "lastEvaluation": "2022-04-24T02:15:36.21382262Z",
            "type": "alerting"
          },
          {
            "state": "inactive",

Comment 9 errata-xmlrpc 2022-08-10 11:07:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

