Bug 1973147 - KubePersistentVolumeFillingUp - False Alert firing for PVCs with volumeMode as block.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Arunprasad Rajkumar
QA Contact: hongyan li
URL:
Whiteboard:
Duplicates: 1986917
Depends On:
Blocks: 1977460 1984283 1984817
Reported: 2021-06-17 10:36 UTC by Martin Bukatovic
Modified: 2023-09-15 01:34 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1984283
Environment:
Last Closed: 2021-10-18 17:34:57 UTC
Target Upstream Version:
Embargoed:
arajkuma: needinfo-


Attachments
screenshot #1: ocs dashboard with list of firing alerts, see KubePersistentVolumeFillingUp there (374.92 KB, image/png)
2021-06-17 10:43 UTC, Martin Bukatovic
screenshot of gp2 PVC with 0% free after creation (86.11 KB, image/png)
2021-06-25 11:08 UTC, Niels de Vos


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github kubernetes-monitoring kubernetes-mixin pull 636 | 0 | None | closed | fix: KubePersistentVolumeFillingUp - false alert fires for PVCs with volumeMode as block | 2021-07-08 06:49:00 UTC
Github openshift cluster-monitoring-operator pull 1269 | 0 | None | closed | jsonnet: pull latest deps | 2021-07-20 13:54:51 UTC
Red Hat Knowledge Base (Solution) 6240171 | 0 | None | None | None | 2021-08-04 20:04:48 UTC
Red Hat Product Errata RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:35:22 UTC

Description Martin Bukatovic 2021-06-17 10:36:55 UTC
Description of problem
======================

Right after installation of an internal LSO StorageCluster, I see 12
KubePersistentVolumeFillingUp alerts firing (one for each LSO device).

Version-Release number of selected component
============================================

OCP 4.8.0-0.nightly-2021-06-16-020345
LSO 4.8.0-202106102328
OCS 4.8.0-418.ci

How reproducible
================

4/4

Steps to Reproduce
==================

1. Install OCP on vSphere, with 3 master and 6 worker nodes, with 2 local
   storage devices per worker node (for LSO).

2. Install LSO and OCS operators.

3. Use "Create Storage Cluster" wizard in OCP Console to start setup of Storage
   Cluster in "Internal - Attached devices" mode.

4. When installation of storage cluster finishes, check firing alerts.

Actual results
==============

In my case (with 12 LSO devices), I see 12 KubePersistentVolumeFillingUp alerts
firing (one for each local pv).

Looking into one of such alerts I see:

> KubePersistentVolumeFillingUp Critical
> The PersistentVolume claimed by ocs-deviceset-ocs-1-data-3ldm77 in Namespace openshift-storage is only 0% free.

See also screenshot #1.

Expected results
================

No KubePersistentVolumeFillingUp alerts are firing.

Additional info
===============

This wasn't happening in previous OCS releases => flagging as a regression.

OCS is using the local PVs for OSDs, and it imho doesn't make sense to evaluate
storage utilization on this level.

See also must gather tarball attached in a comment below.

Alert expression is:

```
kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"} < 0.03
```

Comment 4 Martin Bukatovic 2021-06-17 10:43:42 UTC
Created attachment 1791781
screenshot #1: ocs dashboard with list of firing alerts, see KubePersistentVolumeFillingUp there

Comment 10 Niels de Vos 2021-06-25 11:08:50 UTC
Created attachment 1794347
screenshot of gp2 PVC with 0% free after creation

This is a false alert not only for LSO PVCs, but also for PVCs created on gp2. Maybe the subject of this BZ can be adjusted?

Any PVC with "volumeMode: Block" would be affected by this.

Comment 11 Mudit Agarwal 2021-06-28 12:14:42 UTC
Niels/Anmol, what are the next steps here? OCP 4.8 is already in the freeze stage, and if this is a blocker, we might have to target this for OCP 4.8.1 so that it can be shipped before OCS 4.8 releases.

Comment 12 Niels de Vos 2021-06-28 12:29:26 UTC
I think the approach should be to adjust the alerting rules. These alerts (KubePersistentVolumeFillingUp) should not get fired for `volumeMode: Block` volumes as usage/free can not be detected for them (both are set to 0, only capacity is valid).
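
To illustrate the point above, here is a minimal Python sketch of the alert condition before and after adding a guard on used bytes. It only mirrors the arithmetic of the rule expressions quoted in this bug. The function names are made up for illustration, the `for:` duration and label matching are ignored, and treating a zero capacity as "does not fire" is an assumption about how the division result would be filtered.

```python
CRITICAL_THRESHOLD = 0.03  # "< 0.03" from the alert expression in this bug


def fires_before_fix(available_bytes: float, capacity_bytes: float) -> bool:
    """Original condition: available / capacity < 0.03."""
    if capacity_bytes <= 0:
        # Assumption: no usable ratio, so the condition does not match.
        return False
    return available_bytes / capacity_bytes < CRITICAL_THRESHOLD


def fires_after_fix(available_bytes: float, capacity_bytes: float,
                    used_bytes: float) -> bool:
    """Same ratio check, but only for volumes that report used bytes > 0."""
    return fires_before_fix(available_bytes, capacity_bytes) and used_bytes > 0


if __name__ == "__main__":
    gib = 1024 ** 3
    # Block-mode volume: available and used are both 0, only capacity is set.
    print(fires_before_fix(0, 100 * gib))
    print(fires_after_fix(0, 100 * gib, 0))
    # Filesystem volume that really is almost full.
    print(fires_after_fix(2 * gib, 100 * gib, 98 * gib))
```

With available reported as 0, the ratio is 0 and falls below the threshold regardless of capacity, which is why the unguarded condition matches every block-mode volume.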

Comment 18 hongyan li 2021-07-02 07:24:39 UTC
Reproduced following comment #c15.

Comment 20 Arunprasad Rajkumar 2021-07-08 06:51:05 UTC
Changes from upstream kubernetes-mixin to CMO will be synced through https://github.com/openshift/cluster-monitoring-operator/pull/1269

Comment 22 hongyan li 2021-07-16 02:32:58 UTC
Tested with payload 4.9.0-0.nightly-2021-07-15-015134

Following #c15, no KubePersistentVolumeFillingUp alert is triggered.

Checked that the alert rule changed:
# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubePersistentVolumeFillingUp -A20
      - alert: KubePersistentVolumeFillingUp
        annotations:
          description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
            }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
            }} free.
          summary: PersistentVolume is filling up.
        expr: |
          (
            kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
              /
            kubelet_volume_stats_capacity_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
          ) < 0.03
          and
          kubelet_volume_stats_used_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"} > 0
        for: 1m
        labels:
          severity: critical
      - alert: KubePersistentVolumeFillingUp
        annotations:
          description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim
            }} in Namespace {{ $labels.namespace }} is expected to fill up within four
            days. Currently {{ $value | humanizePercentage }} is available.
          summary: PersistentVolume is filling up.
        expr: |
          (
            kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
              /
            kubelet_volume_stats_capacity_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
          ) < 0.15
          and
          kubelet_volume_stats_used_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"} > 0
          and
          predict_linear(kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
        for: 1h
        labels:
          severity: warning
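
For readers unfamiliar with the warning variant of the rule above, the following Python sketch approximates its three conditions. `predict_linear` here is a plain least-squares fit written for illustration and is only an approximation of the PromQL function of the same name. The helper names, the sample layout (`(timestamp_seconds, available_bytes)` pairs), and the handling of zero capacity are assumptions, and the `for: 1h` duration is ignored.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, available bytes)

FOUR_DAYS_S = 4 * 24 * 3600  # horizon used in the rule


def predict_linear(samples: Sequence[Sample], horizon_s: float,
                   now_s: float) -> float:
    """Least-squares extrapolation of the samples to now_s + horizon_s."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    value_now = mean_v + slope * (now_s - mean_t)
    return value_now + slope * horizon_s


def warning_fires(samples: Sequence[Sample], capacity_bytes: float,
                  used_bytes: float, now_s: float) -> bool:
    """Ratio < 0.15, used > 0, and predicted available < 0 in four days."""
    if capacity_bytes <= 0:
        # Assumption: no usable ratio, so the condition does not match.
        return False
    available_now = samples[-1][1]
    return (
        available_now / capacity_bytes < 0.15
        and used_bytes > 0
        and predict_linear(samples, FOUR_DAYS_S, now_s) < 0
    )
```

A block-mode volume that reports 0 for both available and used bytes fails the `used_bytes > 0` check, so this variant stays silent for it as well.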

Comment 30 Martin Bukatovic 2021-07-28 16:54:10 UTC
*** Bug 1986917 has been marked as a duplicate of this bug. ***

Comment 47 errata-xmlrpc 2021-10-18 17:34:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 52 Red Hat Bugzilla 2023-09-15 01:34:35 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.

