Bug 1973147
| Summary: | KubePersistentVolumeFillingUp - False Alert firing for PVCs with volumeMode as block | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Martin Bukatovic <mbukatov> |
| Component: | Monitoring | Assignee: | Arunprasad Rajkumar <arajkuma> |
| Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | adeshpan, alchan, amuller, anpicker, ansaini, arajkuma, assingh, ddelcian, dholler, dmoessne, ebrizuel, erooth, etamir, hongyli, kgershon, mbarrett, mpandey, muagarwa, mzali, ndevos, nthomas, ocs-bugs, ssonigra, tdale |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | 4.9.0 | Flags: | arajkuma: needinfo-, arajkuma: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Cloned As: | 1984283 (view as bug list) | Environment: | |
| Last Closed: | 2021-10-18 17:34:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1977460, 1984283, 1984817 | | |
| Attachments: | | | |
Created attachment 1791781: screenshot #1, OCS dashboard with the list of firing alerts; note KubePersistentVolumeFillingUp there.

Created attachment 1794347: screenshot of a gp2 PVC showing 0% free right after creation.
This is a false alert not only for LSO PVCs, but also for PVCs created on gp2. Maybe the subject of this BZ should be adjusted? Any PVC with `volumeMode: Block` is affected by this.
Niels/Anmol, what are the next steps here? OCP 4.8 is already in the Freeze stage, and if this is a blocker, we might have to target it for OCP 4.8.1 so that it can be shipped before OCS 4.8 releases.

I think the approach should be to adjust the alerting rules. These alerts (KubePersistentVolumeFillingUp) should not fire for `volumeMode: Block` volumes, as usage/free cannot be detected for them (both are reported as 0; only capacity is valid).

Reproduced per comment #c15.

Changes from the upstream kubernetes-mixin will be synced to CMO through https://github.com/openshift/cluster-monitoring-operator/pull/1269

Tested with payload 4.9.0-0.nightly-2021-07-15-015134.

Following #c15, no KubePersistentVolumeFillingUp alerts are triggered.
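For reference, one way to double-check this against the cluster's Prometheus query API (a sketch; the query is illustrative, not taken from the verification run) is the built-in `ALERTS` metric:

```
# Active-alert bookkeeping metric built into Prometheus; after the fix
# this should return no series for the alert in question.
ALERTS{alertname="KubePersistentVolumeFillingUp", alertstate="firing"}
```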
Checked that the alert rule has changed:
```
# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -o yaml | grep KubePersistentVolumeFillingUp -A 20
- alert: KubePersistentVolumeFillingUp
  annotations:
    description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
      }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
      }} free.
    summary: PersistentVolume is filling up.
  expr: |
    (
      kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
        /
      kubelet_volume_stats_capacity_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
    ) < 0.03
    and
    kubelet_volume_stats_used_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"} > 0
  for: 1m
  labels:
    severity: critical
- alert: KubePersistentVolumeFillingUp
  annotations:
    description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim
      }} in Namespace {{ $labels.namespace }} is expected to fill up within four
      days. Currently {{ $value | humanizePercentage }} is available.
    summary: PersistentVolume is filling up.
  expr: |
    (
      kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
        /
      kubelet_volume_stats_capacity_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
    ) < 0.15
    and
    kubelet_volume_stats_used_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"} > 0
    and
    predict_linear(kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
  for: 1h
  labels:
    severity: warning
```
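The added `kubelet_volume_stats_used_bytes ... > 0` term is what keeps block volumes out, since they report 0 used bytes. As a sanity check, the rules could be exercised offline with a `promtool` unit test; below is a minimal sketch (the file name `rules.yaml`, the PVC name, and the 100-byte capacity are hypothetical):

```
# rules-test.yaml -- hypothetical promtool unit test: with used and
# available bytes both 0 (a block-mode PVC), the fixed rule must stay silent.
rule_files:
  - rules.yaml               # assumed to contain the rules shown above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'kubelet_volume_stats_capacity_bytes{namespace="openshift-storage",job="kubelet",metrics_path="/metrics",persistentvolumeclaim="example-block-pvc"}'
        values: '100x60'
      - series: 'kubelet_volume_stats_available_bytes{namespace="openshift-storage",job="kubelet",metrics_path="/metrics",persistentvolumeclaim="example-block-pvc"}'
        values: '0x60'
      - series: 'kubelet_volume_stats_used_bytes{namespace="openshift-storage",job="kubelet",metrics_path="/metrics",persistentvolumeclaim="example-block-pvc"}'
        values: '0x60'
    alert_rule_test:
      - eval_time: 30m
        alertname: KubePersistentVolumeFillingUp
        exp_alerts: []       # no alert expected for block volumes
```

Running `promtool test rules rules-test.yaml` should report success, confirming the rule no longer fires when both used and available bytes are 0.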
*** Bug 1986917 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days
Description of problem
======================

Right after installation of an internal LSO StorageCluster, I see that there are 12 KubePersistentVolumeFillingUp alerts firing (one for each LSO device).

Version-Release number of selected component
============================================

OCP 4.8.0-0.nightly-2021-06-16-020345
LSO 4.8.0-202106102328
OCS 4.8.0-418.ci

How reproducible
================

4/4

Steps to Reproduce
==================

1. Install OCP on vSphere, with 3 master and 6 worker nodes, with 2 local storage devices per worker node (for LSO).
2. Install the LSO and OCS operators.
3. Use the "Create Storage Cluster" wizard in the OCP Console to start setup of a Storage Cluster in "Internal - Attached devices" mode.
4. When installation of the storage cluster finishes, check the firing alerts.

Actual results
==============

In my case (with 12 LSO devices), I see 12 KubePersistentVolumeFillingUp alerts firing (one for each local PV). Looking into one such alert I see:

> KubePersistentVolumeFillingUp Critical
> The PersistentVolume claimed by ocs-deviceset-ocs-1-data-3ldm77 in Namespace openshift-storage is only 0% free.

See also screenshot #1.

Expected results
================

KubePersistentVolumeFillingUp alerts are not firing.

Additional info
===============

This wasn't happening in previous OCS releases => flagging as a regression. OCS is using the local PVs for OSDs, and it imho doesn't make sense to evaluate storage utilization on this level. See also the must-gather tarball attached in a comment below.

The alert expression is:

```
kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}
  /
kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}
< 0.03
```
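For completeness: on a `volumeMode: Block` PVC the kubelet reports 0 for both available and used bytes while capacity stays valid, so the expression above degenerates to 0 / capacity = 0, which is below the 0.03 threshold, and the alert fires. A hedged PromQL sketch to list such block-mode PVCs (the namespace filter is illustrative):

```
# PVCs with known capacity but zero used bytes -- on this cluster these
# are the block-mode volumes that trip the old rule (0 / capacity = 0).
kubelet_volume_stats_capacity_bytes{namespace="openshift-storage"} > 0
unless on(namespace, persistentvolumeclaim)
kubelet_volume_stats_used_bytes{namespace="openshift-storage"} > 0
```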