Description of problem
======================
Right after installation of an internal LSO StorageCluster, I see that there are 12 KubePersistentVolumeFillingUp alerts firing (one for each LSO device).

Version-Release number of selected component
============================================
OCP 4.8.0-0.nightly-2021-06-16-020345
LSO 4.8.0-202106102328
OCS 4.8.0-418.ci

How reproducible
================
4/4

Steps to Reproduce
==================
1. Install OCP on vSphere, with 3 master and 6 worker nodes, with 2 local storage devices per worker node (for LSO).
2. Install LSO and OCS operators.
3. Use the "Create Storage Cluster" wizard in the OCP Console to start setup of a Storage Cluster in "Internal - Attached devices" mode.
4. When installation of the storage cluster finishes, check firing alerts.

Actual results
==============
In my case (with 12 LSO devices), I see 12 KubePersistentVolumeFillingUp alerts firing (one for each local PV). Looking into one such alert I see:

> KubePersistentVolumeFillingUp Critical
> The PersistentVolume claimed by ocs-deviceset-ocs-1-data-3ldm77 in Namespace openshift-storage is only 0% free.

See also screenshot #1.

Expected results
================
KubePersistentVolumeFillingUp alerts are not firing.

Additional info
===============
This wasn't happening in previous OCS releases => flagging as a regression. OCS is using the local PVs for OSDs, and it imho doesn't make sense to evaluate storage utilization on this level.

See also must-gather tarball attached in a comment below.

The alert expression is:

```
kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}
  /
kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}
  < 0.03
```
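To illustrate why every local PV trips this expression: for a `volumeMode: Block` PVC the kubelet reports `kubelet_volume_stats_available_bytes` as 0 while capacity is the real device size, so the ratio is always 0, i.e. "0% free". A minimal Python sketch of the arithmetic (the sample values are hypothetical, not taken from the must-gather):

```python
# Hypothetical kubelet volume stats for one LSO-backed Block-mode PVC.
# For volumeMode: Block the kubelet cannot report filesystem usage,
# so available and used are 0 and only capacity is meaningful.
capacity_bytes = 100 * 1024**3   # 100 GiB device (illustrative)
available_bytes = 0              # always 0 for Block volumes
used_bytes = 0                   # always 0 for Block volumes

# The alert expression: available / capacity < 0.03
free_ratio = available_bytes / capacity_bytes
alert_fires = free_ratio < 0.03

print(f"{free_ratio:.0%} free, alert fires: {alert_fires}")
# → 0% free, alert fires: True
```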
Created attachment 1791781 [details] screenshot #1: ocs dashboard with list of firing alerts, see KubePersistentVolumeFillingUp there
Created attachment 1794347 [details] screenshot of gp2 PVC with 0% free after creation

This is not only a false alert for LSO PVCs; it also fires for PVCs created on gp2. Maybe the subject of this BZ should be adjusted? Any PVC with "volumeMode: Block" would be affected by this.
Niels/Anmol, what are the next steps here? OCP 4.8 is already in the Freeze stage and, if this is a blocker, we might have to target it for OCP 4.8.1 so that it can be shipped before OCS 4.8 releases.
I think the approach should be to adjust the alerting rules. These alerts (KubePersistentVolumeFillingUp) should not fire for `volumeMode: Block` volumes, as usage/free cannot be detected for them (both are reported as 0; only capacity is valid).
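One way such an adjustment can work (sketched below with made-up numbers, not the actual patch): adding an `and kubelet_volume_stats_used_bytes > 0` condition to the expression drops the Block-volume series, for which the kubelet reports both used and available as 0, while a genuinely nearly-full filesystem PV still alerts:

```python
# Illustrative series: a Block-mode LSO PV and a nearly full filesystem PV.
# Tuples are (available, used, capacity) in bytes; values are hypothetical.
volumes = {
    "local-block-pv": (0, 0, 100 * 1024**3),                   # Block: stats are 0
    "full-fs-pv": (1 * 1024**3, 99 * 1024**3, 100 * 1024**3),  # genuinely 1% free
}

def old_rule(avail, used, cap):
    # Original expression: fires on any volume reporting < 3% free.
    return avail / cap < 0.03

def new_rule(avail, used, cap):
    # Adjusted expression: additionally require used > 0, which is
    # never true for volumeMode: Block volumes.
    return avail / cap < 0.03 and used > 0

for name, (avail, used, cap) in volumes.items():
    print(name, old_rule(avail, used, cap), new_rule(avail, used, cap))
# → local-block-pv True False
# → full-fs-pv True True
```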
Reproduced with the steps from comment #c15.
Changes from upstream kubernetes-mixin to CMO will be synced through https://github.com/openshift/cluster-monitoring-operator/pull/1269
Tested with payload 4.9.0-0.nightly-2021-07-15-015134.

Following #c15, no KubePersistentVolumeFillingUp alerts are triggered.

Checked that the alert rules changed:

# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubePersistentVolumeFillingUp -A20

```
- alert: KubePersistentVolumeFillingUp
  annotations:
    description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }}
      in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free.
    summary: PersistentVolume is filling up.
  expr: |
    (
      kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
        /
      kubelet_volume_stats_capacity_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
    ) < 0.03
    and
    kubelet_volume_stats_used_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"} > 0
  for: 1m
  labels:
    severity: critical
- alert: KubePersistentVolumeFillingUp
  annotations:
    description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }}
      in Namespace {{ $labels.namespace }} is expected to fill up within four days.
      Currently {{ $value | humanizePercentage }} is available.
    summary: PersistentVolume is filling up.
  expr: |
    (
      kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
        /
      kubelet_volume_stats_capacity_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}
    ) < 0.15
    and
    kubelet_volume_stats_used_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"} > 0
    and
    predict_linear(kubelet_volume_stats_available_bytes{namespace=~"(openshift-.*|kube-.*|default)",job="kubelet", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
  for: 1h
  labels:
    severity: warning
```
*** Bug 1986917 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days