Bug 1897674 - [Tracker for OCP bug 1903464 , bug 1907830] Critical PrometheusRuleFailures alert "Prometheus has failed to evaluate 10 rules in the last 5m" appears after installation
Summary: [Tracker for OCP bug 1903464 , bug 1907830] Critical PrometheusRuleFailures a...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ---
Assignee: Anmol Sachan
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1903464 1907830 1908566
Blocks:
 
Reported: 2020-11-13 18:10 UTC by Martin Bukatovic
Modified: 2024-06-13 23:25 UTC
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-07 13:57:19 UTC
Embargoed:


Attachments (Terms of Use)
screenshot #1: Cluster dashboard of 4.6 cluster on GCP right after installation (163.15 KB, image/png)
2020-11-27 12:56 UTC, Martin Bukatovic


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-ci pull 4082 0 None closed Add PrometheusRuleFailures alert test 2021-04-06 09:47:22 UTC

Description Martin Bukatovic 2020-11-13 18:10:01 UTC
Description of problem
======================

When OCS is installed on OCP cluster, PrometheusRuleFailures alert appears,
noting that Prometheus has failed to evaluate 10 rules in the last 5m.

Version-Release number of selected component
============================================

OCP 4.7.0-0.nightly-2020-11-12-032522
OCS 4.7.0-158.ci

How reproducible
================

2/2

Steps to Reproduce
==================

1. Install OCP/OCS cluster
2. Open the OCP Console and go to Home -> Overview Dashboard

Actual results
==============

The following Prometheus alerts are shown:

- Prometheus openshift-monitoring/prometheus-k8s-1 has failed to evaluate 10 rules in the last 5m. (Critical)
- Prometheus openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m. (Critical)

Expected results
================

There are no Prometheus-related alerts raised. Prometheus can evaluate all rules
without any problems.

Additional info
===============

We used ocs-ci defaults to deploy the cluster on GCP.

The alert in question uses the following query:

```
increase(prometheus_rule_evaluation_failures_total{job=~"prometheus-k8s|prometheus-user-workload"}[5m]) > 0
```
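
The failure counter behind this alert can also be checked directly against one of the
Prometheus replicas. A minimal sketch, assuming direct access to the Prometheus
container's local web port via `oc port-forward` (the port-forward approach and the
optional jq pretty-printing are my assumptions, not part of the original report):

```
# forward the Prometheus web port of one replica to the local machine
$ oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &

# run the same expression the alert uses; a non-empty result means failures
$ curl -s --data-urlencode \
    'query=increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s"}[5m])' \
    http://localhost:9090/api/v1/query | jq '.data.result'
```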

When I fetch logs from the prometheus container of one of the Prometheus pods:

```
$ oc logs pod/prometheus-k8s-0 -c prometheus -n openshift-monitoring > prometheus.log
```
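
To see how often rule evaluation fails and which rule groups are affected, grepping the
dumped log is usually enough. A rough sketch; the patterns assume the `Evaluating rule
failed` and `group=` fields visible in the excerpt below:

```
# count the logged evaluation failures
$ grep -c 'Evaluating rule failed' prometheus.log

# list the affected rule groups with their failure counts
$ grep 'Evaluating rule failed' prometheus.log | grep -o 'group=[^ ]*' | sort | uniq -c
```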

I see that Prometheus complains about some OCS rule:

```
level=warn ts=2020-11-12T10:54:36.763Z caller=manager.go:598 component="rule manager" group=kubernetes.rules msg="Evaluating rule failed" rule="record: cluster:kubelet_volume_stats_used_bytes:provisioner:sum\nexpr: sum by(provisioner) (kubelet_volume_stats_used_bytes * on(namespace, persistentvolumeclaim) group_right() (kube_persistentvolumeclaim_info * on(storageclass) group_left(provisioner) kube_storageclass_info))\n" err="found duplicate series for the match group {namespace=\"openshift-image-registry\", persistentvolumeclaim=\"registry-cephfs-rwx-pvc\"} on the left hand-side of the operation: [{__name__=\"kubelet_volume_stats_used_bytes\", endpoint=\"https-metrics\", instance=\"10.0.32.4:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"openshift-image-registry\", node=\"mbukatov-1112a-bz-ctvx4-worker-d-vf95w.c.ocs4-283313.internal\", persistentvolumeclaim=\"registry-cephfs-rwx-pvc\", service=\"kubelet\"}, {__name__=\"kubelet_volume_stats_used_bytes\", endpoint=\"https-metrics\", instance=\"10.0.32.3:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"openshift-image-registry\", node=\"mbukatov-1112a-bz-ctvx4-worker-c-xd4fm.c.ocs4-283313.internal\", persistentvolumeclaim=\"registry-cephfs-rwx-pvc\", service=\"kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
```

This warning repeats in the log, and the nature of the problem suggests that
it's related to the alert.
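
For context on why the match fails: the registry's CephFS-backed RWX PVC is mounted on
more than one node, so the kubelet on each of those nodes reports its own
kubelet_volume_stats_used_bytes series with the same (namespace, persistentvolumeclaim)
pair, and with group_right() the left-hand side of the join must be unique per match
group. One way to make the match unique is to collapse the per-node duplicates before
joining; this is only a sketch, not necessarily the exact change made on the OCP side:

```
sum by(provisioner) (
    # collapse per-node duplicates so each (namespace, persistentvolumeclaim)
    # appears only once on the left-hand side of the join
    max by(namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
  * on(namespace, persistentvolumeclaim) group_right()
    (kube_persistentvolumeclaim_info * on(storageclass) group_left(provisioner) kube_storageclass_info)
)
```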

Comment 2 Martin Bukatovic 2020-11-13 18:12:17 UTC
Full version report
===================

cluster channel: stable-4.7
cluster version: 4.7.0-0.nightly-2020-11-12-032522
cluster image: registry.svc.ci.openshift.org/ocp/release@sha256:612d1b2cf58677b07128490eb60c20ee5f0647fef9e3d087c73aded87af93216

storage namespace openshift-cluster-storage-operator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bfc29fc0584d1770bc965ba8de2d09c405322f6a40d3101bd0ca3703429d947
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bfc29fc0584d1770bc965ba8de2d09c405322f6a40d3101bd0ca3703429d947
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2b28e8ff5ee759e2706ef8b77b625b324927cc48d44396a2847ee8d038d900c0
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2b28e8ff5ee759e2706ef8b77b625b324927cc48d44396a2847ee8d038d900c0
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ebe6809285deb32cd141ad871bf596fb9f85326e98e0d0e9ead3399cee03faa
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ebe6809285deb32cd141ad871bf596fb9f85326e98e0d0e9ead3399cee03faa

storage namespace openshift-kube-storage-version-migrator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:15a7d4226fc97eb8cdeed28302d4315b6cb16f131e010aefa7b0b52360745872
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:15a7d4226fc97eb8cdeed28302d4315b6cb16f131e010aefa7b0b52360745872

storage namespace openshift-kube-storage-version-migrator-operator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209074eeed356e9f77e0c5684ab92b62bed8526ac1177da099570d918977644c
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209074eeed356e9f77e0c5684ab92b62bed8526ac1177da099570d918977644c

storage namespace openshift-storage
image quay.io/rhceph-dev/cephcsi@sha256:f0818e50f378f7dd9a4c1ea417a21413ee092bbf602b52a3bdf6fd1a39adea7a
 * quay.io/rhceph-dev/cephcsi@sha256:c5d2737d5cd5b0ec2f48b649338cad0c5a5564b6d4420b198fc7f7c5518c07ab
image quay.io/rhceph-dev/ose-csi-node-driver-registrar@sha256:ef3e9e1eed457866b5beb45415dd389a47b68b0e1d40dff0a42a7ea7bf96157b
 * quay.io/rhceph-dev/ose-csi-node-driver-registrar@sha256:d6c9c01f82058e11615ff4e70fb9ded29e37925a917fe72092b1229be565e693
image quay.io/rhceph-dev/ose-csi-external-attacher@sha256:3aaf8beb8ecc26a71660de0959cbcfd701ab5133dbe7319b5d60746cd9a8e4c9
 * quay.io/rhceph-dev/ose-csi-external-attacher@sha256:3aaf8beb8ecc26a71660de0959cbcfd701ab5133dbe7319b5d60746cd9a8e4c9
image quay.io/rhceph-dev/ose-csi-external-provisioner@sha256:92ea53f9b409f31a02d5220ceb6fc86d945e9465f78fbb9bf3523056ac53463c
 * quay.io/rhceph-dev/ose-csi-external-provisioner@sha256:92ea53f9b409f31a02d5220ceb6fc86d945e9465f78fbb9bf3523056ac53463c
image quay.io/rhceph-dev/ose-csi-external-resizer@sha256:9621ec39c25f1eeb0a3f0f712b4a10b4f2d02cd32dbe493f2a60eea16868e811
 * quay.io/rhceph-dev/ose-csi-external-resizer@sha256:39686454eb334c004e40412b715858a4bce56c5b4efc861f24f98bdfd01d5e89
image quay.io/rhceph-dev/ose-csi-external-snapshotter@sha256:6e71727a0526328a258709b705ae5b5bab9d4d6ef357ccfa71882914a6d98295
 * quay.io/rhceph-dev/ose-csi-external-snapshotter@sha256:6e71727a0526328a258709b705ae5b5bab9d4d6ef357ccfa71882914a6d98295
image quay.io/rhceph-dev/mcg-core@sha256:4fd42e1593f660573102487f80bceefdca94a00e1ca2231ae1f812d5569e9f63
 * quay.io/rhceph-dev/mcg-core@sha256:4fd42e1593f660573102487f80bceefdca94a00e1ca2231ae1f812d5569e9f63
image registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:6abfa44b8b4d7b45d83b1158865194cb64481148701977167e900e5db4e1eba3
 * registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:6abfa44b8b4d7b45d83b1158865194cb64481148701977167e900e5db4e1eba3
image quay.io/rhceph-dev/mcg-operator@sha256:041a13deba6cc420c68b84bcc2fb38123dff9542f32935d07b5a1529a30171e8
 * quay.io/rhceph-dev/mcg-operator@sha256:041a13deba6cc420c68b84bcc2fb38123dff9542f32935d07b5a1529a30171e8
image quay.io/rhceph-dev/ocs-operator@sha256:2868c5a4409de690182379cb32a9237c354be4c4a0786dd1cc555864c063f698
 * quay.io/rhceph-dev/ocs-operator@sha256:2474cc057a01d913fb7ae0c9b1ff011cf073d90740ea749af794e414c278f208
image quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
 * quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
image quay.io/rhceph-dev/rook-ceph@sha256:684b92c0059955a5ed6e647654c36616c2555f1ede3e0d43a368067db00b392f
 * quay.io/rhceph-dev/rook-ceph@sha256:5b33a6dfc6021ea2d3c4ce8f2302e84a86e9d47698c277f02c9a1ecae780ed1e

Comment 6 Martin Bukatovic 2020-11-27 12:54:58 UTC
With OCP 4.6.0-0.nightly-2020-11-26-234822 and OCS 4.6.0-160.ci on GCP, I can see the same issue as well.

I can also confirm that I see an issue with the image registry. Besides the already noted alerts:

- Critical PrometheusRuleFailures Prometheus openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m.
- Critical PrometheusRuleFailures Prometheus openshift-monitoring/prometheus-k8s-1 has failed to evaluate 10 rules in the last 5m.

I also see:

- Warning Image Registry Storage configuration has changed in the last 30 minutes. This change may have caused data loss.

But it was firing only for a brief period after installation, which may be why I didn't see it in the original report.

I need to check how it behaves if I disable the Prometheus reconfiguration that makes Prometheus store its data on OCS (this code is part of ocs-ci).

Comment 7 Martin Bukatovic 2020-11-27 12:56:01 UTC
Created attachment 1734095 [details]
screenshot #1: Cluster dashboard of 4.6 cluster on GCP right after installation

Comment 8 Martin Bukatovic 2020-11-27 17:50:23 UTC
(In reply to Martin Bukatovic from comment #6)
> I need to check how it behaves if I disable prometheus reconfiguration which
> makes prometheus to store data on OCS (this code is part of ocs-ci).

When I redeployed a cluster without using OCS for openshift-monitoring
(Prometheus and alertmanager) storage[1], I still noticed the same issue.

- OCP 4.6.0-0.nightly-2020-11-26-234822
- OCS 4.6.0-160.ci

[1] Setting persistent-monitoring to false in ocs-ci: https://github.com/red-hat-storage/ocs-ci/blob/0269048a15f9c86b7d41dce055ca87f5f77f8033/conf/examples/without_presistent_monitoring.yaml

Comment 11 Anmol Sachan 2020-11-30 14:21:04 UTC
The BZ has occurred because of a telemetry rule and has to be fixed in OCP. There is already a BZ for this, so closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1879520, which is already being worked on.

*** This bug has been marked as a duplicate of bug 1879520 ***

Comment 18 Yaniv Kaul 2020-12-16 12:47:39 UTC
Moving to POST, as this BZ is tracking 1 BZ that is already in VERIFIED state and 1 that is in POST.

Comment 19 Nishanth Thomas 2021-02-04 06:32:18 UTC
Moving to ON_QA as the dependent bugs have already been moved to 'VERIFIED'.

Comment 20 Elad 2021-02-04 08:09:26 UTC
Tracked bugs 1903464 and 1907830 are now verified. Moving to VERIFIED

