Description of problem
======================

When OCS is installed on an OCP cluster, a PrometheusRuleFailures alert appears, noting that Prometheus has failed to evaluate 10 rules in the last 5m.

Version-Release number of selected component
============================================

OCP 4.7.0-0.nightly-2020-11-12-032522
OCS 4.7.0-158.ci

How reproducible
================

2/2

Steps to Reproduce
==================

1. Install OCP/OCS cluster
2. Open OCP Console and go to Home -> Overview Dashboard

Actual results
==============

The following Prometheus alerts are shown:

- Prometheus openshift-monitoring/prometheus-k8s-1 has failed to evaluate 10 rules in the last 5m. (Critical)
- Prometheus openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m. (Critical)

Expected results
================

No Prometheus related alerts are raised. Prometheus can load and evaluate all rules without any problems.

Additional info
===============

Defaults from ocs-ci were used to deploy the cluster on GCP.

The alert in question uses this query:

```
increase(prometheus_rule_evaluation_failures_total{job=~"prometheus-k8s|prometheus-user-workload"}[5m]) > 0
```

When I fetch logs from the prometheus container of one of the prometheus pods:

```
$ oc logs pod/prometheus-k8s-0 -c prometheus -n openshift-monitoring > prometheus.log
```

I see that Prometheus complains about an OCS related rule:

```
level=warn ts=2020-11-12T10:54:36.763Z caller=manager.go:598 component="rule manager" group=kubernetes.rules msg="Evaluating rule failed" rule="record: cluster:kubelet_volume_stats_used_bytes:provisioner:sum\nexpr: sum by(provisioner) (kubelet_volume_stats_used_bytes * on(namespace, persistentvolumeclaim) group_right() (kube_persistentvolumeclaim_info * on(storageclass) group_left(provisioner) kube_storageclass_info))\n" err="found duplicate series for the match group {namespace=\"openshift-image-registry\", persistentvolumeclaim=\"registry-cephfs-rwx-pvc\"} on the left hand-side of the operation: [{__name__=\"kubelet_volume_stats_used_bytes\", endpoint=\"https-metrics\", instance=\"10.0.32.4:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"openshift-image-registry\", node=\"mbukatov-1112a-bz-ctvx4-worker-d-vf95w.c.ocs4-283313.internal\", persistentvolumeclaim=\"registry-cephfs-rwx-pvc\", service=\"kubelet\"}, {__name__=\"kubelet_volume_stats_used_bytes\", endpoint=\"https-metrics\", instance=\"10.0.32.3:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"openshift-image-registry\", node=\"mbukatov-1112a-bz-ctvx4-worker-c-xd4fm.c.ocs4-283313.internal\", persistentvolumeclaim=\"registry-cephfs-rwx-pvc\", service=\"kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
```

This warning repeats throughout the log, and the nature of the problem suggests it is what triggers the alert.
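The error above suggests that the RWX PVC used by the image registry is mounted on more than one node, so more than one kubelet reports kubelet_volume_stats_used_bytes for it. To confirm this, the following diagnostic queries (a sketch, not part of any shipped rules; run each one separately in the Prometheus UI) can be used:

```
# Raw series behind the failing match group: an RWX PVC mounted on several
# nodes is reported by each node's kubelet, hence the duplicate series.
kubelet_volume_stats_used_bytes{namespace="openshift-image-registry", persistentvolumeclaim="registry-cephfs-rwx-pvc"}

# Generic check: list every PVC reported by more than one kubelet, i.e. every
# PVC that breaks the one-to-many match in the recording rule.
count by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes) > 1
```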
Full version report
===================

cluster channel: stable-4.7
cluster version: 4.7.0-0.nightly-2020-11-12-032522
cluster image: registry.svc.ci.openshift.org/ocp/release@sha256:612d1b2cf58677b07128490eb60c20ee5f0647fef9e3d087c73aded87af93216

storage namespace openshift-cluster-storage-operator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bfc29fc0584d1770bc965ba8de2d09c405322f6a40d3101bd0ca3703429d947
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bfc29fc0584d1770bc965ba8de2d09c405322f6a40d3101bd0ca3703429d947
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2b28e8ff5ee759e2706ef8b77b625b324927cc48d44396a2847ee8d038d900c0
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2b28e8ff5ee759e2706ef8b77b625b324927cc48d44396a2847ee8d038d900c0
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ebe6809285deb32cd141ad871bf596fb9f85326e98e0d0e9ead3399cee03faa
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ebe6809285deb32cd141ad871bf596fb9f85326e98e0d0e9ead3399cee03faa

storage namespace openshift-kube-storage-version-migrator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:15a7d4226fc97eb8cdeed28302d4315b6cb16f131e010aefa7b0b52360745872
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:15a7d4226fc97eb8cdeed28302d4315b6cb16f131e010aefa7b0b52360745872

storage namespace openshift-kube-storage-version-migrator-operator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209074eeed356e9f77e0c5684ab92b62bed8526ac1177da099570d918977644c
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209074eeed356e9f77e0c5684ab92b62bed8526ac1177da099570d918977644c

storage namespace openshift-storage
image quay.io/rhceph-dev/cephcsi@sha256:f0818e50f378f7dd9a4c1ea417a21413ee092bbf602b52a3bdf6fd1a39adea7a
 * quay.io/rhceph-dev/cephcsi@sha256:c5d2737d5cd5b0ec2f48b649338cad0c5a5564b6d4420b198fc7f7c5518c07ab
image quay.io/rhceph-dev/ose-csi-node-driver-registrar@sha256:ef3e9e1eed457866b5beb45415dd389a47b68b0e1d40dff0a42a7ea7bf96157b
 * quay.io/rhceph-dev/ose-csi-node-driver-registrar@sha256:d6c9c01f82058e11615ff4e70fb9ded29e37925a917fe72092b1229be565e693
image quay.io/rhceph-dev/ose-csi-external-attacher@sha256:3aaf8beb8ecc26a71660de0959cbcfd701ab5133dbe7319b5d60746cd9a8e4c9
 * quay.io/rhceph-dev/ose-csi-external-attacher@sha256:3aaf8beb8ecc26a71660de0959cbcfd701ab5133dbe7319b5d60746cd9a8e4c9
image quay.io/rhceph-dev/ose-csi-external-provisioner@sha256:92ea53f9b409f31a02d5220ceb6fc86d945e9465f78fbb9bf3523056ac53463c
 * quay.io/rhceph-dev/ose-csi-external-provisioner@sha256:92ea53f9b409f31a02d5220ceb6fc86d945e9465f78fbb9bf3523056ac53463c
image quay.io/rhceph-dev/ose-csi-external-resizer@sha256:9621ec39c25f1eeb0a3f0f712b4a10b4f2d02cd32dbe493f2a60eea16868e811
 * quay.io/rhceph-dev/ose-csi-external-resizer@sha256:39686454eb334c004e40412b715858a4bce56c5b4efc861f24f98bdfd01d5e89
image quay.io/rhceph-dev/ose-csi-external-snapshotter@sha256:6e71727a0526328a258709b705ae5b5bab9d4d6ef357ccfa71882914a6d98295
 * quay.io/rhceph-dev/ose-csi-external-snapshotter@sha256:6e71727a0526328a258709b705ae5b5bab9d4d6ef357ccfa71882914a6d98295
image quay.io/rhceph-dev/mcg-core@sha256:4fd42e1593f660573102487f80bceefdca94a00e1ca2231ae1f812d5569e9f63
 * quay.io/rhceph-dev/mcg-core@sha256:4fd42e1593f660573102487f80bceefdca94a00e1ca2231ae1f812d5569e9f63
image registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:6abfa44b8b4d7b45d83b1158865194cb64481148701977167e900e5db4e1eba3
 * registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:6abfa44b8b4d7b45d83b1158865194cb64481148701977167e900e5db4e1eba3
image quay.io/rhceph-dev/mcg-operator@sha256:041a13deba6cc420c68b84bcc2fb38123dff9542f32935d07b5a1529a30171e8
 * quay.io/rhceph-dev/mcg-operator@sha256:041a13deba6cc420c68b84bcc2fb38123dff9542f32935d07b5a1529a30171e8
image quay.io/rhceph-dev/ocs-operator@sha256:2868c5a4409de690182379cb32a9237c354be4c4a0786dd1cc555864c063f698
 * quay.io/rhceph-dev/ocs-operator@sha256:2474cc057a01d913fb7ae0c9b1ff011cf073d90740ea749af794e414c278f208
image quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
 * quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
image quay.io/rhceph-dev/rook-ceph@sha256:684b92c0059955a5ed6e647654c36616c2555f1ede3e0d43a368067db00b392f
 * quay.io/rhceph-dev/rook-ceph@sha256:5b33a6dfc6021ea2d3c4ce8f2302e84a86e9d47698c277f02c9a1ecae780ed1e
With OCP 4.6.0-0.nightly-2020-11-26-234822 and OCS 4.6.0-160.ci on GCP, I can see the same issue as well, and I can confirm that I also see an issue with the image registry.

Besides the already noted alerts:

- Critical PrometheusRuleFailures Prometheus openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m.
- Critical PrometheusRuleFailures Prometheus openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m.

I also see:

- Warning Image Registry Storage configuration has changed in the last 30 minutes. This change may have caused data loss.

However, this warning was firing only for a brief period after installation, which may be why I haven't seen it during the original report.

I need to check how the cluster behaves if I disable the Prometheus reconfiguration which makes Prometheus store data on OCS (this code is part of ocs-ci).
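To double check that the same rule group is failing on this 4.6 cluster, the failure counter from the alert query can be broken down by rule group. This is only a sketch and assumes the rule_group label is exported by this Prometheus version:

```
# Expected to return only the kubernetes.rules group, matching the
# "Evaluating rule failed" warning in the Prometheus log.
sum by (rule_group) (
  increase(prometheus_rule_evaluation_failures_total{job=~"prometheus-k8s|prometheus-user-workload"}[5m])
) > 0
```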
Created attachment 1734095 [details]
screenshot #1: Cluster dashboard of 4.6 cluster on GCP right after installation
(In reply to Martin Bukatovic from comment #6)
> I need to check how the cluster behaves if I disable the Prometheus
> reconfiguration which makes Prometheus store data on OCS (this code is part
> of ocs-ci).

When I redeployed a cluster without using OCS for openshift-monitoring (Prometheus and Alertmanager) storage [1], I still observed the same issue.

- OCP 4.6.0-0.nightly-2020-11-26-234822
- OCS 4.6.0-160.ci

[1] setting persistent-monitoring to false in ocs-ci
https://github.com/red-hat-storage/ocs-ci/blob/0269048a15f9c86b7d41dce055ca87f5f77f8033/conf/examples/without_presistent_monitoring.yaml
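For reference, the failing recording rule (cluster:kubelet_volume_stats_used_bytes:provisioner:sum) could in principle be made tolerant of RWX PVCs that are mounted, and therefore reported, by more than one kubelet, e.g. by collapsing the duplicate per-node series before the join. This is only a sketch of the idea, not necessarily the fix that the OCP bug referenced below ends up shipping:

```
# Possible rewrite of the expression from the failing rule: take max() over the
# per-node duplicates so each PVC contributes a single series to the match.
sum by(provisioner) (
    max by(namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
  * on(namespace, persistentvolumeclaim) group_right()
    (kube_persistentvolumeclaim_info * on(storageclass) group_left(provisioner) kube_storageclass_info)
)
```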
This BZ is caused by a recording rule used for telemetry and has to be fixed in OCP. There is already a BZ for this, so closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1879520, which is already being worked on.

*** This bug has been marked as a duplicate of bug 1879520 ***
Moving to POST, as this BZ tracks one BZ that is already in VERIFIED state and one that is in POST.
Moving to ON_QA as the dependent bugs have already been moved to VERIFIED.
Tracked bugs 1903464 and 1907830 are now verified. Moving to VERIFIED