Bug 1826470

Summary: [3.11] - CMO - alert triggering "Cluster Monitoring Operator is experiencing 100% errors"
Product: OpenShift Container Platform
Reporter: Vladislav Walek <vwalek>
Component: Monitoring
Assignee: Lili Cosic <lcosic>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-06 07:36:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vladislav Walek 2020-04-21 18:29:18 UTC
Description of problem:

the alert "Cluster Monitoring Operator is experiencing 100% errors" is triggered, however, all the components are up and running.

The value of "cluster_monitoring_operator_reconcile_errors_total" is 1083. The failing error is shown in the CMO logs below.

Alert Expression:

expr: sum(rate(cluster_monitoring_operator_reconcile_errors_total[15m]))
  * 100 / sum(rate(cluster_monitoring_operator_reconcile_attempts_total[15m])) >
  10
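
For triage, each side of that ratio can be checked on its own in the Prometheus UI; a minimal pair of queries, reusing the metric names from the alert expression:

  sum(rate(cluster_monitoring_operator_reconcile_errors_total[15m]))
  sum(rate(cluster_monitoring_operator_reconcile_attempts_total[15m]))

Note that the raw counter value of 1083 does not by itself fire the alert; only the ratio of the two 15m rates does.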

Checking the logs, the only errors I see in the CMO are:
E0416 20:40:08.325556       1 operator.go:206] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0416 20:40:08.325586       1 operator.go:207] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: timed out waiting for the condition
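
Since the metric is a counter (its absolute value only ever grows and says nothing about recency), a query along the following lines can show whether reconcile errors were still occurring around the time the alert fired; this is a sketch reusing the same metric name:

  increase(cluster_monitoring_operator_reconcile_errors_total[1h])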

However, I don't see any other issue in Prometheus except the one below:

prometheus-k8s-1 prometheus level=error ts=2020-04-16T23:12:07.551383179Z caller=endpoints.go:130 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
prometheus-k8s-1 prometheus level=error ts=2020-04-16T23:12:07.551406773Z caller=endpoints.go:130 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"


Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.11


How reproducible:
n/a

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info: