Bug 1978829

Summary:	ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be
Product:	OpenShift Container Platform	Reporter:	Ben Parees <bparees>
Component:	Monitoring	Assignee:	Philip Gough <pgough>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.9	CC:	amuller, anpicker, aos-bugs, erooth, pgough, spasquie, sthaha, wking
Target Milestone:	---
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1999148		Environment:	job=release-openshift-origin-installer-old-rhcos-e2e-aws-4.9=all
Last Closed:	2021-10-18 17:38:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1999148

Description Ben Parees 2021-07-02 21:15:16 UTC

Description of problem:

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.9

is pretty consistently failing due to:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:190]: Jul  1 16:24:28.261: Unexpected alerts fired or pending during the upgrade:

alert ClusterMonitoringOperatorReconciliationErrors fired for 60 seconds with labels: {severity="warning"}


example job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.9/1410608470004076544


ci-search shows that this is failing most of our "old-rhcos" job runs:
https://search.ci.openshift.org/?search=ClusterMonitoringOperatorReconciliationErrors&maxAge=336h&context=1&type=bug%2Bjunit&name=release-openshift-origin-installer-old-rhcos-e2e-aws-4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


Note that this job uses v4.8 nodes with a v4.9 control plane, which may be part of the issue, but I also see that other bugs in this space were marked resolved recently, e.g.:

https://bugzilla.redhat.com/show_bug.cgi?id=1932624

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Fails semi-regularly.


Actual results:
Alert fires for 60s during upgrade

Expected results:
Alert should not fire during upgrade (we expect no warning alerts during upgrades)


If there is no fundamentally fixable flaw in the operator itself, it may be necessary to configure this alert with a longer delay before it fires.

Comment 3 Junqi Zhao 2021-07-13 03:11:26 UTC

Checked with 4.9.0-0.nightly-2021-07-12-143404; the "for" clause has been added:

      - alert: ClusterMonitoringOperatorReconciliationErrors
        annotations:
          message: Cluster Monitoring Operator is experiencing unexpected reconciliation
            errors. Inspect the cluster-monitoring-operator log for potential root causes.
        expr: max_over_time(cluster_monitoring_operator_last_reconciliation_successful[5m])
          == 0
        for: 1h
        labels:
          severity: warning
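
To illustrate why the added "for: 1h" clause keeps this alert from firing during an upgrade, here is a rough Python sketch of how a Prometheus-style rule moves from pending to firing. It is a simplified model, not the Prometheus implementation. The function name alert_states, the 30-second evaluation step, and the 10-minute reconciliation outage are all hypothetical choices made for the example. It also assumes the earlier rule had no "for" clause, which Prometheus treats as a zero hold time.

```python
def alert_states(samples, window, hold_for, step):
    """Return the alert state at each evaluation.

    samples  -- gauge values (1 = last reconciliation succeeded, 0 = failed),
                one per evaluation, `step` seconds apart (hypothetical data)
    window   -- range-vector width in seconds (5m in the rule -> 300)
    hold_for -- the rule's "for" duration in seconds
    """
    states = []
    active_since = None  # time at which the expression first became true
    per_window = window // step
    for i in range(len(samples)):
        now = i * step
        # Approximates max_over_time(metric[window]) == 0:
        # true only if every sample in the trailing window is 0.
        in_window = samples[max(0, i - per_window + 1):i + 1]
        if max(in_window) != 0:
            active_since = None  # condition cleared, alert resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        elapsed = now - active_since
        states.append("firing" if elapsed >= hold_for else "pending")
    return states


STEP = 30  # assumed evaluation interval in seconds
# Hypothetical upgrade: healthy, then 15 minutes of failed reconciliations,
# then healthy again.
samples = [1] * 20 + [0] * 30 + [1] * 20

without_for = alert_states(samples, window=300, hold_for=0, step=STEP)
with_for_1h = alert_states(samples, window=300, hold_for=3600, step=STEP)

print("firing evaluations without for:", without_for.count("firing"))
print("firing evaluations with for=1h:", with_for_1h.count("firing"))
```

In this made-up scenario the rule with no hold time fires as soon as the 5-minute window contains only failures, while the rule with a 1-hour hold time only ever reaches pending, since the condition clears long before the hour elapses.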

Comment 10 errata-xmlrpc 2021-10-18 17:38:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759