1932624 – ClusterMonitoringOperatorReconciliationErrors is pending at the end of an upgrade and probably should not be

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Sergiusz Urbaniak
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-02-24 19:18 UTC by Clayton Coleman
Modified:	2021-07-27 22:48 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:	[sig-arch] Check if alerts are firing during or after upgrade success
Last Closed:	2021-07-27 22:48:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1078	0	None	open	Bug 1932624: jsonnet/rules,pkg/operator: use gauge based reconcilation state metrics and alerting	2021-03-11 13:41:44 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:48:34 UTC

Description Clayton Coleman 2021-02-24 19:18:01 UTC

Reconciliation errors should be truly exceptional, not "normal" during a rollout.

We are trying to eliminate sources of noise in upgrades by targeting alerts that fire or are pending at the end of the run.  By tightening these tests, teams will have clear indicators they are introducing potential alert noise.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1364383982690504704

demonstrates this for ClusterMonitoringOperatorReconciliationErrors, which is pending 1m after the upgrade is complete. I would expect reconciliation errors not to be pending, because CMO should handle normal disruption errors silently and other components (i.e. the control plane) should not disrupt CMO during an upgrade. I *suspect* this is because of the known GCP issue where some API requests are disrupted, so feel free to blame this on https://bugzilla.redhat.com/show_bug.cgi?id=1925698 for now.

I'm filing this so that the skip in the test's allowlist of exceptions has a bug to reference.
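
For illustration only, a minimal sketch of what an allowlist check of this kind could look like. It is not the actual origin test code. It assumes the input is the label sets of Prometheus `ALERTS` series (which carry `alertname` and `alertstate` labels); the names `ALLOWED_PENDING` and `unexpected_alerts` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical allowlist: alert name -> tracking bug that justifies the skip.
ALLOWED_PENDING: Dict[str, str] = {
    "ClusterMonitoringOperatorReconciliationErrors":
        "https://bugzilla.redhat.com/show_bug.cgi?id=1932624",
}


def unexpected_alerts(
    series: Iterable[Dict[str, str]],
    allowed: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return sorted (alertname, alertstate) pairs that are pending or
    firing and are not covered by the allowlist."""
    found = []
    for labels in series:
        name = labels.get("alertname", "")
        state = labels.get("alertstate", "")
        if state not in ("pending", "firing"):
            continue  # ignore series that are not alert states
        if name in allowed:
            continue  # tolerated while the tracking bug is open
        found.append((name, state))
    return sorted(found)
```

Keying the allowlist by alert name and storing the bug URL as the value keeps every skip tied to a record, so an entry can be removed once its bug is closed.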

Comment 1 W. Trevor King 2021-03-19 23:07:04 UTC

*** Bug 1940933 has been marked as a duplicate of this bug. ***

Comment 2 W. Trevor King 2021-03-19 23:07:57 UTC

I'm making it easier for Sippy to find this bug by mentioning the relevant test-case.

Comment 9 errata-xmlrpc 2021-07-27 22:48:13 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438