Bug 1932624 - ClusterMonitoringOperatorReconciliationErrors is pending at the end of an upgrade and probably should not be
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-24 19:18 UTC by Clayton Coleman
Modified: 2021-07-27 22:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-arch] Check if alerts are firing during or after upgrade success
Last Closed: 2021-07-27 22:48:13 UTC
Target Upstream Version:

Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1078 0 None open Bug 1932624: jsonnet/rules,pkg/operator: use gauge based reconcilation state metrics and alerting 2021-03-11 13:41:44 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:48:34 UTC

Description Clayton Coleman 2021-02-24 19:18:01 UTC
Reconciliation errors should be truly exceptional, not "normal" during a rollout.

We are trying to eliminate sources of noise in upgrades by targeting alerts that fire or are pending at the end of the run.  By tightening these tests, teams will have clear indicators they are introducing potential alert noise.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1364383982690504704

demonstrates this for the ClusterMonitoringOperatorReconciliationErrors alert, which is pending 1m after the upgrade is complete. I would expect the alert not to be pending, because CMO should handle normal disruption errors silently and other components (i.e. the control plane) should not disrupt CMO during an upgrade. I *suspect* this is because of the known GCP issue where some API requests are disrupted, so feel free to blame this on https://bugzilla.redhat.com/show_bug.cgi?id=1925698 for now.

I'm filing this so that the skip in the test's allowlist of exceptions has a bug on record to point to.
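
For illustration only, the kind of check described above (flag any alert that is pending or firing at the end of an upgrade unless it is covered by an allowlisted exception) could be sketched roughly as follows. This is not the actual origin test code: the function name, the allowlist structure, and the sample data are hypothetical. The only assumption about real interfaces is that each alert is a dict with a `state` field and a `labels.alertname` field, as in the Prometheus `/api/v1/alerts` response.

```python
# Hypothetical sketch, not the real openshift/origin test.
from typing import Dict, Iterable, List

# Assumed allowlist: alert name -> bug that tracks the exception.
ALLOWED: Dict[str, str] = {
    "ClusterMonitoringOperatorReconciliationErrors":
        "https://bugzilla.redhat.com/show_bug.cgi?id=1932624",
}


def unexpected_alerts(alerts: Iterable[dict],
                      allowed: Dict[str, str] = ALLOWED) -> List[str]:
    """Return sorted names of pending or firing alerts that are not allowlisted."""
    names = set()
    for alert in alerts:
        # Only pending and firing alerts count as noise.
        if alert.get("state") not in ("pending", "firing"):
            continue
        name = alert.get("labels", {}).get("alertname", "")
        if name and name not in allowed:
            names.add(name)
    return sorted(names)


# Made-up sample data in the assumed alert shape.
sample = [
    {"labels": {"alertname": "ClusterMonitoringOperatorReconciliationErrors"},
     "state": "pending"},
    {"labels": {"alertname": "ExampleAlert"}, "state": "firing"},
    {"labels": {"alertname": "InactiveExample"}, "state": "inactive"},
]

print(unexpected_alerts(sample))
```

Keying each allowlist entry to a bug URL keeps every exception traceable, so an entry can be removed once its bug is fixed.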

Comment 1 W. Trevor King 2021-03-19 23:07:04 UTC
*** Bug 1940933 has been marked as a duplicate of this bug. ***

Comment 2 W. Trevor King 2021-03-19 23:07:57 UTC
I'm making it easier for Sippy to find this bug by mentioning the relevant test-case.

Comment 9 errata-xmlrpc 2021-07-27 22:48:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438