1828427 – Fail upgrade CI if critical alerts are firing

Bug 1828427 - Fail upgrade CI if critical alerts are firing

Summary: Fail upgrade CI if critical alerts are firing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Unknown
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Sudha Ponnaganti
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1828477
TreeView+	depends on / blocked

Reported:	2020-04-27 17:15 UTC by Jack Ottofaro
Modified:	2020-07-13 17:32 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Change to CI upgrade test to fail test if a critical alert is firing after the upgrade has completed.
Clone Of:
Clones:	1828477 (view as bug list)
Environment:
Last Closed:	2020-07-13 17:31:52 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 24786	0	None	closed	Bug 1828427: Add CI test to check for critical alerts post upgrade success	2021-01-13 02:57:20 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:32:09 UTC

Description Jack Ottofaro 2020-04-27 17:15:34 UTC

Description of problem:

During upgrade CI test and after the upgrade has been applied successfully, there should not be any critical alerts firing on the cluster. 

Additional info:

With this change in place, previous bugs such as https://bugzilla.redhat.com/show_bug.cgi?id=1824988 would have been uncovered during CI.

Once https://bugzilla.redhat.com/show_bug.cgi?id=1821661, KubeAPIErrorBudgetBurn alert issue, is fixed change https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd will be reverted.

Comment 3 W. Trevor King 2020-05-15 01:33:55 UTC

This is CI, so we can VERIFY without QE.  Checking for recent alerts:

$ curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric.alertname' | sort | uniq -c | sort -n | tail
      1 ClusterOperatorDegraded
      1 KubeAPIErrorBudgetBurn
      1 KubeNodeUnreachable
      2 AggregatedAPIErrors
      2 ClusterOperatorDown
      3 etcdMembersDown
      5 ImagePruningDisabled

Finding jobs with the etcdMembersDown:

$ curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+in
correct+results.*etcdMembersDown&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/81
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci/66
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/45

Confirming that the failure was fatal (and not marked as a flaky test):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/81/build-log.txt | grep -B8 'report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured' | tail -n9
Failing tests:

[Conformance][templates] templateinstance readiness test  [Top Level] [Conformance][templates] templateinstance readiness test  should report ready soon after all annotated objects are ready [Suite:openshift/conformance/parallel/minimal]
[Feature:APIServer] [Top Level] [Feature:APIServer] anonymous browsers should get a 403 from / [Suite:openshift/conformance/parallel]
[Feature:OpenShiftAuthorization] The default cluster RBAC policy [Top Level] [Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel]
[Feature:Platform] Managed cluster [Top Level] [Feature:Platform] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Late] Alerts [Top Level] [Feature:Prometheus][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

Comment 4 errata-xmlrpc 2020-07-13 17:31:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Note You need to log in before you can comment on or make changes to this bug.