1828477 – Fail upgrade CI if critical alerts are firing

Summary: Fail upgrade CI if critical alerts are firing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.4.z
Assignee:	Jack Ottofaro
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:	1828427
Blocks:

Reported:	2020-04-27 19:21 UTC by W. Trevor King
Modified:	2020-07-18 05:31 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Upgrade tests now fail if a critical alert is firing after the upgrade has completed.
Clone Of:	1828427
Environment:
Last Closed:	2020-07-14 01:43:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 25204	0	None	closed	Bug 1828477: Add CI test to check for critical alerts post upgrade	2021-01-13 02:56:39 UTC
Red Hat Product Errata	RHBA-2020:2871	0	None	None	None	2020-07-14 01:44:15 UTC

Description W. Trevor King 2020-04-27 19:21:47 UTC

Cloned back to 4.4.  We'll merge the backport after the 4.5 change has had sufficient cook time to show that it's catching alerts that are too sensitive and not giving false-positives on alerts that are appropriately sensitive.  Maybe a month?  Anyhow, filing the bug so we don't forget about the backport.  We will probably not backport to 4.3, because by the time we cook the 4.4 backport, 4.3 is likely to be in the maintenance lifecycle phase.

+++ This bug was initially created as a clone of Bug #1828427 +++

Description of problem:

During the upgrade CI test, once the upgrade has been applied successfully, no critical alerts should be firing on the cluster.
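
As a rough illustration of the check described above, the sketch below filters a decoded Prometheus `/api/v1/query` response for firing critical alerts. It is not the Go test from openshift/origin. The names `QUERY`, `ALLOWED`, `firing_critical_alerts`, and `check_upgrade` are hypothetical, and the allow-list is an assumption borrowed from the test name quoted in comment 5.

```python
# Illustrative sketch only; not the test added in openshift/origin.

# PromQL selecting critical alerts that are currently firing.
QUERY = 'ALERTS{alertstate="firing",severity="critical"}'

# Assumption: alerts tolerated on a healthy cluster (from the test name in comment 5).
ALLOWED = frozenset({"Watchdog", "AlertmanagerReceiversNotConfigured"})


def firing_critical_alerts(response, allowed=ALLOWED):
    """Return sorted names of firing critical alerts found in a decoded
    Prometheus /api/v1/query response, skipping allow-listed names."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    names = set()
    for sample in response["data"]["result"]:
        labels = sample.get("metric", {})
        # Re-check the labels in case the response came from a broader query.
        if labels.get("alertstate") != "firing":
            continue
        if labels.get("severity") != "critical":
            continue
        name = labels.get("alertname", "")
        if name and name not in allowed:
            names.add(name)
    return sorted(names)


def check_upgrade(response):
    """Raise AssertionError if any unexpected critical alert is firing."""
    bad = firing_critical_alerts(response)
    assert not bad, "critical alerts firing after upgrade: " + ", ".join(bad)
```

A caller would run `QUERY` against the cluster's Prometheus after the upgrade completes and pass the decoded JSON body to `check_upgrade`.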

Additional info:

With this change in place, previous bugs such as https://bugzilla.redhat.com/show_bug.cgi?id=1824988 would have been uncovered during CI.

Once the KubeAPIErrorBudgetBurn alert issue (https://bugzilla.redhat.com/show_bug.cgi?id=1821661) is fixed, the change in https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd will be reverted.

Comment 1 W. Trevor King 2020-05-15 01:43:59 UTC

Will need a manual backport [1].

[1]: https://github.com/openshift/origin/pull/24786#issuecomment-628975057

Comment 2 Jack Ottofaro 2020-05-28 17:36:58 UTC

Although an important bug, I'm adding UpcomingSprint since I am occupied by other important tasks. I will revisit this bug next sprint.

Comment 5 liujia 2020-07-07 02:31:16 UTC

# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric.alertname' | sort | uniq -c | sort -n | tail
      1 etcdMembersDown
      3 KubeNodeUnreachable

# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results.*etcdMembersDown&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784

# curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784/build-log.txt| grep -B8 'report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured' | tail -n6
Failing tests:

[Feature:APIServer] [Top Level] [Feature:APIServer] anonymous browsers should get a 403 from / [Suite:openshift/conformance/parallel]
[Feature:OpenShiftAuthorization] The default cluster RBAC policy [Top Level] [Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Late] Alerts [Top Level] [Feature:Prometheus][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

Verified that the CI job above failed as expected.
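
For readers less familiar with the shell idiom, the final stage of the first command in this comment (`jq -r '.[].metric.alertname' | sort | uniq -c | sort -n`) amounts to a per-alert tally. A minimal Python sketch of that stage follows. The function name `tally_alertnames` is hypothetical, and the input is assumed to be an already-parsed list of Prometheus result samples.

```python
from collections import Counter


def tally_alertnames(samples):
    """Count alertname labels across samples and return (count, name)
    pairs sorted by ascending count, then by name."""
    counts = Counter(
        sample["metric"]["alertname"]
        for sample in samples
        # Samples without an alertname label are skipped in this sketch.
        if "alertname" in sample.get("metric", {})
    )
    return sorted((count, name) for name, count in counts.items())
```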

Comment 8 W. Trevor King 2020-07-10 22:38:16 UTC

Twiddled the doc text to drop "CI".  Customers can run tests from the 'tests' image whenever they want, whether that's CI or otherwise.

Comment 10 Jeana Routh 2020-07-13 13:50:50 UTC

Thanks, Trevor! How about something like this?
"Previously, if there were critical alerts during upgrade tests, the upgrade completed successfully. Now upgrade tests fail if a critical alert is found."

Comment 11 errata-xmlrpc 2020-07-14 01:43:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2871

Comment 12 W. Trevor King 2020-07-18 05:31:02 UTC

(In reply to Jeana Routh from comment #10)
> "Previously, if there were critical alerts during upgrade tests, the upgrade
> completed successfully. Now upgrade tests fail if a critical alert is found."

Sounds good to me.  Not sure if it matters now that the errata is public?
