Bug 1828477 - Fail upgrade CI if critical alerts are firing
Summary: Fail upgrade CI if critical alerts are firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.z
Assignee: Jack Ottofaro
QA Contact: liujia
URL:
Whiteboard:
Depends On: 1828427
Blocks:
Reported: 2020-04-27 19:21 UTC by W. Trevor King
Modified: 2020-07-18 05:31 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Upgrade tests now fail if a critical alert is firing after the upgrade has completed.
Clone Of: 1828427
Environment:
Last Closed: 2020-07-14 01:43:52 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 25204 | 0 | None | closed | Bug 1828477: Add CI test to check for critical alerts post upgrade | 2021-01-13 02:56:39 UTC
Red Hat Product Errata | RHBA-2020:2871 | 0 | None | None | None | 2020-07-14 01:44:15 UTC

Description W. Trevor King 2020-04-27 19:21:47 UTC
Cloned back to 4.4.  We'll merge the backport after the 4.5 change has had sufficient cook time to show that it's catching alerts that are too sensitive and not giving false positives on alerts that are appropriately sensitive.  Maybe a month?  Anyhow, filing the bug so we don't forget about the backport.  We will probably not backport to 4.3, because by the time we cook the 4.4 backport, 4.3 is likely to be in the maintenance lifecycle phase.

+++ This bug was initially created as a clone of Bug #1828427 +++

Description of problem:

During the upgrade CI test, once the upgrade has been applied successfully, there should not be any critical alerts firing on the cluster.

Additional info:

With this change in place, previous bugs such as https://bugzilla.redhat.com/show_bug.cgi?id=1824988 would have been uncovered during CI.

Once https://bugzilla.redhat.com/show_bug.cgi?id=1821661 (the KubeAPIErrorBudgetBurn alert issue) is fixed, the change https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd will be reverted.
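
The check described under "Description of problem" can be sketched as follows. This is only an illustration, not the test that was merged in openshift/origin. It assumes the caller has already queried Prometheus for the ALERTS series and holds the parsed JSON of a standard instant-query response. The function names and the allow-list are hypothetical; the two allowed alert names are taken from the test title quoted later in this bug.

```python
from typing import Any, Dict, FrozenSet, List

# Assumption of this sketch: alerts that are tolerated even when firing.
# The two names come from the test title quoted in this bug report.
ALLOWED_ALERTS: FrozenSet[str] = frozenset(
    {"Watchdog", "AlertmanagerReceiversNotConfigured"}
)


def firing_critical_alerts(
    response: Dict[str, Any],
    allowed: FrozenSet[str] = ALLOWED_ALERTS,
) -> List[str]:
    """Return the sorted names of firing critical alerts.

    `response` is the parsed JSON of a Prometheus instant query over the
    ALERTS series, i.e. {"status": ..., "data": {"result": [...]}}.
    """
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % (response.get("status"),))

    names = set()
    for sample in response["data"]["result"]:
        labels = sample.get("metric", {})
        if labels.get("alertstate") != "firing":
            continue  # pending alerts do not fail the check
        if labels.get("severity") != "critical":
            continue  # only critical alerts are considered
        name = labels.get("alertname", "<unnamed>")
        if name in allowed:
            continue
        names.add(name)
    return sorted(names)


def check_post_upgrade(response: Dict[str, Any]) -> None:
    """Raise AssertionError if any critical alert is firing (hypothetical helper)."""
    alerts = firing_critical_alerts(response)
    if alerts:
        raise AssertionError(
            "critical alerts firing after upgrade: " + ", ".join(alerts)
        )
```

A caller would run check_post_upgrade once the upgrade has been reported as complete, so that alerts which only fire during the upgrade itself do not fail the run.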

Comment 1 W. Trevor King 2020-05-15 01:43:59 UTC
Will need a manual backport [1].

[1]: https://github.com/openshift/origin/pull/24786#issuecomment-628975057

Comment 2 Jack Ottofaro 2020-05-28 17:36:58 UTC
Although an important bug, I'm adding UpcomingSprint since I am occupied by other important tasks. I will revisit this bug next sprint.

Comment 5 liujia 2020-07-07 02:31:16 UTC
# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric.alertname' | sort | uniq -c | sort -n | tail
      1 etcdMembersDown
      3 KubeNodeUnreachable

# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results.*etcdMembersDown&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784

# curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784/build-log.txt| grep -B8 'report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured' | tail -n6
Failing tests:

[Feature:APIServer] [Top Level] [Feature:APIServer] anonymous browsers should get a 403 from / [Suite:openshift/conformance/parallel]
[Feature:OpenShiftAuthorization] The default cluster RBAC policy [Top Level] [Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Late] Alerts [Top Level] [Feature:Prometheus][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

Checked that the CI job above failed as expected.
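
For reference, the final tally step of the first command in this comment (jq -r '.[].metric.alertname' | sort | uniq -c | sort -n) could be written in Python roughly as below. This is a sketch under assumptions: it starts from the already-extracted and unescaped JSON arrays of Prometheus samples, one string per failed job, and does not reproduce the curl and sed stages. The function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def tally_alert_names(result_arrays: Iterable[str]) -> List[Tuple[int, str]]:
    """Count alert names across several JSON-encoded sample arrays.

    Each input string is a JSON array of samples shaped like
    {"metric": {"alertname": ...}, ...}. The return value is a list of
    (count, alertname) pairs sorted by ascending count.
    """
    counts: Counter = Counter()
    for text in result_arrays:
        for sample in json.loads(text):
            counts[sample["metric"]["alertname"]] += 1
    # Sort on (count, name) so the most frequent alert comes last.
    return sorted((n, name) for name, n in counts.items())
```

Printing each pair as count followed by name gives output in the same shape as the uniq -c listing above.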

Comment 8 W. Trevor King 2020-07-10 22:38:16 UTC
Twiddled the doc text to drop "CI".  Customers can run tests from the 'tests' image whenever they want, whether that's CI or otherwise.

Comment 10 Jeana Routh 2020-07-13 13:50:50 UTC
Thanks, Trevor! How about something like this?
"Previously, if there were critical alerts during upgrade tests, the upgrade completed successfully. Now upgrade tests fail if a critical alert is found."

Comment 11 errata-xmlrpc 2020-07-14 01:43:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2871

Comment 12 W. Trevor King 2020-07-18 05:31:02 UTC
(In reply to Jeana Routh from comment #10)
> "Previously, if there were critical alerts during upgrade tests, the upgrade
> completed successfully. Now upgrade tests fail if a critical alert is found."

Sounds good to me.  Not sure if it matters now that the errata is public?

