Bug 1828477

Summary: Fail upgrade CI if critical alerts are firing
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Cluster Version Operator Assignee: Jack Ottofaro <jack.ottofaro>
Status: CLOSED ERRATA QA Contact: liujia <jiajliu>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.3.0 CC: aos-bugs, eparis, jack.ottofaro, jokerman, jrouth, lmohanty, skuznets, sponnaga, wking
Target Milestone: ---   
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Upgrade tests now fail if a critical alert is firing after the upgrade has completed.
Story Points: ---
Clone Of: 1828427 Environment:
Last Closed: 2020-07-14 01:43:52 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1828427    
Bug Blocks:    

Description W. Trevor King 2020-04-27 19:21:47 UTC
Cloned back to 4.4.  We'll merge the backport after the 4.5 change has had sufficient cook time to show that it's catching alerts that are too sensitive and not giving false-positives on alerts that are appropriately sensitive.  Maybe a month?  Anyhow, filing the bug so we don't forget about the backport.  We will probably not backport to 4.3, because by the time we cook the 4.4 backport, 4.3 is likely to be in the maintenance lifecycle phase.

+++ This bug was initially created as a clone of Bug #1828427 +++

Description of problem:

During the upgrade CI test, after the upgrade has been applied successfully, there should not be any critical alerts firing on the cluster.
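
For illustration only, the check described above could be sketched roughly as follows. This is not the code from the origin pull request; it assumes the test has already queried Prometheus for ALERTS series and holds the JSON body of an instant-query response. The function name, the sample response, and the severities in it are hypothetical; the two exempted alert names come from the test title quoted later in this bug.

```python
import json
from typing import List

# Alerts the test tolerates, per the test title quoted in this bug.
ALLOWED_ALERTS = {"Watchdog", "AlertmanagerReceiversNotConfigured"}


def firing_critical_alerts(query_response: str) -> List[str]:
    """Return sorted names of firing, critical alerts that are not exempt.

    query_response is assumed to be the JSON body of a Prometheus
    instant query over the ALERTS series (hypothetical input).
    """
    body = json.loads(query_response)
    names = set()
    for sample in body["data"]["result"]:
        labels = sample["metric"]
        if labels.get("alertstate") != "firing":
            continue  # pending alerts do not count
        if labels.get("severity") != "critical":
            continue  # only critical alerts fail the test in this sketch
        name = labels.get("alertname", "")
        if name in ALLOWED_ALERTS:
            continue
        names.add(name)
    return sorted(names)


# Hypothetical response, constructed for this example only.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"alertname": "Watchdog", "alertstate": "firing",
                        "severity": "none"}, "value": [0, "1"]},
            {"metric": {"alertname": "etcdMembersDown", "alertstate": "firing",
                        "severity": "critical"}, "value": [0, "1"]},
            {"metric": {"alertname": "KubeNodeUnreachable", "alertstate": "pending",
                        "severity": "warning"}, "value": [0, "1"]},
        ],
    },
})
```

A test built on this sketch would fail whenever firing_critical_alerts returns a non-empty list; for SAMPLE_RESPONSE it returns ['etcdMembersDown'].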

Additional info:

With this change in place, previous bugs such as https://bugzilla.redhat.com/show_bug.cgi?id=1824988 would have been uncovered during CI.

Once the KubeAPIErrorBudgetBurn alert issue (https://bugzilla.redhat.com/show_bug.cgi?id=1821661) is fixed, the change in https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd will be reverted.

Comment 1 W. Trevor King 2020-05-15 01:43:59 UTC
Will need a manual backport [1].

[1]: https://github.com/openshift/origin/pull/24786#issuecomment-628975057

Comment 2 Jack Ottofaro 2020-05-28 17:36:58 UTC
Although an important bug, I'm adding UpcomingSprint since I am occupied by other important tasks. I will revisit this bug next sprint.

Comment 5 liujia 2020-07-07 02:31:16 UTC
# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric.alertname' | sort | uniq -c | sort -n | tail
      1 etcdMembersDown
      3 KubeNodeUnreachable

# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results.*etcdMembersDown&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784

# curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784/build-log.txt | grep -B8 'report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured' | tail -n6
Failing tests:

[Feature:APIServer] [Top Level] [Feature:APIServer] anonymous browsers should get a 403 from / [Suite:openshift/conformance/parallel]
[Feature:OpenShiftAuthorization] The default cluster RBAC policy [Top Level] [Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Late] Alerts [Top Level] [Feature:Prometheus][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

Confirmed that the CI job above failed as expected.
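
As a rough sketch, the tallying step at the end of the first pipeline above (jq -r '.[].metric.alertname' | sort | uniq -c | sort -n) could also be written in Python. It assumes the input is a JSON array of Prometheus samples that each carry metric.alertname, as that jq filter implies. The function name is hypothetical, and the sample data is constructed only to mirror the counts shown above; it is not taken from the CI job.

```python
import json
from collections import Counter
from typing import List, Tuple


def tally_alert_names(results_json: str) -> List[Tuple[int, str]]:
    """Count samples per alertname, least frequent first."""
    samples = json.loads(results_json)
    counts = Counter(s["metric"]["alertname"] for s in samples)
    # Sort by count, then by name, so ties have a stable order.
    return sorted((n, name) for name, n in counts.items())


# Hypothetical input built to mirror the counts reported above.
SAMPLE_RESULTS = json.dumps(
    [{"metric": {"alertname": "KubeNodeUnreachable"}}] * 3
    + [{"metric": {"alertname": "etcdMembersDown"}}]
)

for count, name in tally_alert_names(SAMPLE_RESULTS):
    print(f"{count:7d} {name}")
```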

Comment 8 W. Trevor King 2020-07-10 22:38:16 UTC
Twiddled the doc text to drop "CI".  Customers can run tests from the 'tests' image whenever they want, whether that's CI or otherwise.

Comment 10 Jeana Routh 2020-07-13 13:50:50 UTC
Thanks, Trevor! How about something like this?
"Previously, if there were critical alerts during upgrade tests, the upgrade completed successfully. Now upgrade tests fail if a critical alert is found."

Comment 11 errata-xmlrpc 2020-07-14 01:43:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2871

Comment 12 W. Trevor King 2020-07-18 05:31:02 UTC
(In reply to Jeana Routh from comment #10)
> "Previously, if there were critical alerts during upgrade tests, the upgrade
> completed successfully. Now upgrade tests fail if a critical alert is found."

Sounds good to me.  Not sure if it matters now that the errata is public?