Bug 1828427 - Fail upgrade CI if critical alerts are firing
Summary: Fail upgrade CI if critical alerts are firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Unknown
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.5.0
Assignee: Sudha Ponnaganti
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1828477
TreeView+ depends on / blocked
 
Reported: 2020-04-27 17:15 UTC by Jack Ottofaro
Modified: 2020-07-13 17:32 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Change to CI upgrade test to fail test if a critical alert is firing after the upgrade has completed.
Clone Of:
: 1828477 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:31:52 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24786 0 None closed Bug 1828427: Add CI test to check for critical alerts post upgrade success 2021-01-13 02:57:20 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:32:09 UTC

Description Jack Ottofaro 2020-04-27 17:15:34 UTC
Description of problem:

During upgrade CI test and after the upgrade has been applied successfully, there should not be any critical alerts firing on the cluster. 

Additional info:

With this change in place, previous bugs such as https://bugzilla.redhat.com/show_bug.cgi?id=1824988 would have been uncovered during CI.

Once https://bugzilla.redhat.com/show_bug.cgi?id=1821661, KubeAPIErrorBudgetBurn alert issue, is fixed change https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd will be reverted.

Comment 3 W. Trevor King 2020-05-15 01:33:55 UTC
This is CI, so we can VERIFY without QE.  Checking for recent alerts:

$ curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric.alertname' | sort | uniq -c | sort -n | tail
      1 ClusterOperatorDegraded
      1 KubeAPIErrorBudgetBurn
      1 KubeNodeUnreachable
      2 AggregatedAPIErrors
      2 ClusterOperatorDown
      3 etcdMembersDown
      5 ImagePruningDisabled

Finding jobs with the etcdMembersDown:

$ curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+in
correct+results.*etcdMembersDown&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/81
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci/66
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/45

Confirming that the failure was fatal (and not marked as a flaky test):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/81/build-log.txt | grep -B8 'report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured' | tail -n9
Failing tests:

[Conformance][templates] templateinstance readiness test  [Top Level] [Conformance][templates] templateinstance readiness test  should report ready soon after all annotated objects are ready [Suite:openshift/conformance/parallel/minimal]
[Feature:APIServer] [Top Level] [Feature:APIServer] anonymous browsers should get a 403 from / [Suite:openshift/conformance/parallel]
[Feature:OpenShiftAuthorization] The default cluster RBAC policy [Top Level] [Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel]
[Feature:Platform] Managed cluster [Top Level] [Feature:Platform] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Late] Alerts [Top Level] [Feature:Prometheus][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

Comment 4 errata-xmlrpc 2020-07-13 17:31:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409


Note You need to log in before you can comment on or make changes to this bug.