Bug 1828477
Summary: | Fail upgrade CI if critical alerts are firing | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
Component: | Cluster Version Operator | Assignee: | Jack Ottofaro <jack.ottofaro> |
Status: | CLOSED ERRATA | QA Contact: | liujia <jiajliu> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.3.0 | CC: | aos-bugs, eparis, jack.ottofaro, jokerman, jrouth, lmohanty, skuznets, sponnaga, wking |
Target Milestone: | --- | ||
Target Release: | 4.4.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: | Change to upgrade tests to fail test if a critical alert is firing after the upgrade has completed. |
Story Points: | --- |
Clone Of: | 1828427 | Environment: | |
Last Closed: | 2020-07-14 01:43:52 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1828427 | ||
Bug Blocks: |
Description
W. Trevor King
2020-04-27 19:21:47 UTC
Will need a manual backport [1].

[1]: https://github.com/openshift/origin/pull/24786#issuecomment-628975057

Although an important bug, I'm adding UpcomingSprint since I am occupied by other important tasks. I will revisit this bug next sprint.

# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric.alertname' | sort | uniq -c | sort -n | tail
      1 etcdMembersDown
      3 KubeNodeUnreachable
# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results.*etcdMembersDown&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784
# curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784/build-log.txt| grep -B8 'report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured' | tail -n6
Failing tests:
[Feature:APIServer] [Top Level] [Feature:APIServer] anonymous browsers should get a 403 from / [Suite:openshift/conformance/parallel]
[Feature:OpenShiftAuthorization] The default cluster RBAC policy [Top Level] [Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Late] Alerts [Top Level] [Feature:Prometheus][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

Checked above ci job failed as expected.

Twiddled the doc text to drop "CI". Customers can run tests from the 'tests' image whenever they want, whether that's CI or otherwise.

Thanks, Trevor! How about something like this?

"Previously, if there were critical alerts during upgrade tests, the upgrade completed successfully. Now upgrade tests fail if a critical alert is found."

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2871

(In reply to Jeana Routh from comment #10)
> "Previously, if there were critical alerts during upgrade tests, the upgrade
> completed successfully. Now upgrade tests fail if a critical alert is found."

Sounds good to me. Not sure if it matters now that the errata is public?
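For anyone repeating the verification above without the jq/sed pipeline, the final tallying step (`jq -r '.[].metric.alertname' | sort | uniq -c`) can be sketched in Python. This is a minimal sketch, not the code used by the test suite: it assumes the extracted failure text is a JSON array of samples that each carry a `metric.alertname` field, as the jq filter above implies. The function name `count_unexpected_alerts` and the sample data are hypothetical; the two tolerated alert names are taken from the test title.

```python
import json
from collections import Counter

# Alerts the test title says may be firing without failing the run.
ALLOWED = frozenset({"Watchdog", "AlertmanagerReceiversNotConfigured"})


def count_unexpected_alerts(samples_json, allowed=ALLOWED):
    """Tally alert names in a JSON array of samples, skipping allowed ones.

    Assumes each sample looks like {"metric": {"alertname": "..."}, ...};
    samples without an alertname are ignored.
    """
    samples = json.loads(samples_json)
    names = (s.get("metric", {}).get("alertname") for s in samples)
    return Counter(n for n in names if n and n not in allowed)


if __name__ == "__main__":
    # Hypothetical sample data, not copied from a real CI run.
    example = json.dumps([
        {"metric": {"alertname": "KubeNodeUnreachable"}},
        {"metric": {"alertname": "Watchdog"}},
        {"metric": {"alertname": "etcdMembersDown"}},
        {"metric": {"alertname": "KubeNodeUnreachable"}},
    ])
    counts = count_unexpected_alerts(example)
    # Print in ascending order of count, like `sort -n`.
    for name, count in sorted(counts.items(), key=lambda kv: kv[1]):
        print(count, name)
```

A non-empty result corresponds to the condition under which the "[Feature:Prometheus][Late] Alerts" test is expected to fail.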