Cloned back to 4.4. We'll merge the backport after the 4.5 change has had sufficient cook time to show that it's catching alerts that are too sensitive and not giving false positives on alerts that are appropriately sensitive. Maybe a month? Anyhow, filing the bug so we don't forget about the backport. We will probably not backport to 4.3, because by the time we cook the 4.4 backport, 4.3 is likely to be in the maintenance lifecycle phase.
+++ This bug was initially created as a clone of Bug #1828427 +++
Description of problem: During the upgrade CI test, and after the upgrade has been applied successfully, there should not be any critical alerts firing on the cluster.
Additional info: With this change in place, previous bugs such as https://bugzilla.redhat.com/show_bug.cgi?id=1824988 would have been uncovered during CI. Once https://bugzilla.redhat.com/show_bug.cgi?id=1821661 (the KubeAPIErrorBudgetBurn alert issue) is fixed, the change https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd will be reverted.
Will need a manual backport [1].
[1]: https://github.com/openshift/origin/pull/24786#issuecomment-628975057
Although this is an important bug, I'm adding UpcomingSprint because I'm occupied with other important tasks. I will revisit this bug next sprint.
# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[].metric.alertname' | sort | uniq -c | sort -n | tail
      1 etcdMembersDown
      3 KubeNodeUnreachable
# curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results.*etcdMembersDown&maxAge=24h&context=0&type=junit&name=upgrade' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784
# curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/1280000919693430784/build-log.txt | grep -B8 'report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured' | tail -n6
Failing tests:
[Feature:APIServer] [Top Level] [Feature:APIServer] anonymous browsers should get a 403 from / [Suite:openshift/conformance/parallel]
[Feature:OpenShiftAuthorization] The default cluster RBAC policy [Top Level] [Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel]
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel/minimal]
[Feature:Prometheus][Late] Alerts [Top Level] [Feature:Prometheus][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Checked that the above CI job failed as expected.
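For anyone who prefers not to chain sed and jq, the tallying step of the first pipeline above could be sketched in Python. This is only a rough sketch: it assumes each failure message has already been reduced to a JSON array of Prometheus samples carrying a metric.alertname label (which is what the jq filter above implies). The function name count_alert_names and the example payloads are hypothetical and are not real CI output.

```python
import json
from collections import Counter
from typing import Iterable


def count_alert_names(payloads: Iterable[str]) -> Counter:
    """Tally alertname labels across JSON arrays of Prometheus samples.

    Each payload is assumed (not confirmed by the test source) to be a JSON
    array whose elements look like {"metric": {"alertname": ...}, ...}.
    """
    counts = Counter()
    for payload in payloads:
        for sample in json.loads(payload):
            name = sample.get("metric", {}).get("alertname")
            if name is not None:
                counts[name] += 1
    return counts


# Hypothetical payloads for illustration only; not taken from a real job.
example_payloads = [
    json.dumps([
        {"metric": {"alertname": "KubeNodeUnreachable", "alertstate": "firing"}},
        {"metric": {"alertname": "etcdMembersDown", "alertstate": "firing"}},
    ]),
    json.dumps([
        {"metric": {"alertname": "KubeNodeUnreachable", "alertstate": "firing"}},
    ]),
]

if __name__ == "__main__":
    # Print in ascending count order, similar to `sort | uniq -c | sort -n`.
    for name, n in sorted(count_alert_names(example_payloads).items(),
                          key=lambda kv: kv[1]):
        print(f"{n:7d} {name}")
```

Fetching the search results and extracting the arrays from the junit context is left to the curl and sed steps shown above; this sketch only covers the counting.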
Twiddled the doc text to drop "CI". Customers can run tests from the 'tests' image whenever they want, whether that's CI or otherwise.
Thanks, Trevor! How about something like this? "Previously, if there were critical alerts during upgrade tests, the upgrade completed successfully. Now upgrade tests fail if a critical alert is found."
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2871
(In reply to Jeana Routh from comment #10)
> "Previously, if there were critical alerts during upgrade tests, the upgrade
> completed successfully. Now upgrade tests fail if a critical alert is found."
Sounds good to me. Not sure if it matters now that the errata is public?