We are adding tests that verify no alerts fire during upgrades unless something has actually failed. This caught a failure during a run (good!), so we need to investigate why this alert is firing and fix the bug that is causing it.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1364383982690504704 failed on:

[
  {
    "metric": {
      "alertname": "FailingOperator",
      "alertstate": "firing",
      "container": "olm-operator",
      "endpoint": "https-metrics",
      "exported_namespace": "openshift-operator-lifecycle-manager",
      "instance": "10.129.0.28:8081",
      "job": "olm-operator-metrics",
      "name": "packageserver",
      "namespace": "openshift-operator-lifecycle-manager",
      "phase": "Failed",
      "pod": "olm-operator-6bfbc48f47-g5k88",
      "prometheus": "openshift-monitoring/k8s",
      "reason": "ComponentUnhealthy",
      "service": "olm-operator-metrics",
      "severity": "warning",
      "version": "0.17.0"
    },
    "value": [ 1614136350.148, "1" ]
  }

No operator should fail during an upgrade anyway (a normal upgrade handles the transition cleanly). Setting to high because this looks like a legitimate failure rather than just an alert flake (the alert being too sensitive).
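For anyone triaging similar output, here is a minimal sketch of how the firing alerts could be pulled out of a result like the one above. It is not the actual origin test code. The function name `firing_alerts` and the `allowed` parameter are hypothetical, and the sample is a trimmed copy of the payload with a closing bracket added so that it parses as JSON.

```python
import json

# Trimmed copy of the alert sample from this report. The closing "]" is an
# assumption added so the sample is valid JSON; most labels are omitted.
SAMPLE = """
[
  {
    "metric": {
      "alertname": "FailingOperator",
      "alertstate": "firing",
      "name": "packageserver",
      "namespace": "openshift-operator-lifecycle-manager",
      "phase": "Failed",
      "reason": "ComponentUnhealthy",
      "severity": "warning"
    },
    "value": [1614136350.148, "1"]
  }
]
"""


def firing_alerts(result, allowed=frozenset()):
    """Return (alertname, namespace, severity) for each firing alert.

    `result` is a list of samples shaped like the payload in this report.
    `allowed` is a hypothetical set of alert names that should be ignored.
    """
    found = []
    for sample in result:
        labels = sample.get("metric", {})
        # Skip samples that are pending or otherwise not firing.
        if labels.get("alertstate") != "firing":
            continue
        # Skip alerts that the caller has explicitly tolerated.
        if labels.get("alertname") in allowed:
            continue
        found.append(
            (
                labels.get("alertname"),
                labels.get("namespace"),
                labels.get("severity"),
            )
        )
    return found


if __name__ == "__main__":
    for alert in firing_alerts(json.loads(SAMPLE)):
        print(alert)
```

A check along these lines would fail the run whenever the returned list is non-empty, which matches what happened with FailingOperator for packageserver here.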
This is also happening in normal e2e runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25925/pull-ci-openshift-origin-master-e2e-gcp/1365050126166396928
It happens in about 5% of runs, which makes it a top CI blocker. Please look at mitigating ASAP.
No spurious OLM alerts are firing in the upgrade and e2e tests anymore. For example:
upgrade: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25030/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1370909958069030912
e2e: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25959/pull-ci-openshift-origin-master-e2e-gcp/1371120660889210880
LGTM, verifying it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438