Bug 1932626

Summary: During a 4.8 GCP upgrade OLM fires an alert indicating the operator is unhealthy
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: OLM Assignee: Joe Lanford <jlanford>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: bluddy, jlanford
Version: 4.8 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-07-27 22:48:13 UTC Type: Bug

Description Clayton Coleman 2021-02-24 19:29:51 UTC
We are adding tests that verify no alerts fire during upgrades that otherwise succeed.  This caught a failure during a run (good!), so we need to investigate why this alert is firing and fix the bug that is causing it.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1364383982690504704

The run failed on:

    [
      {
        "metric": {
          "alertname": "FailingOperator",
          "alertstate": "firing",
          "container": "olm-operator",
          "endpoint": "https-metrics",
          "exported_namespace": "openshift-operator-lifecycle-manager",
          "instance": "10.129.0.28:8081",
          "job": "olm-operator-metrics",
          "name": "packageserver",
          "namespace": "openshift-operator-lifecycle-manager",
          "phase": "Failed",
          "pod": "olm-operator-6bfbc48f47-g5k88",
          "prometheus": "openshift-monitoring/k8s",
          "reason": "ComponentUnhealthy",
          "service": "olm-operator-metrics",
          "severity": "warning",
          "version": "0.17.0"
        },
        "value": [
          1614136350.148,
          "1"
        ]
      }
    ]

No operator should fail during an upgrade anyway (normal upgrade behavior is to handle the transition cleanly).

Setting severity to high because this looks like a legitimate failure rather than an alert flake (an alert that is too sensitive).
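
For illustration only, here is a minimal sketch of the kind of check described above: given the result of a Prometheus instant query such as ALERTS{alertstate="firing"}, report any firing alert that is not on an allowlist. This is not the actual origin test code. The function name, the allowlist contents, and the trimmed sample payload are assumptions made for the example; the label values are copied from the result shown above.

```python
import json

# Hypothetical allowlist: alerts that are expected to fire on a healthy cluster.
ALLOWED_ALERTS = frozenset({"Watchdog"})


def unexpected_firing_alerts(result, allowed=ALLOWED_ALERTS):
    """Return (alertname, name, reason) for each firing alert not in `allowed`.

    `result` is the list found under data.result in a Prometheus
    instant-query response, i.e. a list of {"metric": {...}, "value": [...]}.
    """
    found = []
    for sample in result:
        labels = sample["metric"]
        if labels.get("alertstate") != "firing":
            continue  # skip pending alerts
        if labels.get("alertname") in allowed:
            continue  # skip alerts that are expected to fire
        found.append(
            (labels.get("alertname"), labels.get("name"), labels.get("reason"))
        )
    return found


# Trimmed copy of the result from this report (only a subset of the labels).
sample_result = json.loads("""
[
  {
    "metric": {
      "alertname": "FailingOperator",
      "alertstate": "firing",
      "name": "packageserver",
      "namespace": "openshift-operator-lifecycle-manager",
      "phase": "Failed",
      "reason": "ComponentUnhealthy",
      "severity": "warning"
    },
    "value": [1614136350.148, "1"]
  }
]
""")

for alertname, name, reason in unexpected_firing_alerts(sample_result):
    print(f"unexpected alert {alertname} for {name}: {reason}")
```

A test built this way fails the run whenever the returned list is non-empty, which is how the packageserver failure above would surface.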

Comment 2 Clayton Coleman 2021-02-26 00:36:16 UTC
Also happening in normal e2e runs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25925/pull-ci-openshift-origin-master-e2e-gcp/1365050126166396928

This is happening in about 5% of runs, which makes it a top CI blocker.

Please look at mitigating ASAP.

Comment 7 errata-xmlrpc 2021-07-27 22:48:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2
bug fix and security update), and where to find the updated files, follow the
link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438