Bug 1932626

Summary:	During a 4.8 GCP upgrade OLM fires an alert indicating the operator is unhealthy
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	OLM	Assignee:	Joe Lanford <jlanford>
OLM sub component:	OLM	QA Contact:	Jian Zhang <jiazha>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	bluddy, jlanford
Version:	4.8	Keywords:	Triaged
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 22:48:13 UTC	Type:	Bug

Description Clayton Coleman 2021-02-24 19:29:51 UTC

We are adding tests that verify no alerts fire during upgrades that do not fail. This caught a real failure during a run (good!), so we need to investigate why this alert is firing and fix the bug that is causing it.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1364383982690504704

The run failed on:

    [
      {
        "metric": {
          "alertname": "FailingOperator",
          "alertstate": "firing",
          "container": "olm-operator",
          "endpoint": "https-metrics",
          "exported_namespace": "openshift-operator-lifecycle-manager",
          "instance": "10.129.0.28:8081",
          "job": "olm-operator-metrics",
          "name": "packageserver",
          "namespace": "openshift-operator-lifecycle-manager",
          "phase": "Failed",
          "pod": "olm-operator-6bfbc48f47-g5k88",
          "prometheus": "openshift-monitoring/k8s",
          "reason": "ComponentUnhealthy",
          "service": "olm-operator-metrics",
          "severity": "warning",
          "version": "0.17.0"
        },
        "value": [
          1614136350.148,
          "1"
        ]
      }
    ]

No operator should fail during an upgrade anyway (a normal upgrade handles the transition cleanly).

Setting severity to high because this looks like a legitimate failure rather than just an alert flake (the alert being too sensitive).
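
For anyone triaging similar results, here is a minimal sketch of the kind of check described above: scan a Prometheus query result shaped like the one quoted in this report and list the alerts that are firing. This is not the actual origin test code. The function name `unexpected_firing_alerts` and the `allowed` allowlist are hypothetical, and the sample keeps only a few of the labels from the result above.

```python
import json

# Sample shaped like the result quoted in this report (labels trimmed).
SAMPLE = """
[
  {
    "metric": {
      "alertname": "FailingOperator",
      "alertstate": "firing",
      "name": "packageserver",
      "namespace": "openshift-operator-lifecycle-manager",
      "phase": "Failed",
      "reason": "ComponentUnhealthy",
      "severity": "warning"
    },
    "value": [1614136350.148, "1"]
  }
]
"""


def unexpected_firing_alerts(result, allowed=frozenset()):
    """Return (alertname, namespace, name) for each firing alert not in `allowed`.

    `result` is a list of samples, each with a "metric" label map and a
    "value" pair of [timestamp, string value]. `allowed` is a hypothetical
    allowlist of alert names that a test would tolerate.
    """
    found = []
    for sample in result:
        metric = sample.get("metric", {})
        # Skip samples for alerts that are pending rather than firing.
        if metric.get("alertstate") != "firing":
            continue
        # The sample value is a string; "1" marks an active alert.
        _, value = sample.get("value", [None, "0"])
        if float(value) != 1.0:
            continue
        if metric.get("alertname") in allowed:
            continue
        found.append(
            (metric.get("alertname"), metric.get("namespace"), metric.get("name"))
        )
    return found


if __name__ == "__main__":
    for alert in unexpected_firing_alerts(json.loads(SAMPLE)):
        print("unexpected firing alert:", alert)
```

A test built this way would fail whenever the returned list is non-empty, which is how a result like the one above would surface as a test failure.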

Comment 2 Clayton Coleman 2021-02-26 00:36:16 UTC

Also happening in normal e2e runs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25925/pull-ci-openshift-origin-master-e2e-gcp/1365050126166396928

This is happening in about 5% of runs, which makes it a top CI blocker.

Please look at mitigating ASAP.

Comment 4 Jian Zhang 2021-03-15 09:24:35 UTC

No erroneous OLM alert is firing in the upgrade and e2e tests. For example:
upgrade: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25030/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1370909958069030912
e2e: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25959/pull-ci-openshift-origin-master-e2e-gcp/1371120660889210880
LGTM, verify it.

Comment 7 errata-xmlrpc 2021-07-27 22:48:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438