Bug 1932626 - During a 4.8 GCP upgrade OLM fires an alert indicating the operator is unhealthy
Summary: During a 4.8 GCP upgrade OLM fires an alert indicating the operator is unhealthy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Joe Lanford
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-24 19:29 UTC by Clayton Coleman
Modified: 2021-07-27 22:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:48:13 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | operator-framework operator-lifecycle-manager pull 2024 | 0 | None | open | Bug 1932626: Gracefully handle service unavailable errors from kube-apiserver | 2021-03-02 00:55:53 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:48:34 UTC

Description Clayton Coleman 2021-02-24 19:29:51 UTC
We are adding tests that verify no alerts fire during upgrades that do not themselves fail. One of these tests caught a failure during a run (good!), so we need to investigate why this alert is firing and fix the underlying bug.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1364383982690504704

The run failed on this alert:

    [
      {
        "metric": {
          "alertname": "FailingOperator",
          "alertstate": "firing",
          "container": "olm-operator",
          "endpoint": "https-metrics",
          "exported_namespace": "openshift-operator-lifecycle-manager",
          "instance": "10.129.0.28:8081",
          "job": "olm-operator-metrics",
          "name": "packageserver",
          "namespace": "openshift-operator-lifecycle-manager",
          "phase": "Failed",
          "pod": "olm-operator-6bfbc48f47-g5k88",
          "prometheus": "openshift-monitoring/k8s",
          "reason": "ComponentUnhealthy",
          "service": "olm-operator-metrics",
          "severity": "warning",
          "version": "0.17.0"
        },
        "value": [
          1614136350.148,
          "1"
        ]
      }
    ]

No operator should fail during an upgrade anyway; a normal upgrade handles the transition cleanly.

Setting severity to high because this looks like a legitimate failure rather than just an alert flake (an overly sensitive alert).
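
For anyone triaging similar runs, here is a minimal sketch of how a check might pick unexpected firing alerts out of a result shaped like the JSON above. This is not the actual origin test code: the function name `firing_alerts` and the `ALLOWED_ALERTS` allow-list are hypothetical, and the sample is a trimmed copy of the one in this report.

```python
import json

# Hypothetical allow-list (an assumption): alerts a check might tolerate.
ALLOWED_ALERTS = {"Watchdog"}


def firing_alerts(samples, allowed=ALLOWED_ALERTS):
    """Return (alertname, name, reason) for each firing alert not in `allowed`.

    `samples` is a list shaped like the JSON in this report: each entry has a
    "metric" label map and a "value" pair of [timestamp, "1"].
    """
    found = []
    for sample in samples:
        labels = sample.get("metric", {})
        _, value = sample.get("value", [None, "0"])
        # Skip anything that is not an active, firing alert sample.
        if labels.get("alertstate") != "firing" or value != "1":
            continue
        if labels.get("alertname") in allowed:
            continue
        found.append(
            (labels.get("alertname"), labels.get("name"), labels.get("reason"))
        )
    return found


# Trimmed copy of the sample from this report.
REPORTED = json.loads("""
[
  {
    "metric": {
      "alertname": "FailingOperator",
      "alertstate": "firing",
      "name": "packageserver",
      "namespace": "openshift-operator-lifecycle-manager",
      "phase": "Failed",
      "reason": "ComponentUnhealthy",
      "severity": "warning"
    },
    "value": [1614136350.148, "1"]
  }
]
""")

print(firing_alerts(REPORTED))
# prints [('FailingOperator', 'packageserver', 'ComponentUnhealthy')]
```

A check built this way would fail the run whenever the returned list is non-empty, which matches the behavior described in this report.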

Comment 2 Clayton Coleman 2021-02-26 00:36:16 UTC
This is also happening in normal e2e runs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25925/pull-ci-openshift-origin-master-e2e-gcp/1365050126166396928

It is happening in about 5% of runs, which makes it a top CI blocker.

Please look at mitigating ASAP.

Comment 7 errata-xmlrpc 2021-07-27 22:48:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

