Bug 1790825 - [ci] Flaky test: Prometheus when installed on the cluster shouldn't report any alerts in firing state: FailingOperator etcdoperator.v0.9.4
Summary: [ci] Flaky test: Prometheus when installed on the cluster shouldn't report any alerts in firing state: FailingOperator etcdoperator.v0.9.4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Evan Cordell
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1796739, 1801907
Depends On:
Blocks: 1805588 1822232
 
Reported: 2020-01-14 10:36 UTC by Martin André
Modified: 2020-07-13 17:13 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1805588
Environment:
Last Closed: 2020-07-13 17:13:17 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub: openshift/origin pull 24436 (closed), "Bug 1790825: fix test flake in olm tests", last updated 2020-09-16 07:51:03 UTC
- GitHub: openshift/origin pull 24723 (closed), "Bug 1790825: fix test flake in operators test", last updated 2020-09-16 07:51:03 UTC
- Red Hat Product Errata: RHBA-2020:2409, last updated 2020-07-13 17:13:43 UTC

Description Martin André 2020-01-14 10:36:29 UTC
olm-operator occasionally triggers a FailingOperator alert for etcdoperator.v0.9.4, causing the "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel/minimal]" test to fail.

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"FailingOperator\",\"alertstate\":\"firing\",\"endpoint\":\"https-metrics\",\"exported_namespace\":\"e2e-test-olm-23440-kxqrp\",\"instance\":\"10.130.0.12:8081\",\"job\":\"olm-operator-metrics\",\"name\":\"etcdoperator.v0.9.4\",\"namespace\":\"openshift-operator-lifecycle-manager\",\"phase\":\"Failed\",\"pod\":\"olm-operator-68f46b97f8-9qhfb\",\"reason\":\"InstallComponentFailedNoRetry\",\"service\":\"olm-operator-metrics\",\"severity\":\"info\",\"version\":\"0.9.4\"},\"value\":[1578892888.389,\"1\"]}]",
        },
    }
to be empty
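
For reference, the check the conformance test performs amounts to running that PromQL query against the in-cluster Prometheus and expecting an empty result vector. Below is a minimal Go sketch of that query, not the origin test code itself; the Prometheus URL and bearer token are placeholders to substitute for your cluster.

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Placeholders: substitute the Prometheus route and a token for a service
    // account that is allowed to query it.
    promURL := "https://prometheus-k8s-openshift-monitoring.apps.example.com"
    token := "REPLACE_WITH_BEARER_TOKEN"

    // The same PromQL the test uses: any firing alert other than Watchdog or
    // AlertmanagerReceiversNotConfigured.
    query := `ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate="firing"} >= 1`

    req, err := http.NewRequest("GET",
        promURL+"/api/v1/query?"+url.Values{"query": {query}}.Encode(), nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Authorization", "Bearer "+token)

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The test expects no series back; any returned series, such as the
    // FailingOperator alert for etcdoperator.v0.9.4 above, fails it.
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}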


The openshift-operator-lifecycle-manager_olm-operator pod logs show:

E0113 05:10:55.194946       1 queueinformer_operator.go:282] sync {"update" "e2e-test-olm-23440-kxqrp/etcdoperator.v0.9.4"} failed: error transitioning ClusterServiceVersion: requirements were not met and error updating CSV status: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "etcdoperator.v0.9.4": StorageError: invalid object, Code: 4, Key: /kubernetes.io/operators.coreos.com/clusterserviceversions/e2e-test-olm-23440-kxqrp/etcdoperator.v0.9.4, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 24f992c9-ca40-44e3-a43b-f5185d351f00, UID in object meta: 
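
The UID-precondition failure above is the error an update returns when the object was deleted between the cached read and the status write. The linked PRs address this on the test side; purely as an illustration of how a controller typically guards such an update (re-get, retry on conflict, tolerate not-found), here is a minimal client-go sketch. syncCSVStatus and the closures passed to it are hypothetical helpers, not OLM code.

package main

import (
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/client-go/util/retry"
)

// syncCSVStatus is a hypothetical wrapper: getLatest re-reads the object from
// the API server and update writes the new status based on that fresh copy.
func syncCSVStatus(getLatest, update func() error) error {
    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Re-read before writing so the update is not based on a stale cached copy.
        if err := getLatest(); err != nil {
            return err
        }
        return update()
    })
    if apierrors.IsNotFound(err) {
        // The object was deleted out from under us (e.g. the e2e test namespace
        // is being torn down); there is nothing left to update, so don't report
        // it as a sync failure.
        return nil
    }
    return err
}

func main() {
    // Stub closures stand in for real client calls.
    err := syncCSVStatus(
        func() error { return nil },
        func() error { return nil },
    )
    fmt.Println("sync result:", err)
}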


According to the search at https://ci-search-ci-search-next.svc.ci.openshift.org/?search=%22olm-operator-metrics%22%2C%22name%22%3A%22etcdoperator.v0.9.4%22&maxAge=336h&context=2&type=all, this failure touches all platforms.

Some recent examples:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/3561
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/769
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/377

Comment 1 Evan Cordell 2020-01-31 13:45:46 UTC
*** Bug 1796739 has been marked as a duplicate of this bug. ***

Comment 2 Kevin Rizza 2020-02-20 14:18:11 UTC
*** Bug 1801907 has been marked as a duplicate of this bug. ***

Comment 3 Ben Parees 2020-03-18 00:24:06 UTC
This bug has been identified by our buildcops as a significant blocker for our merge queue. Please ensure the fix is merged ASAP, or provide updates here on what progress is being made.

Comment 4 Clayton Coleman 2020-03-18 04:34:46 UTC
If the etcd operator is unreliable, switch to a simpler operator that won't be so flaky.

Comment 12 Ben Parees 2020-03-25 14:17:05 UTC
This test is likely still flaking (due to other alerts). If someone can confirm that the OLM alert is no longer firing in recent CI jobs, we can self-verify. Nick, can you do that?


QE isn't generally in a good position to verify fixes to CI flakes.

Comment 14 Evan Cordell 2020-03-26 17:05:37 UTC
https://search.svc.ci.openshift.org/chart?search=%22olm-operator-metrics%22%2C%22name%22%3A%22etcdoperator.v0.9.4%22&maxAge=336h&context=2&type=all

This shows only 2 instances of the issue over the last 14 days, down from a relatively high percentage of runs. The instances I looked at seemed to have underlying cluster etcd issues, which could manifest as a stale cache in OLM (and thus trigger this issue).

Comment 15 W. Trevor King 2020-04-08 16:15:17 UTC
I picked one of the issues to use in the bug title, so it's more specific than just the generic alert unit name.

Comment 17 errata-xmlrpc 2020-07-13 17:13:17 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

