Bug 1790825

Summary: [ci] Flaky test: Prometheus when installed on the cluster shouldn't report any alerts in firing state: FailingOperator etcdoperator.v0.9.4
Product: OpenShift Container Platform Reporter: Martin André <m.andre>
Component: OLM Assignee: Evan Cordell <ecordell>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: bparees, ccoleman, dcbw, jerzhang, jiazha, nhale, wking
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1805588 Environment:
Last Closed: 2020-07-13 17:13:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1805588, 1822232    

Description Martin André 2020-01-14 10:36:29 UTC
olm-operator occasionally fires a FailingOperator alert for etcdoperator.v0.9.4 and causes the "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel/minimal]" test to fail.

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"FailingOperator\",\"alertstate\":\"firing\",\"endpoint\":\"https-metrics\",\"exported_namespace\":\"e2e-test-olm-23440-kxqrp\",\"instance\":\"10.130.0.12:8081\",\"job\":\"olm-operator-metrics\",\"name\":\"etcdoperator.v0.9.4\",\"namespace\":\"openshift-operator-lifecycle-manager\",\"phase\":\"Failed\",\"pod\":\"olm-operator-68f46b97f8-9qhfb\",\"reason\":\"InstallComponentFailedNoRetry\",\"service\":\"olm-operator-metrics\",\"severity\":\"info\",\"version\":\"0.9.4\"},\"value\":[1578892888.389,\"1\"]}]",
        },
    }
to be empty

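For anyone who wants to check a live cluster by hand, here is a rough Go sketch of the check the test performs: evaluate the same PromQL expression and expect it to return no series. This is not the origin test itself; the Prometheus route URL below is made up, and the bearer-token authentication the real in-cluster test uses is omitted.

package main

import (
	"context"
	"fmt"
	"time"

	promapi "github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := promapi.NewClient(promapi.Config{
		// Hypothetical route; auth (bearer token) omitted for brevity.
		Address: "https://prometheus-k8s-openshift-monitoring.apps.example.com",
	})
	if err != nil {
		panic(err)
	}
	prom := promv1.NewAPI(client)

	// Same expression the test evaluates; any result means an alert other than
	// Watchdog/AlertmanagerReceiversNotConfigured is firing and the test fails.
	query := `ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate="firing"} >= 1`

	result, _, err := prom.Query(context.Background(), query, time.Now())
	if err != nil {
		panic(err)
	}
	fmt.Printf("unexpected firing alerts: %v\n", result)
}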

The openshift-operator-lifecycle-manager_olm-operator pod logs show the following:

E0113 05:10:55.194946       1 queueinformer_operator.go:282] sync {"update" "e2e-test-olm-23440-kxqrp/etcdoperator.v0.9.4"} failed: error transitioning ClusterServiceVersion: requirements were not met and error updating CSV status: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "etcdoperator.v0.9.4": StorageError: invalid object, Code: 4, Key: /kubernetes.io/operators.coreos.com/clusterserviceversions/e2e-test-olm-23440-kxqrp/etcdoperator.v0.9.4, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 24f992c9-ca40-44e3-a43b-f5185d351f00, UID in object meta: 

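The "Precondition failed: UID in precondition ... UID in object meta:" part suggests the CSV had already been deleted (the e2e test namespace is torn down while OLM still holds a cached copy), so the status update conflicts and the CSV is reported as Failed. As a rough illustration only, not the OLM code or the eventual fix, a sync loop could classify that class of error as benign instead of surfacing it through the FailingOperator alert:

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// handleCSVStatusUpdateError is a hypothetical helper, not an OLM function.
// It treats "object already gone / cached copy stale" errors as retryable
// noise rather than a real install failure.
func handleCSVStatusUpdateError(err error) error {
	if err == nil {
		return nil
	}
	if apierrors.IsNotFound(err) || apierrors.IsConflict(err) {
		// The CSV was deleted (e.g. its e2e namespace was torn down) or the UID
		// in the cached copy no longer matches, as in the log line above;
		// requeue or drop the item instead of marking the CSV Failed.
		return nil
	}
	return err
}

func main() {
	fmt.Println(handleCSVStatusUpdateError(nil))
}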

According to the search at https://ci-search-ci-search-next.svc.ci.openshift.org/?search=%22olm-operator-metrics%22%2C%22name%22%3A%22etcdoperator.v0.9.4%22&maxAge=336h&context=2&type=all, this failure affects all platforms.

Some recent examples:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/3561
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/769
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/377

Comment 1 Evan Cordell 2020-01-31 13:45:46 UTC
*** Bug 1796739 has been marked as a duplicate of this bug. ***

Comment 2 Kevin Rizza 2020-02-20 14:18:11 UTC
*** Bug 1801907 has been marked as a duplicate of this bug. ***

Comment 3 Ben Parees 2020-03-18 00:24:06 UTC
This bug has been identified by our buildcops as a significant blocker for our merge queue. Please ensure the fix is merged ASAP, or provide updates here on what progress is being made.

Comment 4 Clayton Coleman 2020-03-18 04:34:46 UTC
If the etcd operator is unreliable, switch to a simpler operator that won't be so flaky.

Comment 12 Ben Parees 2020-03-25 14:17:05 UTC
This test is likely still flaking (due to other alerts). If someone can confirm that the OLM alert is no longer firing in recent CI jobs, we can self-verify. Nick, can you do that?


QE isn't generally in a good position to verify fixes to CI flakes.

Comment 14 Evan Cordell 2020-03-26 17:05:37 UTC
https://search.svc.ci.openshift.org/chart?search=%22olm-operator-metrics%22%2C%22name%22%3A%22etcdoperator.v0.9.4%22&maxAge=336h&context=2&type=all

This shows only 2 instances of the issue over the last 14 days, down from a relatively high percentage of runs. The instances I looked at seemed to have underlying cluster etcd issues, which could manifest as a stale cache in OLM (and thus trigger this issue).

Comment 15 W. Trevor King 2020-04-08 16:15:17 UTC
I picked one of the issues to use in the bug title, so it's more specific than just the generic alert unit name.

Comment 17 errata-xmlrpc 2020-07-13 17:13:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409