olm-operator occasionally fires an alert for etcdoperator.v0.9.4, which causes the "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel/minimal]" test to fail.

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"FailingOperator\",\"alertstate\":\"firing\",\"endpoint\":\"https-metrics\",\"exported_namespace\":\"e2e-test-olm-23440-kxqrp\",\"instance\":\"10.130.0.12:8081\",\"job\":\"olm-operator-metrics\",\"name\":\"etcdoperator.v0.9.4\",\"namespace\":\"openshift-operator-lifecycle-manager\",\"phase\":\"Failed\",\"pod\":\"olm-operator-68f46b97f8-9qhfb\",\"reason\":\"InstallComponentFailedNoRetry\",\"service\":\"olm-operator-metrics\",\"severity\":\"info\",\"version\":\"0.9.4\"},\"value\":[1578892888.389,\"1\"]}]",
        },
    }
to be empty

The openshift-operator-lifecycle-manager_olm-operator pod logs show:

E0113 05:10:55.194946 1 queueinformer_operator.go:282] sync {"update" "e2e-test-olm-23440-kxqrp/etcdoperator.v0.9.4"} failed: error transitioning ClusterServiceVersion: requirements were not met and error updating CSV status: error updating ClusterServiceVersion status: Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com "etcdoperator.v0.9.4": StorageError: invalid object, Code: 4, Key: /kubernetes.io/operators.coreos.com/clusterserviceversions/e2e-test-olm-23440-kxqrp/etcdoperator.v0.9.4, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 24f992c9-ca40-44e3-a43b-f5185d351f00, UID in object meta:

According to the search at https://ci-search-ci-search-next.svc.ci.openshift.org/?search=%22olm-operator-metrics%22%2C%22name%22%3A%22etcdoperator.v0.9.4%22&maxAge=336h&context=2&type=all, this failure affects all platforms. Some recent examples:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.4/3561
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/769
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.4/377
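For anyone triaging similar failures, the check behind the failing test can be approximated offline. This is a minimal Python sketch, not the origin test's actual implementation: it applies the same selector as the promQL query to a list of ALERTS samples. The function name and the Watchdog sample are hypothetical; the FailingOperator sample is abbreviated from the failure output in this report.

```python
import re

# Alert names the conformance test tolerates, copied from the promQL selector.
ALLOWED_ALERTS = re.compile(r"Watchdog|AlertmanagerReceiversNotConfigured")


def unexpected_firing_alerts(samples):
    """Return samples matching
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate="firing"} >= 1.

    Prometheus regex matchers are fully anchored, hence fullmatch().
    """
    result = []
    for sample in samples:
        metric = sample["metric"]
        if metric.get("alertstate") != "firing":
            continue
        if ALLOWED_ALERTS.fullmatch(metric.get("alertname", "")):
            continue
        # sample["value"] is [timestamp, "value-as-string"].
        if float(sample["value"][1]) >= 1:
            result.append(sample)
    return result


SAMPLES = [
    # Abbreviated from the failure output in this report.
    {
        "metric": {
            "__name__": "ALERTS",
            "alertname": "FailingOperator",
            "alertstate": "firing",
            "name": "etcdoperator.v0.9.4",
            "reason": "InstallComponentFailedNoRetry",
            "severity": "info",
        },
        "value": [1578892888.389, "1"],
    },
    # Hypothetical sample, added only to show that allowed alerts are skipped.
    {
        "metric": {
            "__name__": "ALERTS",
            "alertname": "Watchdog",
            "alertstate": "firing",
        },
        "value": [1578892888.389, "1"],
    },
]

for sample in unexpected_firing_alerts(SAMPLES):
    print(sample["metric"]["alertname"], sample["metric"].get("name", ""))
```

The test expects this list to be empty; any remaining entry, such as the FailingOperator alert here, fails the run.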
*** Bug 1796739 has been marked as a duplicate of this bug. ***
*** Bug 1801907 has been marked as a duplicate of this bug. ***
This bug has been identified by our buildcops as a significant blocker for our merge queue. Please ensure the fix is merged as soon as possible, or provide updates here on what progress is being made.
If the etcd operator is unreliable, switch to a simpler operator that won't be so flaky.
This test is likely still flaking (due to other alerts). If someone can confirm that the OLM alert is no longer firing in recent CI jobs, we can self-verify. Nick, can you do that? QE isn't generally in a good position to verify fixes to CI flakes.
https://search.svc.ci.openshift.org/chart?search=%22olm-operator-metrics%22%2C%22name%22%3A%22etcdoperator.v0.9.4%22&maxAge=336h&context=2&type=all
This shows only 2 instances of the issue over the last 14 days, down from a relatively high percentage of runs. The instances I looked at seemed to have underlying cluster etcd issues, which could manifest as a stale cache in OLM and thus trigger this issue.
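For reference, the search URL in this comment can be decoded to see exactly what was matched and over what window. The sketch below uses only Python's standard library; the helper names are hypothetical, and reading `maxAge` as an hours value with an `h` suffix is an assumption based on the URL itself.

```python
from urllib.parse import parse_qs, urlsplit

CHART_URL = (
    "https://search.svc.ci.openshift.org/chart"
    "?search=%22olm-operator-metrics%22%2C%22name%22%3A%22etcdoperator.v0.9.4%22"
    "&maxAge=336h&context=2&type=all"
)


def decode_search_url(url):
    """Return the query parameters of the search URL as a flat dict."""
    params = parse_qs(urlsplit(url).query)
    # Each parameter appears once in this URL, so keep only the first value.
    return {key: values[0] for key, values in params.items()}


def max_age_days(max_age):
    """Convert a value such as '336h' to days (assumes an 'h' suffix)."""
    if not max_age.endswith("h"):
        raise ValueError(f"unsupported maxAge format: {max_age!r}")
    return int(max_age[:-1]) / 24


params = decode_search_url(CHART_URL)
print(params["search"])
print(max_age_days(params["maxAge"]), "days")
```

The decoded search string is a fragment of the alert's JSON labels, so it matches only runs where the `olm-operator-metrics` job reported an alert for `etcdoperator.v0.9.4`, and 336h corresponds to the 14 days mentioned in the comment.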
I picked one of the issues to use in the bug title, so it's more specific than just the generic alert unit name.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409