Bug 1811343

Summary:	GCP e2e release run reported high number of etcd changes alert
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Etcd	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED NOTABUG	QA Contact:	ge liu <geliu>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	skolicha, wking
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-04-03 20:33:50 UTC	Type:	Bug

Description Clayton Coleman 2020-03-07 21:13:39 UTC

GCP has never had a problem with etcd leadership changes. This needs triage to understand whether our alert is too tight, a bug occurred, or a recent etcd change is now causing more leader election changes. Since this is a significant source of recent problems, I am marking it as high and considering it a release blocker unless triaged.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/1927

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal]	1m22s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"etcdHighNumberOfLeaderChanges\",\"alertstate\":\"firing\",\"job\":\"etcd\",\"severity\":\"warning\"},\"value\":[1583543075.94,\"1\"]}]",
        },
    }
to be empty
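
For triage, the firing alert and the time of the sample can be pulled out of the failure message programmatically. The following is a minimal sketch, assuming the JSON array embedded in the message above has been copied out with the Go string escaping removed. The function name `firing_alerts` is hypothetical and is not part of the origin test suite.

```python
import json
from datetime import datetime, timezone

# JSON array copied from the failure message above, with the
# Go string escaping (backslash-quoted quotes) removed.
RESULT = (
    '[{"metric":{"__name__":"ALERTS",'
    '"alertname":"etcdHighNumberOfLeaderChanges",'
    '"alertstate":"firing","job":"etcd","severity":"warning"},'
    '"value":[1583543075.94,"1"]}]'
)


def firing_alerts(raw):
    """Return (alertname, severity, sample time in UTC) per firing sample.

    Hypothetical helper for illustration only.
    """
    alerts = []
    for sample in json.loads(raw):
        metric = sample["metric"]
        if metric.get("alertstate") != "firing":
            continue
        # Each sample value is a [unix_timestamp, "value"] pair.
        timestamp, _value = sample["value"]
        when = datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
        alerts.append((metric["alertname"], metric.get("severity"), when))
    return alerts


if __name__ == "__main__":
    for name, severity, when in firing_alerts(RESULT):
        print(name, severity, when.isoformat())
```

Converting the sample timestamp to UTC makes it easier to line the alert up with etcd logs from the same run.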

Comment 2 Suresh Kolichala 2020-03-11 19:15:36 UTC

Since 4.4 code is frozen, moving this BZ to 4.5.

Comment 4 W. Trevor King 2020-06-25 20:15:37 UTC

We saw this in a production cluster moving from 4.4.4 to 4.4.10. I am getting a must-gather now, but will attach it to bug 1825000, which seems to be the generic ticket tracking twitching etcdHighNumberOfLeaderChanges alerts.