Bug 1811343

Summary: GCP e2e release run reported high number of etcd changes alert
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Etcd Assignee: Sam Batschelet <sbatsche>
Status: CLOSED NOTABUG QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4 CC: skolicha, wking
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-03 20:33:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Clayton Coleman 2020-03-07 21:13:39 UTC
GCP has never had a problem with etcd leadership changes. This needs triage to understand whether our alert is too tight, a bug happened, or a recent etcd change is now causing more leader election changes. Since this is a significant source of recent problems, I am marking it as high and considering it a release blocker unless triaged.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/1927

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal] 1m22s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"etcdHighNumberOfLeaderChanges\",\"alertstate\":\"firing\",\"job\":\"etcd\",\"severity\":\"warning\"},\"value\":[1583543075.94,\"1\"]}]",
        },
    }
to be empty
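
For triage, it can help to decode the result payload embedded in the failure message. The following is a minimal Python sketch, not part of the origin test suite: the function name summarize_firing_alerts and the constant RAW_RESULT are hypothetical, and the payload is the JSON from the message above with the Go string escaping removed. It lists each firing alert with its severity and the sample time converted to UTC.

```python
import json
from datetime import datetime, timezone

# Result payload copied from the failure message above, with the Go
# string escaping (\" and \n) removed. RAW_RESULT is a hypothetical name.
RAW_RESULT = (
    '[{"metric":{"__name__":"ALERTS",'
    '"alertname":"etcdHighNumberOfLeaderChanges",'
    '"alertstate":"firing","job":"etcd","severity":"warning"},'
    '"value":[1583543075.94,"1"]}]'
)


def summarize_firing_alerts(raw):
    """Return (alertname, severity, UTC datetime) for each sample in the payload."""
    summary = []
    for sample in json.loads(raw):
        labels = sample["metric"]
        # Each sample value is a [unix_timestamp, string_value] pair.
        timestamp, _value = sample["value"]
        when = datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
        summary.append((labels["alertname"], labels.get("severity", ""), when))
    return summary


for name, severity, when in summarize_firing_alerts(RAW_RESULT):
    print(f"{name} severity={severity} sampled_at={when:%Y-%m-%d %H:%M:%S} UTC")
```

For this run the sample time decodes to early on 2020-03-07 UTC, which gives a starting point for lining the alert up against etcd logs from the same job.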

Comment 2 Suresh Kolichala 2020-03-11 19:15:36 UTC
Since 4.4 code is frozen, moving this BZ to 4.5.

Comment 4 W. Trevor King 2020-06-25 20:15:37 UTC
We saw this in a production cluster moving from 4.4.4 to 4.4.10.  Getting a must-gather now, but will attach to bug 1825000, which seems like the generic ticket tracking twitching etcdHighNumberOfLeaderChanges.