Bug 1811343 - GCP e2e release run reported high number of etcd changes alert
Summary: GCP e2e release run reported high number of etcd changes alert
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-07 21:13 UTC by Clayton Coleman
Modified: 2020-06-25 20:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-03 20:33:50 UTC
Target Upstream Version:
Embargoed:



Description Clayton Coleman 2020-03-07 21:13:39 UTC
GCP has never had a problem with etcd leadership changes. This needs triage to understand whether our alert is too tight, a bug happened, or a recent etcd change is now causing more leader election changes. Since this is a significant source of recent problems, I am marking it as high and considering it a release blocker unless triaged.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/1927

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal] 1m22s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"etcdHighNumberOfLeaderChanges\",\"alertstate\":\"firing\",\"job\":\"etcd\",\"severity\":\"warning\"},\"value\":[1583543075.94,\"1\"]}]",
        },
    }
to be empty
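
For triage, the result embedded in the failure message above can be decoded to see which alert tripped the test and when. The following is only a rough sketch, not code from openshift/origin: the names sample, firing, firingAlerts and failureResult are made up for illustration, and the JSON literal is the result copied from the message with the Go string escaping removed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// sample mirrors one element of the result array quoted in the failure
// message: a label set plus a [timestamp, value] pair.
// (Hypothetical type, not taken from the origin test code.)
type sample struct {
	Metric map[string]string `json:"metric"`
	Value  [2]interface{}    `json:"value"` // [unix seconds as float, value as string]
}

// firing is a flattened view of one firing alert (hypothetical type).
type firing struct {
	Name     string
	Severity string
	At       time.Time
}

// failureResult is the JSON from the failure message, unescaped.
const failureResult = `[{"metric":{"__name__":"ALERTS","alertname":"etcdHighNumberOfLeaderChanges","alertstate":"firing","job":"etcd","severity":"warning"},"value":[1583543075.94,"1"]}]`

// firingAlerts decodes the result array and returns one entry per sample.
func firingAlerts(raw string) ([]firing, error) {
	var samples []sample
	if err := json.Unmarshal([]byte(raw), &samples); err != nil {
		return nil, fmt.Errorf("decoding result: %w", err)
	}
	out := make([]firing, 0, len(samples))
	for _, s := range samples {
		ts, ok := s.Value[0].(float64)
		if !ok {
			return nil, fmt.Errorf("unexpected timestamp type %T", s.Value[0])
		}
		out = append(out, firing{
			Name:     s.Metric["alertname"],
			Severity: s.Metric["severity"],
			// Sub-second precision is dropped; whole seconds are enough for triage.
			At: time.Unix(int64(ts), 0).UTC(),
		})
	}
	return out, nil
}

func main() {
	alerts, err := firingAlerts(failureResult)
	if err != nil {
		panic(err)
	}
	for _, a := range alerts {
		fmt.Printf("%s (%s) sampled at %s\n", a.Name, a.Severity, a.At.Format(time.RFC3339))
	}
}
```

The only alert in the result is etcdHighNumberOfLeaderChanges at warning severity from the etcd job, so nothing else was firing when the test sampled ALERTS.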

Comment 2 Suresh Kolichala 2020-03-11 19:15:36 UTC
Since 4.4 code is frozen, moving this BZ to 4.5.

Comment 4 W. Trevor King 2020-06-25 20:15:37 UTC
We saw this in a production cluster moving from 4.4.4 to 4.4.10.  Getting a must-gather now, but will attach to bug 1825000, which seems like the generic ticket tracking twitching etcdHighNumberOfLeaderChanges.

