1811343 – GCP e2e release run reported high number of etcd changes alert

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-03-07 21:13 UTC by Clayton Coleman
Modified:	2020-06-25 20:15 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-04-03 20:33:50 UTC
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2020-03-07 21:13:39 UTC

GCP has never had a problem with etcd leadership changes. This needs triage to understand whether our alert is too tight, a bug happened, or a recent etcd change is now causing more leader election changes. Since this is a significant source of recent problems, marking it as high and considering it a release blocker unless triaged.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/1927

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal] 1m22s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"etcdHighNumberOfLeaderChanges\",\"alertstate\":\"firing\",\"job\":\"etcd\",\"severity\":\"warning\"},\"value\":[1583543075.94,\"1\"]}]",
        },
    }
to be empty
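
For triage, the result embedded in the failure message above can be decoded to see which alert fired and when it was sampled. The following is only a minimal sketch: the payload is copied from the message with the string escapes removed, and the helper name `firing_alerts` is hypothetical, not part of the origin test suite.

```python
import json
from datetime import datetime, timezone

# Instant-vector result copied from the failure message (escapes removed).
payload = (
    '[{"metric":{"__name__":"ALERTS","alertname":"etcdHighNumberOfLeaderChanges",'
    '"alertstate":"firing","job":"etcd","severity":"warning"},'
    '"value":[1583543075.94,"1"]}]'
)

def firing_alerts(raw):
    """Return (alertname, severity, unix_timestamp) for each sample in the result.

    Hypothetical helper for illustration; assumes each sample has the
    "metric" / "value" shape seen in the failure message.
    """
    return [
        (s["metric"]["alertname"], s["metric"].get("severity"), s["value"][0])
        for s in json.loads(raw)
    ]

alerts = firing_alerts(payload)
for name, severity, ts in alerts:
    # Convert the sample's Unix timestamp to UTC for comparison with job logs.
    sampled_at = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(name, severity, sampled_at.isoformat())
```

Here the only sample is etcdHighNumberOfLeaderChanges at severity warning, so the test failed on that single alert rather than on several alerts firing at once.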

Comment 2 Suresh Kolichala 2020-03-11 19:15:36 UTC

Since 4.4 code is frozen, moving this BZ to 4.5.

Comment 4 W. Trevor King 2020-06-25 20:15:37 UTC

We saw this in a production cluster moving from 4.4.4 to 4.4.10.  Getting a must-gather now, but will attach to bug 1825000, which seems like the generic ticket tracking twitching etcdHighNumberOfLeaderChanges.
