[sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel] is failing periodically in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-etcd%5D%20etcd%20leader%20changes%20are%20not%20excessive%20%5BLate%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

This is one of only 4 tests failing on 4.10 aws serial, and TRT would like this job to become reliable.
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-blocking#periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial

Sample failing run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1445319167623303168
TRT bumping to priority high after this killed an expensive aggregated job last night:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-sdn-upgrade-4.10-micro-release-openshift-release-analysis-aggregator/1463745865125466112

In there, the 5 failed jobs all had something akin to:

alert etcdHighNumberOfLeaderChanges fired for 120 seconds with labels: {endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-64dgic8n-a9175-gmh6z-master-1", service="etcd", severity="warning"}

This kicked off *really* dramatically late on Thursday, Nov 18th:
https://search.ci.openshift.org/chart?search=alert+etcdHighNumberOfLeaderChanges+fired+for&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

It's hitting on aws, azure, and gcp:
https://search.ci.openshift.org/?search=alert+etcdHighNumberOfLeaderChanges+fired+for&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
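For anyone poking at this in PromeCIeus or on a live cluster: the alert is built on etcd_server_leader_changes_seen_total. The query below is only a rough sketch for eyeballing per-member churn, not the shipped rule expression (check the etcd PrometheusRule on the cluster for the exact expr and threshold):

  # Sketch: leader changes seen per etcd member over a 15m window.
  # The shipped etcdHighNumberOfLeaderChanges rule uses a similar increase()
  # over this metric; confirm the real expression against the PrometheusRule.
  sum by (pod) (increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]))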
Focusing on https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1463745856736858112 which is one of the jobs aggregated beneath the job I mentioned in comment #5. (For those unaware, aggregated jobs = fire off 10 job runs and compare test pass rates with historical averages.)

In the event filters you can see a sea of weird etcd events:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1463745856736858112/artifacts/e2e-gcp-upgrade/gather-must-gather/artifacts/event-filter.html

Large numbers of events where a member "is unhealthy", spanning an entire hour and multiple masters. A defrag was also attempted for one member at some point.
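If anyone else wants to slice those events without clicking through the HTML, the same data should be in the raw events.json under the job's gather-extra artifacts. The path below is a guess based on how other runs lay out their artifacts (adjust to whatever this run actually produced), and the filter is deliberately loose:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1463745856736858112/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/events.json \
    | jq -r '[.items[] | select(((.metadata.namespace // "") | test("etcd")) and ((.message // "") | test("unhealthy"; "i")))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + (.metadata.namespace // "") + " " + (.reason // "") + ": " + (.message // "")'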
*** Bug 2053380 has been marked as a duplicate of this bug. ***
Poking at [1]:

: [sig-arch][bz-etcd][Late] Alerts alert/etcdHighNumberOfLeaderChanges should not be at or above info [Suite:openshift/conformance/parallel] 2s
{ fail [github.com/openshift/origin/test/extended/prometheus/alerts.go:24]: Apr 5 20:05:44.316: etcdHighNumberOfLeaderChanges was at or above info for at least 10m57s on platformidentification.JobType{Release:"4.11", FromRelease:"", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=6m12.6s): pending for 11m6s, firing for 10m57s:
Apr 05 18:55:09.000 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-175-63.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-175-63.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:09.000 - 179s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-185-3.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-185-3.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:09.000 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-199-213.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-199-213.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}

: [bz-etcd][invariant] alert/etcdHighNumberOfLeaderChanges should not be at or above info 0s
{ etcdHighNumberOfLeaderChanges was at or above info for at least 10m57s on platformidentification.JobType{Release:"4.11", FromRelease:"", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=6m12.6s): pending for 11m3s, firing for 10m57s:
Apr 05 18:55:08.790 - 179s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-185-3.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-185-3.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:08.790 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-175-63.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-175-63.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:08.790 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-199-213.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-199-213.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}

Dropping into PromeCIeus, etcd_server_leader_changes_seen_total{job=~".*etcd.*"} showed a whole bunch of activity early in the run, while the static-pod controller rolled out rounds of config changes:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1511407039413227520/artifacts/e2e-aws-serial/gather-extra/artifacts/events.json \
    | jq -r '[.items[] | select(.metadata.namespace == "openshift-etcd-operator" and .reason == "RevisionTriggered")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .reason + ": " + .message'
2022-04-05T18:29:29Z RevisionTriggered: new revision 1 triggered by "configmap \"etcd-pod\" not found"
2022-04-05T18:30:26Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod\" not found"
2022-04-05T18:30:30Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod-1\" not found"
2022-04-05T18:30:46Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod-1\" not found"
2022-04-05T18:32:10Z RevisionTriggered: new revision 3 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:32:23Z RevisionTriggered: new revision 4 triggered by "configmap/etcd-pod has changed"
2022-04-05T18:33:45Z RevisionTriggered: new revision 5 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:33:56Z RevisionTriggered: new revision 5 triggered by "configmap/etcd-pod has changed,configmap/etcd-endpoints has changed"
2022-04-05T18:34:52Z RevisionTriggered: new revision 6 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:35:02Z RevisionTriggered: new revision 6 triggered by "configmap/etcd-pod has changed,configmap/etcd-endpoints has changed"
2022-04-05T18:46:26Z RevisionTriggered: new revision 7 triggered by "configmap/etcd-pod has changed"

Maybe the static pod controller should grow a new metric for config revision, and the etcdHighNumberOfLeaderChanges expr could be updated to say "when the leader churn is higher than what I'd expect given the revision churn"?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1511407039413227520
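To make that last suggestion a bit more concrete, here is a purely hypothetical sketch of the shape I mean. static_pod_revision is an invented metric name (nothing exposes it today; it's the metric the static-pod controller would have to grow), and the window/threshold are placeholders rather than a proposal for the real rule:

  # Hypothetical sketch only: static_pod_revision does not exist yet.
  # Fire only when a member saw leader changes while no new etcd static-pod
  # revision was rolled out in the same window.
  (
    sum by (pod) (increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m])) > 4
  )
  and on()
  (
    max(changes(static_pod_revision{namespace="openshift-etcd"}[15m])) == 0
  )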