Bug 2010989
| Summary: | [sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel] | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Devan Goodwin <dgoodwin> |
| Component: | Etcd | Assignee: | Haseeb Tariq <htariq> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.10 | CC: | dwest, htariq, sippy, tjungblu, wking |
| Target Milestone: | --- | Keywords: | DeliveryBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | job=periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial |
| Last Closed: | 2022-07-20 13:14:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Devan Goodwin 2021-10-05 17:57:52 UTC
TRT bumping to priority high after this killed an expensive aggregated job last night: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-sdn-upgrade-4.10-micro-release-openshift-release-analysis-aggregator/1463745865125466112

In there the 5 failed jobs all had something akin to:

alert etcdHighNumberOfLeaderChanges fired for 120 seconds with labels: {endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-64dgic8n-a9175-gmh6z-master-1", service="etcd", severity="warning"}

This kicked off *really* dramatically late on Thursday Nov 18th: https://search.ci.openshift.org/chart?search=alert+etcdHighNumberOfLeaderChanges+fired+for&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

It's hitting on aws, azure and gcp: https://search.ci.openshift.org/?search=alert+etcdHighNumberOfLeaderChanges+fired+for&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Focusing on https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1463745856736858112 which is one of the jobs aggregated beneath the job I mentioned in comment #5. (For those unaware, aggregated jobs = fire off 10 job runs and compare test pass rates with historical averages.)

In the event filters you can see a sea of weird etcd events: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1463745856736858112/artifacts/e2e-gcp-upgrade/gather-must-gather/artifacts/event-filter.html

Large numbers of events where a member "is unhealthy", spanning an entire hour and multiple masters. Also a defrag was attempted for one member at some point.

*** Bug 2053380 has been marked as a duplicate of this bug. ***

Poking at [1]:
: [sig-arch][bz-etcd][Late] Alerts alert/etcdHighNumberOfLeaderChanges should not be at or above info [Suite:openshift/conformance/parallel] 2s
{ fail [github.com/openshift/origin/test/extended/prometheus/alerts.go:24]: Apr 5 20:05:44.316: etcdHighNumberOfLeaderChanges was at or above info for at least 10m57s on platformidentification.JobType{Release:"4.11", FromRelease:"", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=6m12.6s): pending for 11m6s, firing for 10m57s:
Apr 05 18:55:09.000 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-175-63.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-175-63.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:09.000 - 179s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-185-3.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-185-3.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:09.000 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-199-213.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-199-213.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}
: [bz-etcd][invariant] alert/etcdHighNumberOfLeaderChanges should not be at or above info 0s
{ etcdHighNumberOfLeaderChanges was at or above info for at least 10m57s on platformidentification.JobType{Release:"4.11", FromRelease:"", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=6m12.6s): pending for 11m3s, firing for 10m57s:
Apr 05 18:55:08.790 - 179s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-185-3.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-185-3.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:08.790 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-175-63.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-175-63.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:08.790 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-199-213.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-199-213.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}
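The reported firing duration appears to be the sum of the per-pod firing windows above (239s + 179s + 239s), which matches the "firing for 10m57s" in the failure message. A quick sanity check of that arithmetic, assuming simple summation:

```shell
# Per-pod firing durations from the alert output above, in seconds.
total=$((239 + 179 + 239))
# maxAllowed=6m12.6s is 372.6s; the summed firing time clearly exceeds it.
echo "firing total: ${total}s = $((total / 60))m$((total % 60))s (maxAllowed 372.6s)"
```

So even a single pod's 239s window alone would not trip the limit, but the three members churning together pushes the total well past the historical allowance.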
Dropping into PromeCIeus:
etcd_server_leader_changes_seen_total{job=~".*etcd.*"}
showed a whole bunch of activity early in the run, while the static-pod controller rolled out rounds of config changes:
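For context, the alert that fired is driven by this same counter; the upstream etcd mixin's rule is roughly the following (simplified here as an approximation — the exact expression shipped in openshift-monitoring may differ, so check the live rule before relying on it):

```promql
# Approximate shape of etcdHighNumberOfLeaderChanges: fire when a member
# observes 4 or more leader changes within a 15-minute window.
increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) >= 4
```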
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1511407039413227520/artifacts/e2e-aws-serial/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-etcd-operator" and .reason == "RevisionTriggered")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .reason + ": " + .message'
2022-04-05T18:29:29Z RevisionTriggered: new revision 1 triggered by "configmap \"etcd-pod\" not found"
2022-04-05T18:30:26Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod\" not found"
2022-04-05T18:30:30Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod-1\" not found"
2022-04-05T18:30:46Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod-1\" not found"
2022-04-05T18:32:10Z RevisionTriggered: new revision 3 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:32:23Z RevisionTriggered: new revision 4 triggered by "configmap/etcd-pod has changed"
2022-04-05T18:33:45Z RevisionTriggered: new revision 5 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:33:56Z RevisionTriggered: new revision 5 triggered by "configmap/etcd-pod has changed,configmap/etcd-endpoints has changed"
2022-04-05T18:34:52Z RevisionTriggered: new revision 6 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:35:02Z RevisionTriggered: new revision 6 triggered by "configmap/etcd-pod has changed,configmap/etcd-endpoints has changed"
2022-04-05T18:46:26Z RevisionTriggered: new revision 7 triggered by "configmap/etcd-pod has changed"
Maybe the static pod controller should grow a new metric for config revision, and the etcdHighNumberOfLeaderChanges expr could be updated to say "when the leader churn is higher than what I'd expect given the revision churn"?
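One possible shape for that, sketched with a hypothetical `openshift_staticpod_revision` metric (which does not exist today — both the metric name and its labels are invented for illustration):

```promql
# Hypothetical sketch: only alert when leader churn exceeds what the observed
# revision rollouts would explain — allow one leader change per new revision
# on top of the usual threshold of 4. openshift_staticpod_revision is assumed
# to expose the current static-pod config revision for the etcd operand.
increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m])
  > scalar(max(increase(openshift_staticpod_revision{namespace="openshift-etcd"}[15m]))) + 4
```

The design question is whether revision churn is a reliable proxy for expected leader churn; a rollout that restarts a follower should not cause a leader change at all, so the bound above is deliberately loose.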
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1511407039413227520