[sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel] is failing periodically in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-etcd%5D%20etcd%20leader%20changes%20are%20not%20excessive%20%5BLate%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

This is one of only 4 tests failing on 4.10 aws serial, and TRT would like this job to become reliable.
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-blocking#periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial

Sample failing run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1445319167623303168
TRT bumping to priority high after this killed an expensive aggregated job last night:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-sdn-upgrade-4.10-micro-release-openshift-release-analysis-aggregator/1463745865125466112

In there, the 5 failed jobs all had something akin to:

alert etcdHighNumberOfLeaderChanges fired for 120 seconds with labels: {endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-64dgic8n-a9175-gmh6z-master-1", service="etcd", severity="warning"}

This kicked off *really* dramatically late on Thursday, Nov 18th:
https://search.ci.openshift.org/chart?search=alert+etcdHighNumberOfLeaderChanges+fired+for&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

It's hitting on aws, azure, and gcp:
https://search.ci.openshift.org/?search=alert+etcdHighNumberOfLeaderChanges+fired+for&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
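For anyone poking at this in PromeCIeus or on a live cluster: the alert is built on etcd_server_leader_changes_seen_total. The query below is only a rough sketch for eyeballing per-member churn, not the shipped rule expression (check the etcd PrometheusRule on the cluster for the exact expr and threshold):

  # Sketch: leader changes seen per etcd member over a 15m window.
  # The shipped etcdHighNumberOfLeaderChanges rule uses a similar increase()
  # over this metric; confirm the real expression against the PrometheusRule.
  sum by (pod) (increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]))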
Focusing on https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1463745856736858112 which is one of the jobs aggregated beneath the job I mentioned in comment #5. (For those unaware, aggregated jobs = fire off 10 job runs and compare test pass rates with historical averages.)

In the event filters you can see a sea of weird etcd events:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1463745856736858112/artifacts/e2e-gcp-upgrade/gather-must-gather/artifacts/event-filter.html

Large numbers of events where a member "is unhealthy", spanning an entire hour and multiple masters. A defrag was also attempted for one member at some point.
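If anyone else wants to slice those events without clicking through the HTML, the same data should be in the raw events.json under the job's gather-extra artifacts. The path below is a guess based on how other runs lay out their artifacts (adjust to whatever this run actually produced), and the filter is deliberately loose:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade/1463745856736858112/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/events.json \
    | jq -r '[.items[] | select(((.metadata.namespace // "") | test("etcd")) and ((.message // "") | test("unhealthy"; "i")))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + (.metadata.namespace // "") + " " + (.reason // "") + ": " + (.message // "")'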
*** Bug 2053380 has been marked as a duplicate of this bug. ***
Poking at [1]:

: [sig-arch][bz-etcd][Late] Alerts alert/etcdHighNumberOfLeaderChanges should not be at or above info [Suite:openshift/conformance/parallel] 2s
{ fail [github.com/openshift/origin/test/extended/prometheus/alerts.go:24]: Apr 5 20:05:44.316: etcdHighNumberOfLeaderChanges was at or above info for at least 10m57s on platformidentification.JobType{Release:"4.11", FromRelease:"", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=6m12.6s): pending for 11m6s, firing for 10m57s:
Apr 05 18:55:09.000 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-175-63.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-175-63.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:09.000 - 179s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-185-3.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-185-3.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:09.000 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-199-213.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-199-213.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}

: [bz-etcd][invariant] alert/etcdHighNumberOfLeaderChanges should not be at or above info 0s
{ etcdHighNumberOfLeaderChanges was at or above info for at least 10m57s on platformidentification.JobType{Release:"4.11", FromRelease:"", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=6m12.6s): pending for 11m3s, firing for 10m57s:
Apr 05 18:55:08.790 - 179s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-185-3.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-185-3.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:08.790 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-175-63.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-175-63.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Apr 05 18:55:08.790 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-199-213.us-west-2.compute.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-199-213.us-west-2.compute.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}}

Dropping into PromeCIeus, etcd_server_leader_changes_seen_total{job=~".*etcd.*"} showed a whole bunch of activity early in the run, while the static-pod controller rolled out rounds of config changes:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1511407039413227520/artifacts/e2e-aws-serial/gather-extra/artifacts/events.json \
    | jq -r '[.items[] | select(.metadata.namespace == "openshift-etcd-operator" and .reason == "RevisionTriggered")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .reason + ": " + .message'
2022-04-05T18:29:29Z RevisionTriggered: new revision 1 triggered by "configmap \"etcd-pod\" not found"
2022-04-05T18:30:26Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod\" not found"
2022-04-05T18:30:30Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod-1\" not found"
2022-04-05T18:30:46Z RevisionTriggered: new revision 2 triggered by "configmap \"etcd-pod-1\" not found"
2022-04-05T18:32:10Z RevisionTriggered: new revision 3 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:32:23Z RevisionTriggered: new revision 4 triggered by "configmap/etcd-pod has changed"
2022-04-05T18:33:45Z RevisionTriggered: new revision 5 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:33:56Z RevisionTriggered: new revision 5 triggered by "configmap/etcd-pod has changed,configmap/etcd-endpoints has changed"
2022-04-05T18:34:52Z RevisionTriggered: new revision 6 triggered by "configmap/etcd-endpoints has changed"
2022-04-05T18:35:02Z RevisionTriggered: new revision 6 triggered by "configmap/etcd-pod has changed,configmap/etcd-endpoints has changed"
2022-04-05T18:46:26Z RevisionTriggered: new revision 7 triggered by "configmap/etcd-pod has changed"

Maybe the static pod controller should grow a new metric for config revision, and the etcdHighNumberOfLeaderChanges expr could be updated to say "when the leader churn is higher than what I'd expect given the revision churn"?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1511407039413227520
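To make that last suggestion a bit more concrete, here is a purely hypothetical sketch of the shape I mean. static_pod_revision is an invented metric name (nothing exposes it today; it's the metric the static-pod controller would have to grow), and the window/threshold are placeholders rather than a proposal for the real rule:

  # Hypothetical sketch only: static_pod_revision does not exist yet.
  # Fire only when a member saw leader changes while no new etcd static-pod
  # revision was rolled out in the same window.
  (
    sum by (pod) (increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m])) > 4
  )
  and on()
  (
    max(changes(static_pod_revision{namespace="openshift-etcd"}[15m])) == 0
  )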