Bug 1825000 - [4.3 upgrade][alert] etcdHighNumberOfLeaderChanges: etcd cluster "etcd": 7.5 leader changes within the last 15 minutes.
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-16 19:50 UTC by Hongkai Liu
Modified: 2020-10-08 13:58 UTC
CC: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-14 14:47:45 UTC
Target Upstream Version:
Embargoed:



Description Hongkai Liu 2020-04-16 19:50:52 UTC
During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is nice), but those alerts and messages are frightening.

I would like to create a bug for each of those so we can feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060465461100


[FIRING:1] etcdHighNumberOfLeaderChanges etcd (openshift-monitoring/k8s warning)
etcd cluster "etcd": 7.5 leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.

must-gather after upgrade:
http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/

Comment 2 W. Trevor King 2020-04-17 17:31:33 UTC
Having a control-plane node go down during an update, where we are rolling out new machine-os-content, is an expected part of release updates.  We should not be alerting during healthy updates, and we are growing CI guards that fail clusters which alert during updates [1].

[1]: https://github.com/openshift/origin/pull/24786

Comment 3 Michal Fojtik 2020-05-18 07:29:34 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 8 Dan Mace 2020-05-20 18:32:51 UTC
We want to improve the alerting here, but we can't do it in 4.5. Moving to 4.6.

Comment 9 W. Trevor King 2020-05-21 04:54:57 UTC
Just pointing out, in response to internal discussion, that this is not update-specific.  If a new MachineConfig is rolled out and the machine-config operator decides to reboot the whole control plane, that is going to trigger the same number of leader changes as an update (with the possible exception of any special handling that is part of the 4.3 -> 4.4 update and the etcd-operator taking the wheel).  But we want "reboot the control plane" to not trigger alerts unless something abnormal is happening.  I dunno if there's any way to annotate the leader-election metrics with because-my-node-is-being-drained labels or something that the alert can ignore, so we can keep it nice and strict about because-my-disk-is-slow leader changes.
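
A rough sketch of the kind of exclusion floated above, purely hypothetical: if the leader-change counter were partitioned by a drain/maintenance label (no such label exists today; "reason" is an invented name), the alert expression could skip those increments while staying strict about everything else:

  # hypothetical: the "reason" label does not exist on this metric today
  sum by (job) (increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*", reason!="node_drain"}[15m])) >= 3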

Comment 11 Sam Batschelet 2020-06-20 11:40:26 UTC
We did not have time to address this in 4.5, but we have a review of metrics definitions planned for 4.6.

Comment 13 W. Trevor King 2020-06-25 23:06:01 UTC
In this space:

* [1], picking up [2] in master (again, see below about the 4.5+ regression).
* [3], raising the threshold to 4 to avoid tripping on a healthy rolling reboot of our 3-node clusters, although we could still trip if someone performed multiple rolling reboots (e.g. by deploying multiple MachineConfig changes, since the machine-config daemons currently reboot nodes after each config change).  We'd still need follow-up work to land this upstream mixin change in monitoring (see the sketch just below this list).
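
For [3], a sketch of the idea (not necessarily the exact upstream diff): take the expression currently shipped in release-4.4 (quoted below) and raise the comparison from >= 3 to >= 4:

  increase((max by (job) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4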

Trying to understand the initial 4.3 "7.5 leader changes within the last 15 minutes" complaint: the alert seems to have landed in [4] as a backport of [5], and the alert change in [4,5] seems to match the one I'm floating (again?) for master in [1].  We seem to have regressed in [6], when we moved to pinning etcd release-3.4 (which does not seem to receive mixin backports) instead of tracking etcd's master mixin [7].  The regression only affects 4.5+:

$ git --no-pager grep -A9 etcdHighNumberOfLeaderChanges origin/release-4.4 -- assets/prometheus-k8s/rules.yaml
origin/release-4.4:assets/prometheus-k8s/rules.yaml:    - alert: etcdHighNumberOfLeaderChanges
origin/release-4.4:assets/prometheus-k8s/rules.yaml-      annotations:
origin/release-4.4:assets/prometheus-k8s/rules.yaml-        message: 'etcd cluster "{{ $labels.job }}": {{ $value }} leader changes within
origin/release-4.4:assets/prometheus-k8s/rules.yaml-          the last 15 minutes. Frequent elections may be a sign of insufficient resources,
origin/release-4.4:assets/prometheus-k8s/rules.yaml-          high network latency, or disruptions by other components and should be investigated.'
origin/release-4.4:assets/prometheus-k8s/rules.yaml-      expr: |
origin/release-4.4:assets/prometheus-k8s/rules.yaml-        increase((max by (job) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 3
origin/release-4.4:assets/prometheus-k8s/rules.yaml-      for: 5m
origin/release-4.4:assets/prometheus-k8s/rules.yaml-      labels:
origin/release-4.4:assets/prometheus-k8s/rules.yaml-        severity: warning
$ git --no-pager grep -A7 etcdHighNumberOfLeaderChanges origin/release-4.5 -- assets/prometheus-k8s/rules.yaml
origin/release-4.5:assets/prometheus-k8s/rules.yaml:    - alert: etcdHighNumberOfLeaderChanges
origin/release-4.5:assets/prometheus-k8s/rules.yaml-      annotations:
origin/release-4.5:assets/prometheus-k8s/rules.yaml-        message: 'etcd cluster "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last 30 minutes.'
origin/release-4.5:assets/prometheus-k8s/rules.yaml-      expr: |
origin/release-4.5:assets/prometheus-k8s/rules.yaml-        rate(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) > 3
origin/release-4.5:assets/prometheus-k8s/rules.yaml-      for: 15m
origin/release-4.5:assets/prometheus-k8s/rules.yaml-      labels:
origin/release-4.5:assets/prometheus-k8s/rules.yaml-        severity: warning

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/827
[2]: https://github.com/etcd-io/etcd/pull/11448
[3]: https://github.com/etcd-io/etcd/pull/12080
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/595
[5]: https://github.com/openshift/cluster-monitoring-operator/pull/591
[6]: https://github.com/openshift/cluster-monitoring-operator/pull/760
[7]: https://github.com/openshift/cluster-monitoring-operator/pull/827#issuecomment-649848483

Comment 14 W. Trevor King 2020-06-25 23:20:08 UTC
In case it's not clear from the above wall of text, cmo#827 will undo master/4.5's cmo#760 regression, after which the master alert will look just like the 4.3 alert that opened this bug.  etcd#12080 is a move to reduce false positives, although with some caveats described in the PR.  It would still have tripped on the bug's initial "7.5 leader changes within the last 15 minutes", which is good (I dunno why we had so many elections then).  It would not have tripped on the GCP 4.4.4 -> 4.4.10 update today, where "3.2142857142857144 leader changes within the last 15 minutes" fired.  The fractional values, even with the 'increase' logic, come from Prometheus extrapolation [1].
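
As a hedged illustration of that extrapolation (made-up numbers, not taken from this cluster): if the raw counter rose by 3 but the first and last samples in the 15-minute window only span 14 minutes, increase() scales the result to 3 * 15/14 ≈ 3.21 rather than reporting the integer 3.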

[1]: https://prometheus.io/docs/prometheus/latest/querying/functions/#increase

Comment 17 Dan Mace 2020-07-14 14:47:45 UTC
This is now tracked in https://issues.redhat.com/browse/ETCD-96.

