Bug 1929944 - The etcdInsufficientMembers alert fires incorrectly when any instance is down and not when quorum is lost
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Clayton Coleman
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1930224 1930226 1930876
Reported: 2021-02-18 01:58 UTC by Clayton Coleman
Modified: 2021-07-27 22:47 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The alert configuration inherited from upstream etcd is not suitable for the OpenShift configuration.
Consequence: The etcdInsufficientMembers alert fires incorrectly.
Fix: Change the expression to include the pod label as well as the instance label in the query.
Result: The alert fires only when quorum is lost.
Clone Of:
Clones: 1930224
Environment:
Last Closed: 2021-07-27 22:45:14 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1064 0 None open Bug 1929944: etcdInsufficientMembers is wrong when etcd is in a pod 2021-02-19 01:52:54 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:47:27 UTC

Description Clayton Coleman 2021-02-18 01:58:30 UTC
etcdInsufficientMembers is supposed to fire when quorum is potentially lost. However, upstream expects etcd to be configured so that only the instance label is unique, whereas OpenShift runs etcd in pods, so both the instance and pod labels are unique. This difference causes the alert to fire spuriously (too eagerly) during upgrades.
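
To illustrate why the extra pod label matters, here is a minimal Python sketch that mimics a Prometheus-style "aggregate without these labels" step over `up` samples. It assumes the alert compares sum(up) against (count + 1) / 2 within each group, in the style of the upstream rule; the helper names, label values, and ports are hypothetical and are not taken from the actual rule or cluster.

```python
from collections import defaultdict


def aggregate_without(samples, drop):
    """Group samples by their labels minus `drop`; return {group: (sum_up, count)}."""
    groups = defaultdict(lambda: [0, 0])
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key][0] += value
        groups[key][1] += 1
    return {key: tuple(agg) for key, agg in groups.items()}


def firing_groups(samples, drop):
    """Return the groups where sum(up) < (count + 1) / 2 (assumed alert condition)."""
    return [
        key
        for key, (up, n) in aggregate_without(samples, drop).items()
        if up < (n + 1) / 2
    ]


# Hypothetical scrape results: three etcd members, one of them down
# (for example, restarting during an upgrade).
samples = [
    ({"job": "etcd", "instance": "10.0.0.1:9979", "pod": "etcd-master-0"}, 1),
    ({"job": "etcd", "instance": "10.0.0.2:9979", "pod": "etcd-master-1"}, 1),
    ({"job": "etcd", "instance": "10.0.0.3:9979", "pod": "etcd-master-2"}, 0),
]

# Dropping only `instance` leaves `pod` in the key, so every member is its own group.
only_instance = firing_groups(samples, drop={"instance"})
# Dropping both labels yields a single cluster-wide group.
instance_and_pod = firing_groups(samples, drop={"instance", "pod"})

print(len(only_instance), len(instance_and_pod))
```

In this sketch, dropping only the instance label makes each pod a group of size one, so a single down member satisfies the condition on its own. Dropping both labels compares 2 up against a threshold of 2 across the whole cluster, and nothing fires.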

The alert, once corrected, should only fire if we have reason to believe quorum is lost, i.e. a majority of instances are down (at least (N+1)/2 instances must be up), where N is inferred from the expected number of scrape targets. The alert should have a better description and suggest possible areas to investigate, namely down control plane nodes or broken networking.
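
The quorum arithmetic can be sketched in Python as follows. The function names are hypothetical. The sketch uses N // 2 + 1 for the majority size, which equals (N+1)/2 for the odd member counts etcd clusters normally use and rounds up for even ones.

```python
def quorum_size(expected_members: int) -> int:
    """Smallest number of members that forms a majority."""
    return expected_members // 2 + 1


def quorum_lost(members_up: int, expected_members: int) -> bool:
    """True when fewer than a majority of the expected members are up."""
    return members_up < quorum_size(expected_members)


# Tolerated failures for common cluster sizes.
for n in (1, 3, 5):
    print(f"N={n}: need {quorum_size(n)} up, tolerates {n - quorum_size(n)} down")
```

For a three-member cluster this means one member being down (as happens during a rolling upgrade) should not trigger the alert, while two members being down should.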

A runbook addition will come later.

Comment 2 Clayton Coleman 2021-02-18 14:18:31 UTC
Will be backported to 4.7 and 4.6

Comment 3 Michael Gugino 2021-02-18 15:33:20 UTC
Need to make sure we account for this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1880759

Also, I suggest renaming this alert to EtcdQuorumLost to more clearly demonstrate the impact.  InsufficientMembers doesn't have much context.

Comment 9 errata-xmlrpc 2021-07-27 22:45:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

